US20210357816A1 - System with hybrid communication strategy for large-scale distributed deep learning - Google Patents
- Publication number
- US20210357816A1 (U.S. application Ser. No. 17/386,750)
- Authority
- US
- United States
- Prior art keywords
- module
- gcs
- communication scheme
- modules
- operator graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F02—COMBUSTION ENGINES; HOT-GAS OR COMBUSTION-PRODUCT ENGINE PLANTS
- F02C—GAS-TURBINE PLANTS; AIR INTAKES FOR JET-PROPULSION PLANTS; CONTROLLING FUEL SUPPLY IN AIR-BREATHING JET-PROPULSION PLANTS
- F02C7/00—Features, components parts, details or accessories, not provided for in, or of interest apart from, groups F02C1/00 - F02C6/00; Air intakes for jet-propulsion plants
- F02C7/04—Air intakes for gas-turbine plants or jet-propulsion plants
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1605—Handling requests for interconnection or transfer for access to memory bus based on arbitration
- G06F13/161—Handling requests for interconnection or transfer for access to memory bus based on arbitration with latency improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1668—Details of memory controller
- G06F13/1689—Synchronisation and timing concerns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G06N3/0454—
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F05—INDEXING SCHEMES RELATING TO ENGINES OR PUMPS IN VARIOUS SUBCLASSES OF CLASSES F01-F04
- F05D—INDEXING SCHEME FOR ASPECTS RELATING TO NON-POSITIVE-DISPLACEMENT MACHINES OR ENGINES, GAS-TURBINES OR JET-PROPULSION PLANTS
- F05D2230/00—Manufacture
- F05D2230/50—Building or constructing in particular ways
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F05—INDEXING SCHEMES RELATING TO ENGINES OR PUMPS IN VARIOUS SUBCLASSES OF CLASSES F01-F04
- F05D—INDEXING SCHEME FOR ASPECTS RELATING TO NON-POSITIVE-DISPLACEMENT MACHINES OR ENGINES, GAS-TURBINES OR JET-PROPULSION PLANTS
- F05D2230/00—Manufacture
- F05D2230/70—Disassembly methods
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F05—INDEXING SCHEMES RELATING TO ENGINES OR PUMPS IN VARIOUS SUBCLASSES OF CLASSES F01-F04
- F05D—INDEXING SCHEME FOR ASPECTS RELATING TO NON-POSITIVE-DISPLACEMENT MACHINES OR ENGINES, GAS-TURBINES OR JET-PROPULSION PLANTS
- F05D2230/00—Manufacture
- F05D2230/72—Maintenance
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F05—INDEXING SCHEMES RELATING TO ENGINES OR PUMPS IN VARIOUS SUBCLASSES OF CLASSES F01-F04
- F05D—INDEXING SCHEME FOR ASPECTS RELATING TO NON-POSITIVE-DISPLACEMENT MACHINES OR ENGINES, GAS-TURBINES OR JET-PROPULSION PLANTS
- F05D2230/00—Manufacture
- F05D2230/80—Repairing, retrofitting or upgrading methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/142—Network analysis or design using statistical or mathematical methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Abstract
A computer in a distributed computing system is disclosed. The computer includes: a graphics processing unit (GPU) memory; a central processing unit (CPU) memory comprising a Key-Value Store (KVS) module; an execution engine module configured to run a deep learning (DL) program to create a plurality of operator graph layers in the GPU memory; a client library module configured to create a GPU-CPU synchronization (GCS) module for each of the plurality of operator graph layers; and a coordination service module configured to compute the network cost of a first and a second communication scheme and select, based on the network cost, one of the first and second communication schemes for transmitting data associated with one of the plurality of operator graph layers from a corresponding GCS module.
Description
- The present invention generally relates to distributed computing systems, and more particularly, is directed to a method and system of facilitating communications between multiple computers when executing a large-scale program such as a deep learning (DL) program that requires a huge amount of computational power to run efficiently.
- A distributed computing system (or "distributed system") is a model in which components located on networked computers communicate and coordinate their actions by passing messages. Distributed systems are widely used to run programs that require a large amount of computational power to execute. Such programs are referred to as "distributed programs" hereinafter. One type of such programs is machine learning (ML) programs. Machine learning allows computers to learn to perform certain tasks without being explicitly programmed. One type of advanced ML is deep learning (DL), which is based on learning data representations. DL has been used to perform a wide spectrum of tasks, including speech recognition, visual recognition, and language understanding. Typically, DL systems exhibit a high degree of model complexity, with many parameters in deeply layered structures, and usually require a large amount of computing resources to train their machine learning models. This training process involves processing a huge amount of data on hardware such as graphics processing units (GPUs). The high computational cost of DL programs on large-scale data makes these programs ideal candidates for distributed execution, using multiple computers, each with its own GPUs, communicating with each other over a network.
- The presently disclosed embodiments are directed to solving issues relating to one or more of the problems presented in the prior art, as well as providing additional features that will become readily apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings.
- One embodiment is directed to a computer in a distributed computing system including: a graphics processing unit (GPU) memory; a central processing unit (CPU) memory comprising a Key-Value Store (KVS) module; an execution engine module configured to run a deep learning (DL) program to create a plurality of operator graph layers in the GPU memory; a client library module configured to create a GPU-CPU synchronization (GCS) module for each of the plurality of operator graph layers; and a coordination service module configured to compute the network cost of a first and a second communication scheme and select, based on the network cost, one of the first and second communication schemes for transmitting data associated with one of the plurality of operator graph layers from a corresponding GCS module, wherein the client library module is further configured to initiate a data transfer from the GCS module using the selected communication scheme.
- Another embodiment is directed to a method of running a DL program including the steps of: parsing DL program code; constructing a plurality of operator graph layers in a GPU memory; creating a GCS module for each of the operator graph layers; activating a KVS module in a CPU memory; computing the network cost of a first and a second communication scheme for transmitting data; for each GCS module, selecting one of the communication schemes based on the network cost; and transmitting data from each GCS module using the selected communication scheme, wherein at least one GCS module uses the first communication scheme and at least one GCS module uses the second communication scheme.
- Further features and advantages of the present disclosure, as well as the structure and operation of various embodiments of the present disclosure, are described in detail below with reference to the accompanying drawings.
- The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The drawings are provided for purposes of illustration only and merely depict exemplary embodiments of the disclosure. These drawings are provided to facilitate the reader's understanding of the disclosure and should not be considered limiting of the breadth, scope, or applicability of the disclosure. It should be noted that for clarity and ease of illustration these drawings are not necessarily made to scale.
- FIG. 1 is a block diagram illustrating the exemplary components of multiple computers on a distributed computer system, according to embodiments of the invention;
- FIG. 2 is a flow chart illustrating the exemplary steps in the process of running a DL program on the distributed computer system of FIG. 1, according to embodiments of the invention; and
- FIG. 3 is a block diagram illustrating an exemplary computer in which embodiments of the invention can be implemented.
- The following description is presented to enable a person of ordinary skill in the art to make and use the invention. Descriptions of specific devices, techniques, and applications are provided only as examples. Various modifications to the examples described herein will be readily apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other examples and applications without departing from the spirit and scope of the invention. Thus, embodiments of the present invention are not intended to be limited to the examples described and shown herein, but are to be accorded the scope consistent with the claims.
- The word “exemplary” is used herein to mean “serving as an example or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.
- Reference will now be made in detail to aspects of the subject technology, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
- It should be understood that the specific order or hierarchy of steps in the processes disclosed herein is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
- Current systems for executing DL programs either do not support distributed execution across multiple computers or, even when they do, offer poor performance due to the cost of synchronizing model parameter updates between the multiple computers in the distributed system (or on the distributed network). In particular, the high computational throughput of the GPUs now commonly used to run DL programs allows more data to be processed per minute, leading to a greater need to synchronize information across all the computers on the network. This need grows with every new computer added to the distributed network. In the worst-case scenario, the DL program executes with no improvement, or even a decrease, in speed despite having more computers in the distributed system. Thus, a solution is needed to improve synchronization among the computers of a distributed system.
- Described herein is a system with a hybrid communication strategy for synchronizing information across multiple computers when executing a resource-intensive program such as a DL program. In one embodiment, the inventive system provides (1) a DL execution engine that executes the DL program code on distributed computing devices and, while executing the DL program code, computes model parameter updates that are applied to the mathematical model of the DL program, and (2) a coordination service module that relies on a hybrid communication strategy to exchange model parameter updates between any two computers in the distributed system. The hybrid communication strategy provides at least two distinct communication strategies for transmitting program data between computers during the execution of the DL program. Typically, the more efficient communication strategy can be selected based, for example, on the number of computers in the distributed system or the matrix dimensions associated with a particular pair of operator graph layers. Different communication strategies can be selected for synchronizing data associated with different pairs of operator graph layers. Specific embodiments of the distributed system and the hybrid communication strategy for a DL program are discussed in detail below with reference to
FIGS. 1-3. -
FIG. 1 illustrates multiple computers 102, 104, 106 in a distributed computing system 100 and the exemplary components of these computers. Specifically, as shown, the distributed computing system 100 can be a distributed DL system. It should, however, be understood that a similar system architecture can be utilized to run other types of programs that require a large amount of computing resources across multiple computers. - As illustrated in
FIG. 1, the left side of the dotted line 103 shows the exemplary components of a first computer 102 in the distributed DL system 100. Second and third computers 104, 106 appear on the right side of the dotted line 103. The illustration of the second and third computers 104, 106 is simplified; the second and third computers 104, 106 can include the same components/modules as the first computer 102. Many of these components/modules are omitted in FIG. 1 for clarity purposes. In some embodiments, all computers in the distributed system can be identical. - The
first computer 102 can include an execution engine module 110 that can run programs such as a DL program 112 on input data 114 made available to the program. For example, the execution engine module 110 can parse the DL program code 112 into one or more mathematical operator graphs, which are data structure representations of the mathematical loss function described by the DL program 112. Specifically, the execution engine module 110 can perform automatic differentiation of a loss function (represented as an operator graph) to produce a first derivative of the loss function (also represented as an operator graph). When executing the DL program 112 on each computer 102, 104, 106, the execution engine module 110 can read input data 114 one datum at a time and populate the loss function and first derivative operator graphs with appropriate values derived from the current input datum. In one embodiment, this can be done according to the backpropagation algorithm. This process is usually referred to as "computing the operator graphs". The final output of this computation can be a collection of evaluations and gradients (first derivatives) for each of the model parameters in the DL program with respect to the input datum.
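The per-datum computation described above can be sketched in miniature. The following is an illustrative sketch only, not the patented implementation: a one-parameter model whose loss and hand-derived gradient stand in for the loss function and first derivative operator graphs, and all names are assumptions.

```python
# Illustrative stand-in for "computing the operator graphs": for each
# input datum, evaluate the loss (forward graph) and its gradient with
# respect to the model parameter (first derivative graph).

def forward_loss(w, x, y):
    # Loss function "operator graph": squared error of a 1-parameter model.
    pred = w * x
    return (pred - y) ** 2

def backward_grad(w, x, y):
    # First-derivative "operator graph", obtained by differentiating the
    # loss via the chain rule: dL/dw = 2 * (w*x - y) * x.
    return 2.0 * (w * x - y) * x

# Read the input data one datum at a time and populate both graphs.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]
w = 0.5
evaluations, gradients = [], []
for x, y in data:
    evaluations.append(forward_loss(w, x, y))
    gradients.append(backward_grad(w, x, y))
```

The final output mirrors the description: a collection of per-datum loss evaluations and parameter gradients.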
- The first computer 102 can also include two different types of memories: graphics processing unit (GPU) memory 130 and central processing unit (CPU) memory 140. Each memory stores the types of data to be processed by the corresponding processing unit. The execution engine module 110 can communicate with both the GPU memory 130 and the CPU memory 140 through a client library module 116, and can allocate memory space from the CPU memory 140 and the GPU memory 130 on the computer 102. The GPU memory 130 can be used to store, for example, the loss function operator graph and the first derivative operator graph representing the mathematical loss functions described by the DL program 112. As will be discussed below, the operator graphs can be replicated across every computer 102, 104, 106 in the system 100. In this embodiment, the construction of the operator graph layers can happen simultaneously across all the computers 102, 104, 106 in the system 100 when the system starts the DL program. Because the DL program can specify multi-layered mathematical models, the two operator graphs can be represented as a stack of operator graph layers 132, 134, 136, where each layer contains both model parameters and intermediate values required by the DL program.
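One plausible in-memory layout for such a stack of operator graph layers, each holding model parameters and intermediate values, might look as follows; the field names and layer names are illustrative assumptions, not taken from the patent.

```python
# Hypothetical layout for the stack of operator graph layers
# (132, 134, 136); each layer carries both model parameters and the
# intermediate values produced while processing the current datum.
from dataclasses import dataclass, field

@dataclass
class OperatorGraphLayer:
    name: str
    parameters: list = field(default_factory=list)     # model parameters
    intermediates: list = field(default_factory=list)  # per-datum values

# Replicated on every computer when the DL program starts.
operator_graph = [
    OperatorGraphLayer("conv1"),
    OperatorGraphLayer("fc1"),
    OperatorGraphLayer("softmax"),
]
```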
- The client library module 116 can provide an interface between the execution engine module 110 and the other modules (e.g., GPU memory 130 and CPU memory 140) in the first computer 102. The client library module 116 can also create a GPU-CPU Synchronization (GCS) module 122, 124, 126 for each operator graph layer 132, 134, 136 in the GPU memory 130. The GCS modules 122, 124, 126 can be replicated on every computer in the system 100. The GCS modules manage the transfer of their respective layers' model parameters and intermediate values to the other computers. - After the
client library module 116 creates the GCS modules 122, 124, 126, a Key-Value Store (KVS) module 142 can be activated in the CPU memory 140. The KVS module 142 can provide one channel of data synchronization across two computers in the distributed system 100 when one specific synchronization strategy is selected. In addition, the KVS module 142 can provide a Distributed Shared Memory interface (not shown in FIG. 1) to access certain module layer parameters and intermediate values (as determined by the hybrid communication strategy of the coordination service module 101). The KVS module 142 can be spread over all computers 102, 104, 106 in the distributed system 100, so that the GCS modules on one computer can exchange data with their replicas on another computer by way of the KVS modules; in this scheme, a KVS module acts as an intermediary between a sending GCS module and a receiving GCS module.
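As a rough illustration of a key-value store "spread over all computers," the sketch below shards keys across per-computer dictionaries by hashing; the sharding rule and class layout are assumptions for illustration, not the patented design.

```python
# Illustrative sketch of a Key-Value Store spread over all computers:
# each key (e.g., a layer name) is owned by one computer, chosen here by
# hashing. In a real system each shard would live in one computer's CPU
# memory; here they are plain dictionaries in one process.

class DistributedKVS:
    def __init__(self, num_computers):
        self.shards = [{} for _ in range(num_computers)]

    def _owner(self, key):
        # Deterministically (within one process) map a key to the
        # computer that stores it.
        return hash(key) % len(self.shards)

    def put(self, key, value):
        self.shards[self._owner(key)][key] = value

    def get(self, key):
        return self.shards[self._owner(key)][key]
```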
- Each GCS module 122, 124, 126 can transmit its layer parameters and intermediate values either to the KVS module 142 or directly to its replica GCS modules on the other computers 104, 106, depending on the communication scheme selected. - The
first computer 102 can further include a coordination service module 101. When the DL program is started, the coordination service module 101 can collect information about the operating environment including, for example, cluster information (such as the number of computers, the number of GPUs per computer, and their network addresses) and the configuration of the operator graphs (e.g., number of layers, type of layers, number of neurons per layer, connectivity pattern between layers, etc.). Using the collected information, the coordination service module 101 can set up a hybrid communication strategy for synchronizing data across computers. In one example, the hybrid communication strategy can include broadcasting data directly from GCS modules 124 on one computer 102 to the corresponding GCS modules on other computers 104, 106 (GCS-to-GCS broadcast) to synchronize some operator graph layers across computers, and transmitting the data of other operator graph layers through the KVS module 142 on the same computer 102 to the corresponding GCS modules 154, 164 on the other computers 104, 106 (KVS-to-GCS communication). - For each
operator graph layer 132, 134, 136 and its corresponding GCS module 122, 124, 126, the coordination service module 101 can use a formula to calculate the network cost of each of the two transmission schemes: (A) transmitting the layer parameters and intermediate values of a GCS module 122 to the KVS module 142 and on to the GCS modules 156, 166 on the other computers 104, 106 (KVS-to-GCS), and (B) broadcasting the layer parameters and intermediate values of the GCS module 124 to all other replica GCS modules 154, 164 representing the same layer in the other computers 104, 106 (GCS-to-GCS broadcast). As an example, one formula to calculate the network cost for transmission scheme (A) can be as follows: assume P is the number of worker machines and M and N are the matrix dimensions (column and row, respectively) of the operator graph layer; the communication cost can then be estimated as the product of P, M, and N (i.e., P·M·N). To calculate the network cost for transmission scheme (B), the formula can be P^2·B·(M+N), where B is the batch size, i.e., the number of data samples (images, table rows, sentences, etc.) processed per network communication attempt. Typically, B is an integer that is at least 1. - The
coordination service module 101 determines the less costly alternative. In the case of (A) being the less costly alternative, the coordination service module 101 configures the GCS module 122 to communicate with the KVS module 142. In contrast, in the case of (B) being the less costly alternative, the coordination service module 101 configures the GCS module 124 to communicate via broadcast directly to all its replica GCS modules 154, 164 on the other computers 104, 106.
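The two cost formulas above translate directly into a small selection routine. This is a sketch of the comparison only; the function names and the tie-breaking rule (preferring KVS-to-GCS on equal cost) are assumptions.

```python
# Sketch of the per-layer cost comparison: scheme (A) routes through the
# KVS module at estimated cost P*M*N; scheme (B) broadcasts GCS-to-GCS
# at estimated cost P^2 * B * (M+N).

def kvs_to_gcs_cost(p, m, n):
    # Scheme (A): P workers times the M-by-N layer matrix.
    return p * m * n

def gcs_broadcast_cost(p, m, n, b):
    # Scheme (B): pairwise broadcast (P^2) of batch-sized activations.
    return p ** 2 * b * (m + n)

def select_scheme(p, m, n, b):
    """Return the less costly scheme for one operator graph layer."""
    if kvs_to_gcs_cost(p, m, n) <= gcs_broadcast_cost(p, m, n, b):
        return "KVS-to-GCS"
    return "GCS-to-GCS"
```

Intuitively, a layer with a large parameter matrix (large M·N) tends to favor the broadcast scheme, while a cluster with many workers (the P^2 term) tends to favor routing through the KVS.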
- The
client library module 116 can include a distributed directory including the addresses (e.g., IP addresses) of other computers in the distributedsystem 101. Theclient library module 116 can set up network send and receive ports for theKVS module 142 and theGCS modules client library module 116 can also manages communications between theCGS modules different computers coordination service module 101. When theexecution engine module 110 is processing an input datum with theDL program 112, the computation proceeds sequentially across the layers of the loss function operator graph, followed by the layers of the first derivative operator graph. As soon as the computation for a given layer is completed, thecoordination service module 101 can trigger the associatedCGS module 124 on thefirst computer 102 to begin communication with thecorresponding CGS modules computers coordination service module 101 to be the less costly synchronization strategy, theclient library module 116 can facilitate data exchange through theKVS module 142 as soon as the computation for a given layer is completed. - Although
FIG. 1 only illustrates a KVS module 152 and multiple GCS modules 154, 156, 164, 166 in the second and third computers 104, 106, it should be understood that the second and third computers 104, 106 can include the same components/modules as the first computer 102. For example, each of the computers 104, 106 can include an execution engine module, a client library module, a GPU memory, and a CPU memory similar to those of the computer 102. Each computer 104, 106 can also include its own KVS module 152. The coordination service module in each computer 104, 106 can coordinate the communications of its respective computer with the other computers in the distributed system. In other words, the computers 104, 106 can mirror the computer 102 not only in their internal module structures, but also in how their modules operate and communicate with the other computers on the distributed network. -
FIG. 2 illustrates the exemplary steps in running a DL program over a distributed system including multiple computers, such as the system 100 of FIG. 1. It should be understood that the process of running the program may include other steps not shown in the flow chart of FIG. 2. Prior to starting the DL program on the distributed system, the DL program code is loaded onto each computer. In response to a command to start the DL program, the execution engine module on each computer parses the DL program code (step 201). Then, the execution engine modules can construct loss function and first derivative operator graphs to be stored in the GPU memories of their respective computers (step 202). The operator graphs can be stored in the GPU memories as operator graph layers, as discussed above with reference to FIG. 1. The client library module on each computer then creates a GCS module for each layer in the operator graphs (step 203). Thereafter, the KVS module on each computer can be initialized (step 204). - The coordination service module on each computer can then compute the network cost of each GCS module under the two different communication schemes discussed above with reference to
FIG. 1 (step 205). Specifically, one of the schemes has a GCS module on one computer communicate with a GCS module on a second computer by using a KVS module as an intermediary. The other scheme has the GCS modules broadcast directly to the other GCS modules on other computers. The client library modules can then set up network communication (e.g., send/receive) ports for the GCS modules and the KVS modules to communicate with other computers on the distributed network (step 206). - A determination can then be made regarding whether the DL program has completed (step 207). If the DL program has completed, the program's model parameters are output (step 208). If the DL program has not completed, the execution engine module on each computer can read the next input datum (step 209) and populate the two operator graphs' model parameters and intermediate values according to the input datum (step 210). The client library module then triggers each GCS module to begin information exchange over the network using one of the communication schemes as decided by the coordination service module (step 211). In one embodiment, this can take place as soon as the information to be communicated across the network is made available to the execution engine module. This information is then transmitted from either a GCS module or a KVS module on one computer to another computer over the distributed network via the selected communication scheme (step 212). The execution engine module on each computer can then calculate the parameter updates from the information received from other computers and apply them to its operator graphs' model parameters (step 213). Once the parameters are updated, the computer can check whether the DL program has completed (step 207) and repeat steps 209-213 if the program is still running.
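The loop of steps 207-213 can be sketched end-to-end with simple stand-ins. Here the per-layer network exchange is modeled as a plain gradient average across workers, and the one-parameter model and all names are illustrative assumptions, not the patented method.

```python
# Runnable sketch of the FIG. 2 run loop (steps 207-213): read a datum,
# compute per-worker gradients, exchange/synchronize them, and apply the
# update on every worker before checking for completion.

def run_dl_program(workers, data, lr=0.1):
    # Step 202 analogue: each worker holds a replica of the parameter w.
    params = [0.0] * workers
    for x, y in data:                     # steps 207/209: next input datum
        # Step 210: populate the graphs, i.e., compute per-worker gradients
        # of the squared-error loss (w*x - y)^2.
        grads = [2.0 * (w * x - y) * x for w in params]
        # Steps 211-212: exchange over the selected scheme, modeled here
        # as an all-reduce average of the gradients.
        avg_grad = sum(grads) / workers
        # Step 213: apply the synchronized update on every worker.
        params = [w - lr * avg_grad for w in params]
    return params                          # step 208: output the parameters
```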
- In other embodiments where the system is designed to run programs other than DL programs, the
execution engine module 110 can execute other program code using other input data. -
FIG. 3 illustrates the exemplary components of a computer 10, which can be any of the computers of the DL system 100 of FIG. 1. The computer 10 can include a central processing unit (CPU) 11, memory 12 storing one or more applications 17, an input unit 13, a display unit 14, and a network interface 15, all connected to a bus 16. The network interface 15 allows the computer to connect to a network 20. In a computer such as the ones shown in FIG. 1, the one or more illustrated modules can be stored in memory 12. Memory 12 can include both a GPU memory and a CPU memory. The input unit 13 can receive user input or data. The network interface 15 allows the computer to communicate with one or more of the other computers on the network. Such communication may employ one of the two schemes in the above-disclosed hybrid communication strategy. - While various embodiments of the invention have been described above, it should be understood that they have been presented by way of example only, and not by way of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosure, which is done to aid in understanding the features and functionality that can be included in the disclosure. The disclosure is not restricted to the illustrated example architectures or configurations, but can be implemented using a variety of alternative architectures and configurations. Additionally, although the disclosure is described above in terms of various exemplary embodiments and implementations, it should be understood that the various features and functionality described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment with which they are described. They can instead be applied, alone or in some combination, to one or more of the other embodiments of the disclosure, whether or not such embodiments are described, and whether or not such features are presented as being part of a described embodiment.
Thus the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments.
- In this document, the term “module,” as used herein, refers to software, firmware, hardware, and any combination of these elements for performing the associated functions described herein. Additionally, for purposes of discussion, the various modules are described as discrete modules; however, as would be apparent to one of ordinary skill in the art, two or more modules may be combined to form a single module that performs the associated functions according to embodiments of the invention.
- In this document, the terms “computer program product,” “computer-readable medium,” and the like may be used generally to refer to media such as memory storage devices or storage units. These, and other forms of computer-readable media, may be involved in storing one or more instructions for use by a processor to cause the processor to perform specified operations. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system to perform the specified operations.
- It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processors or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.
- Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open-ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” and “known,” and terms of similar meaning, should not be construed as limiting the item described to a given time period or to an item available as of a given time. Instead, these terms should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. Likewise, a group of items linked with the conjunction “and” should not be read as requiring that each and every one of those items be present in the grouping, but rather should be read as “and/or” unless expressly stated otherwise. Similarly, a group of items linked with the conjunction “or” should not be read as requiring mutual exclusivity among that group, but rather should also be read as “and/or” unless expressly stated otherwise. Furthermore, although items, elements, or components of the disclosure may be described or claimed in the singular, the plural is contemplated to be within the scope thereof unless limitation to the singular is explicitly stated. The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to,” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.
- Additionally, memory or other storage, as well as communication components, may be employed in embodiments of the invention. It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processing logic elements or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processing logic elements or controllers may be performed by the same processing logic element or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.
- Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by, for example, a single unit or processing logic element. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined. The inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather the feature may be equally applicable to other claim categories, as appropriate.
Claims (20)
1. A distributed computing system comprising a computer comprising:
a graphics processing unit (GPU) memory;
a central processing unit (CPU) memory comprising a Key-Value Store (KVS) module;
an execution engine module configured to run a deep learning (DL) program to create a plurality of operator graph layers in the graphics processing unit memory;
a client library module configured to create a GPU-CPU synchronization (GCS) module for each of the plurality of operator graph layers;
a coordination service module configured to compute a network cost of each of a first and a second communication scheme and to select, based on the network cost, one of the first and second communication schemes for transmitting data associated with one of the plurality of operator graph layers from a corresponding GCS module; and
wherein the client library module is further configured to initiate a data transfer from the GCS module using the selected communication scheme.
2. The system of claim 1 , wherein the first communication scheme comprises broadcasting data associated with the one of the plurality of operator graph layers from the corresponding GCS module to one or more GCS modules directly.
3. The system of claim 2 , wherein the network cost associated with the first communication scheme can be computed as P²B(M+N), wherein P is a number of computers in the distributed system, B is a batch size, and M and N are dimensions of a matrix associated with the operator graph layer.
4. The system of claim 1 , wherein the second communication scheme comprises using the KVS module as an intermediary to transmit data from one GCS module to another GCS module.
5. The system of claim 4 , wherein the network cost associated with the second communication scheme can be computed as PMN, wherein P is a number of computers in the distributed system, and M and N are dimensions of a matrix associated with the operator graph layer.
6. The system of claim 1 , wherein the client library module is further configured to create send and receive ports for each of the plurality of GCS modules.
7. The system of claim 1 , wherein the execution engine module running the DL program comprises populating two operator graphs' model parameters and intermediate values according to an input datum.
8. The system of claim 7 , wherein the execution engine module is configured to populate the model parameters and intermediate values according to a back-propagation algorithm.
9. The system of claim 1 , wherein at least one of the GCS modules is in communication with the KVS module.
10. The system of claim 1 , wherein at least one of the GCS modules is configured to receive data from another GCS module directly.
11. The system of claim 1 , wherein at least one of the GCS modules is configured to receive data from a KVS module.
12. A method of running a Deep Learning (DL) program comprising:
parsing DL program code;
constructing a plurality of operator graph layers in a GPU memory;
creating a GCS module for each of the operator graph layers;
activating a KVS module in a CPU memory;
computing the network cost of first and second communication schemes for transmitting data;
for each GCS module, selecting one of the communication schemes based on the network cost; and
transmitting data from each GCS module using the selected communication scheme;
wherein at least one GCS module uses the first communication scheme and at least one GCS module uses the second communication scheme.
13. The method of claim 12 , wherein transmitting data using the first communication scheme comprises broadcasting data associated with the one of the plurality of operator graph layers from the corresponding GCS module to one or more other GCS modules directly.
14. The method of claim 13 , wherein the network cost associated with the first communication scheme is computed as P²B(M+N), wherein P is a number of computers in the distributed system, B is a batch size, and M and N are dimensions of a matrix associated with the operator graph layer.
15. The method of claim 12 , wherein transmitting data using the second communication scheme comprises using the KVS module as an intermediary to transmit data from one GCS module to another GCS module.
16. The method of claim 15 , wherein the network cost associated with the second communication scheme is computed as PMN, wherein P is a number of computers in the distributed system, and M and N are dimensions of a matrix associated with the operator graph layer.
17. The method of claim 12 , further comprising creating send and receive ports for each of the plurality of GCS modules.
18. The method of claim 12 , wherein parsing the DL code comprises populating two operator graphs' model parameters and intermediate values according to an input datum.
19. The method of claim 12 , further comprising at least one of the GCS modules receiving data from another GCS module directly.
20. The method of claim 12 , further comprising at least one of the GCS modules receiving data from a KVS module.
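Setting the two claimed costs equal yields a simple crossover condition. This is an editorial observation derived from the formulas in claims 14 and 16, not part of the claims themselves:

```latex
% The first (direct broadcast) scheme is cheaper than the second
% (KVS relay) scheme when
P^{2} B (M + N) < P M N .
% Dividing both sides by P > 0:
P B (M + N) < M N
\quad\Longleftrightarrow\quad
B < \frac{M N}{P\,(M + N)} .
```

In other words, direct broadcast wins for layers whose matrices are large relative to the product of batch size and machine count, while the KVS intermediary wins for small layers or large clusters, which is why a per-layer hybrid selection can outperform either scheme alone.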
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/386,750 US20210357816A1 (en) | 2017-05-10 | 2021-07-28 | System with hybrid communication strategy for large-scale distributed deep learning |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762504473P | 2017-05-10 | 2017-05-10 | |
US15/814,394 US11106998B2 (en) | 2017-05-10 | 2017-11-16 | System with hybrid communication strategy for large-scale distributed deep learning |
US17/386,750 US20210357816A1 (en) | 2017-05-10 | 2021-07-28 | System with hybrid communication strategy for large-scale distributed deep learning |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/814,394 Continuation US11106998B2 (en) | 2017-05-10 | 2017-11-16 | System with hybrid communication strategy for large-scale distributed deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
US20210357816A1 true US20210357816A1 (en) | 2021-11-18 |
Family
ID=67688591
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/814,394 Active 2040-06-23 US11106998B2 (en) | 2017-05-10 | 2017-11-16 | System with hybrid communication strategy for large-scale distributed deep learning |
US17/386,750 Pending US20210357816A1 (en) | 2017-05-10 | 2021-07-28 | System with hybrid communication strategy for large-scale distributed deep learning |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/814,394 Active 2040-06-23 US11106998B2 (en) | 2017-05-10 | 2017-11-16 | System with hybrid communication strategy for large-scale distributed deep learning |
Country Status (1)
Country | Link |
---|---|
US (2) | US11106998B2 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102197247B1 (en) * | 2017-06-01 | 2020-12-31 | 한국전자통신연구원 | Parameter server and method for sharing distributed deep learning parameter using the same |
WO2019209154A1 (en) * | 2018-04-27 | 2019-10-31 | Sony Mobile Communications Ab | Mechanism for machine learning in distributed computing |
CN113449842A (en) * | 2020-03-27 | 2021-09-28 | 华为技术有限公司 | Distributed automatic differentiation method and related device |
US11848980B2 (en) * | 2020-07-09 | 2023-12-19 | Boray Data Technology Co. Ltd. | Distributed pipeline configuration in a distributed computing system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160065483A1 (en) * | 2014-09-03 | 2016-03-03 | Fujitsu Limited | Communication system, control apparatus, and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11488008B2 (en) * | 2017-05-05 | 2022-11-01 | Intel Corporation | Hardware implemented point to point communication primitives for machine learning |
Also Published As
Publication number | Publication date |
---|---|
US11106998B2 (en) | 2021-08-31 |
US20180330276A1 (en) | 2018-11-15 |
US20190266515A9 (en) | 2019-08-29 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED