CN111210020A - Method and system for accelerating distributed machine learning - Google Patents

Method and system for accelerating distributed machine learning

Info

Publication number
CN111210020A
Authority
CN
China
Prior art keywords
machine learning
priority
parameters
distributed machine
connections
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911155664.4A
Other languages
Chinese (zh)
Other versions
CN111210020B (en)
Inventor
李丹 (Li Dan)
王帅 (Wang Shuai)
耿金坤 (Geng Jinkun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911155664.4A priority Critical patent/CN111210020B/en
Publication of CN111210020A publication Critical patent/CN111210020A/en
Application granted granted Critical
Publication of CN111210020B publication Critical patent/CN111210020B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2425 Traffic characterised by specific attributes, e.g. priority or QoS for supporting services specification, e.g. SLA
    • H04L47/2433 Allocation of priorities to traffic types
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/50 Queue scheduling
    • H04L47/62 Queue scheduling characterised by scheduling criteria
    • H04L47/6215 Individual queue per QOS, rate or priority
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/50 Queue scheduling
    • H04L47/62 Queue scheduling characterised by scheduling criteria
    • H04L47/625 Queue scheduling characterised by scheduling criteria for service slots or service orders
    • H04L47/6275 Queue scheduling characterised by scheduling criteria for service slots or service orders based on priority
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • H04L67/141 Setup of application sessions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/48 Indexing scheme relating to G06F9/48
    • G06F2209/484 Precedence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5021 Priority

Abstract

Embodiments of the invention provide a method and a system for accelerating distributed machine learning. The method comprises: inputting a machine learning model into a distributed machine learning system, establishing a plurality of connections between any two nodes in the distributed machine learning system, and assigning different priorities to the connections; and distributing the parameters of the machine learning model across the plurality of connections for transmission, so that urgent parameters are transmitted as early as possible over high-priority connections, thereby accelerating distributed machine learning training. The embodiments of the invention take into account the overlap of communication and computation during forward computation: by transmitting urgent parameters first, they reduce the randomness of the parameter transmission order under existing machine learning frameworks, overlap communication with computation in the forward pass to hide communication overhead, and coordinate the communication of different nodes through network-level flow scheduling, realizing distributed communication scheduling and further accelerating the training process.

Description

Method and system for accelerating distributed machine learning
Technical Field
The invention relates to the technical field of computers, in particular to a method and a system for accelerating distributed machine learning.
Background
As the amount of data grows, distributed training based on data parallelism has become a widely adopted approach to accelerating machine learning in industry. In data parallelism, different compute nodes hold replicas of the same model; each compute node uses different training data to compute model updates, and the model updates are then aggregated across all compute nodes. After model aggregation is completed, a new round of computation begins. As model sizes increase (for example, the BERT model has up to 300 million parameters), the communication overhead introduced by model aggregation has gradually become a major factor limiting the performance of distributed machine learning.
Machine learning computation proceeds layer by layer and can be divided into two stages, forward computation and backward propagation: forward computation is responsible for computing the model loss, and backward propagation is responsible for computing the model updates. To reduce the impact of communication on distributed machine learning performance, existing schemes generally adopt "wait-free back propagation": during backward propagation, the transmission of later layers' model updates is overlapped with the computation of earlier layers' model updates, so as to hide the overhead of transmitting model updates as much as possible.
However, existing schemes do not consider overlapping communication with computation during forward computation, and the transmission order of parameters is random, which greatly increases the iteration time and results in low machine learning efficiency.
Disclosure of Invention
In order to solve the above problem, embodiments of the present invention provide a method and a system for accelerating distributed machine learning.
In a first aspect, an embodiment of the present invention provides a method for accelerating distributed machine learning, including: inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
Preferably, the allocating the parameters of the machine learning model to a plurality of connections for transmission specifically includes:
setting a priority for each connection in the distributed machine learning system;
for any priority connection, acquiring a target parameter corresponding to the connection of any priority, wherein the target parameter is the machine learning model parameter required to be transmitted by the connection of any priority;
and the connection with any priority transmits the target parameters.
Preferably, the method further comprises the following steps: and taking any priority as the priority of the target parameter.
Preferably, for any two target parameters with different priorities, when the two target parameters enter the network, the target parameter with high priority is transmitted first.
Preferably, the acquiring the target parameter corresponding to the connection of any priority specifically includes:
calculating the distribution proportion of the parameters of the machine learning model among different priorities through a heuristic algorithm;
and acquiring a target parameter corresponding to the connection with any priority according to the distribution proportion.
In a second aspect, an embodiment of the present invention provides a system for accelerating distributed machine learning, including: the connection module is used for inputting a machine learning model into the distributed machine learning system, and for any two nodes in the distributed machine learning system, a plurality of connections are established between any two nodes;
and the training module is used for distributing the parameters of the machine learning model to a plurality of connections for transmission and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method for accelerating distributed machine learning according to the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for accelerating distributed machine learning according to the first aspect of the present invention.
The method and system for accelerating distributed machine learning provided by the embodiments of the invention take into account the overlap of communication and computation during forward computation. By transmitting urgent parameters first, they reduce the randomness of the parameter transmission order under existing machine learning frameworks, overlap communication with computation in the forward pass to hide communication overhead, and coordinate the communication of different nodes through network-level flow scheduling, thereby realizing distributed communication scheduling and accelerating the training process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic diagram comparing random transmission and sequential transmission of parameters;
FIG. 2 is a flow chart of a method for accelerating distributed machine learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram comparing network scheduling and end-side scheduling in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system for accelerating distributed machine learning according to an embodiment of the present invention;
FIG. 5 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Machine learning computation proceeds layer by layer and can be divided into two stages, forward computation and backward propagation: forward computation is responsible for computing the model loss, and backward propagation is responsible for computing the model updates. To reduce the impact of communication on distributed machine learning performance, existing schemes generally adopt "wait-free back propagation": during backward propagation, the transmission of later layers' model updates is overlapped with the computation of earlier layers' model updates, so as to hide the overhead of transmitting model updates as much as possible.
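For illustration only, the following Python sketch, which is not taken from the patent, shows the "wait-free back propagation" idea: a background sender pushes each layer's gradient to the parameter server as soon as it is produced, while back propagation continues on the earlier layers. The functions compute_gradient and push_to_ps are hypothetical stand-ins for the real per-layer computation and network transfer.

import queue
import threading
import time

def compute_gradient(layer):
    time.sleep(0.1)                  # stand-in for back-propagating one layer
    return f"grad_of_layer{layer}"

def push_to_ps(grad):
    time.sleep(0.1)                  # stand-in for the network transfer to the PS
    print(f"pushed {grad}")

def sender(q):
    while True:
        grad = q.get()
        if grad is None:             # sentinel: all gradients have been produced
            break
        push_to_ps(grad)

send_queue = queue.Queue()
t = threading.Thread(target=sender, args=(send_queue,))
t.start()

for layer in [4, 3, 2, 1]:           # back propagation runs from the last layer to the first
    grad = compute_gradient(layer)
    send_queue.put(grad)             # hand the gradient off immediately and keep computing

send_queue.put(None)
t.join()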
However, existing schemes do not consider overlapping communication with computation during forward computation, and the parameter transmission order is random, so the training performance of distributed machine learning still has room for improvement. Based on this, the invention aims to design a distributed machine learning acceleration scheme based on flow scheduling.
Fig. 1 is a schematic diagram comparing random and sequential transmission of parameters. As shown in Fig. 1, part (a) depicts random transmission of the machine learning model parameters and part (b) depicts sequential transmission. We assume the machine learning model has 4 layers, ordered from front to back as layer1, layer2, layer3 and layer4, i.e. 1, 2, 3 and 4 in the figure. In the figure, Pull denotes pulling parameters, Push denotes pushing gradients, Forward Pass denotes forward computation, Back-Propagation (BP) denotes backward propagation, and Aggregation denotes parameter aggregation. Network denotes the network, Worker denotes a worker node, and PS denotes the parameter server.
Referring to Fig. 1 (b), at the beginning of execution the machine learning model parameters are fetched first. Once the parameters of layer1 have reached the worker from the PS, the computation of layer1 can start; meanwhile, the transmission of the layer2 parameters proceeds in parallel with the computation of layer1. When the computation of layer1 finishes, the parameters of layer2 have also arrived at the worker, so the computation of layer2 can begin, and so on, until the computation of the last layer, layer4, is completed.
Then the BP process starts. BP runs in reverse order, from layer4 to layer1. After the gradient of layer4 has been computed, it can be pushed to the PS while the gradient of layer3 is computed in parallel. When the gradient of layer4 reaches the PS, the PS can aggregate the gradients from the different workers and update the parameters. Once all gradients have been aggregated, one iteration is finished and the next iteration can start.
Referring to Fig. 1 (a), when the parameters reach the worker in the order 3, 2, 4, 1, the worker can only start the computation of layer1 after receiving the layer1 parameters, and since the computation of layer2 depends on the result of layer1, it cannot start even though the layer2 parameters have already been received. The figure likewise shows one iteration.
Comparing Fig. 1 (a) and Fig. 1 (b), it can be seen that transmitting the parameters in order greatly shortens the iteration time and thus accelerates distributed machine learning training.
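As a rough worked example, the timing sketch below, which is not from the patent and assumes equal, hypothetical per-layer transfer and compute times over a single link, reproduces the effect shown in Fig. 1: with in-order arrival the forward pass finishes at time 5, while with the arrival order 3, 2, 4, 1 it finishes at time 8.

def forward_finish_time(arrival_order, transfer, compute):
    """Return when the forward pass finishes if parameters arrive in arrival_order."""
    t_link = 0.0
    arrived = {}                     # layer -> time its parameters are fully received
    for layer in arrival_order:
        t_link += transfer[layer]
        arrived[layer] = t_link
    t_compute = 0.0
    for layer in sorted(transfer):   # the forward pass must go layer 1, 2, 3, ...
        start = max(t_compute, arrived[layer])   # wait for both the dependency and the parameters
        t_compute = start + compute[layer]
    return t_compute

transfer = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}   # hypothetical per-layer transfer times
compute  = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}   # hypothetical per-layer compute times

print("sequential arrival:", forward_finish_time([1, 2, 3, 4], transfer, compute))  # 5.0
print("random arrival    :", forward_finish_time([3, 2, 4, 1], transfer, compute))  # 8.0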
The invention designs Geryon, a distributed machine learning training acceleration scheme based on flow scheduling. It overlaps communication with computation during forward computation to hide communication overhead, and coordinates the communication of different nodes through network-wide flow scheduling, realizing distributed communication scheduling and thereby accelerating the distributed machine learning training process.
The idea of the proposed method is as follows: the forward computation of machine learning proceeds layer by layer and depends on the parameters of each layer, so a well-designed scheduling strategy can fully overlap the communication and computation of different layers to hide communication overhead, transmitting urgent parameters first and improving training speed. Specifically, distributed machine learning training is accelerated with the following three strategies:
first, "multiple connection policy": the existing distributed machine learning system only has one connection between any two computing nodes, and Geryon allocates different machine learning model parameters to different connections for transmission by establishing multiple connections between the computing nodes in the distributed machine learning system, so that queuing time delay of emergency parameters in a single sending queue at the end side is reduced.
Fig. 2 is a flowchart of a method for accelerating distributed machine learning according to an embodiment of the present invention, as shown in fig. 2, the method includes:
s1, inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and S2, distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
First, the machine learning model is input into the distributed machine learning system. Any two nodes in the distributed machine learning system refer to a worker node and a parameter server node; that is, multiple connections are established between each worker node and each parameter server node, and different parameters of the machine learning model are assigned to different connections for transmission, reducing the queuing delay of urgent parameters in a single send queue on the end side.
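A minimal Python sketch of the multiple-connection idea is given below; it is not the patent's implementation, and the parameter server address, the DSCP values and the layer-to-connection mapping are illustrative assumptions. A worker opens several TCP connections to one parameter server and marks each connection with a different DSCP value, so that priority-aware network devices can tell the connections apart.

import socket

PS_ADDR = ("10.0.0.2", 50051)   # hypothetical parameter server endpoint
DSCP_VALUES = [46, 0]           # assumed: EF for the urgent connection, best effort for the other

def open_prioritized_connections(addr, dscp_values):
    """Open one TCP connection per DSCP value, highest priority first."""
    conns = []
    for dscp in dscp_values:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # DSCP occupies the upper six bits of the IP TOS byte.
        s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
        s.connect(addr)
        conns.append(s)
    return conns

def connection_for_layer(layer_index, split_point, conns):
    """Front layers are urgent for the next forward pass, so they use the
    high-priority connection; later layers use the low-priority one."""
    return conns[0] if layer_index < split_point else conns[1]

# Usage (requires a reachable parameter server listening on PS_ADDR):
# conns = open_prioritized_connections(PS_ADDR, DSCP_VALUES)
# layer1_conn = connection_for_layer(0, split_point=2, conns=conns)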
Specifically, the allocating the parameters of the machine learning model to multiple connections for transmission specifically includes:
setting a priority for each connection in the distributed machine learning system;
for any priority connection, acquiring a target parameter corresponding to the connection of any priority, wherein the target parameter is the machine learning model parameter required to be transmitted by the connection of any priority;
and the connection with any priority transmits the target parameters.
Second, the "multi-priority" strategy: Geryon assigns different priorities to different connections, giving higher priority to the connections carrying urgent parameters, and applies strict priority scheduling both on the end side and in the switches, which guarantees that the urgent parameters carried by high-priority connections are transmitted first throughout the network.
Geryon requires both the network card (i.e., the end side) and the switches to be configured with a strict priority scheduling policy. Geryon assigns different parameters to different connections, each connection has a priority, and this priority label is carried in every packet of the connection. The end side and the switches rely on the priority label on the packets to perform strict priority scheduling of these packets.
Specifically, the method further comprises the following steps: and taking any priority as the priority of the target parameter.
It should be noted that, since every packet carries a priority label, the priority of a packet is the priority of the connection transmitting it, so the switches can distinguish parameters of different urgency coming from different nodes and thereby implement distributed communication scheduling.
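The following sketch, again not the patent's code, illustrates strict priority scheduling on the end side: the sender always serves the highest-priority queued parameters first, so an urgent parameter never waits behind bulk parameters that were enqueued earlier.

import heapq

class StrictPrioritySender:
    def __init__(self):
        self._heap = []          # entries are (priority, seq, payload); lower number = higher priority
        self._seq = 0

    def enqueue(self, priority, payload):
        heapq.heappush(self._heap, (priority, self._seq, payload))
        self._seq += 1           # the sequence number keeps FIFO order within one priority

    def next_to_send(self):
        if not self._heap:
            return None
        _, _, payload = heapq.heappop(self._heap)
        return payload

sender = StrictPrioritySender()
sender.enqueue(1, "layer3 parameters")   # low priority
sender.enqueue(0, "layer1 parameters")   # high priority: urgent for the next forward pass
print(sender.next_to_send())             # -> "layer1 parameters"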
On the basis of the above embodiment, preferably, the distribution ratio of the machine learning model parameters among the different priorities is calculated by a heuristic algorithm, and the target parameters corresponding to a connection of any priority are obtained according to this distribution ratio.
Third, the "adaptive parameter allocation" strategy: Geryon supports any number of priorities, requiring only a minimum of two. For a given number of priorities, a heuristic algorithm computes how the parameters are split among the priorities, i.e., which parameters are assigned to the high-priority connection and which to the low-priority one, realizing an adaptive parameter allocation strategy and improving usability.
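As an illustration only, the sketch below shows one simple way such a split could be computed for two priorities; the 25% high-priority share and the prefix-based rule are assumptions, not the patent's heuristic. It assigns the front layers, which the next forward pass needs first, to the high-priority connection until a given fraction of the total parameter volume is used.

def split_by_priority(layer_sizes, high_priority_fraction=0.25):
    """layer_sizes: parameter volume per layer, ordered front to back."""
    budget = high_priority_fraction * sum(layer_sizes)
    assignment, used, switched = [], 0, False
    for size in layer_sizes:
        if not switched and used + size <= budget:
            assignment.append("high")   # still within the high-priority share
            used += size
        else:
            assignment.append("low")    # everything from here on is low priority
            switched = True
    return assignment

# Example: a 4-layer model whose later layers hold most of the parameters.
print(split_by_priority([10, 20, 120, 250], high_priority_fraction=0.25))
# -> ['high', 'high', 'low', 'low']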
On the basis of the above embodiment, preferably, for any two target parameters with different priorities, when the two target parameters enter the network, the target parameter with high priority is transmitted first.
Fig. 3 is a schematic diagram comparing network-level scheduling with end-side scheduling in an embodiment of the present invention. As shown in Fig. 3, PS0 denotes parameter server 0, PS1 denotes parameter server 1, network fabric denotes the network structure, network-level scheduling denotes the network scheduling policy, end-host-level scheduling denotes the end-side scheduling policy, low-priority parameter in PS0 indicates that the parameters maintained on PS0 are low priority, high-priority parameter in PS1 indicates that the parameters maintained on PS1 are high priority, flow direction denotes the direction of the flows, and physical link denotes a physical connection.
A distributed machine learning system contains multiple PSs and multiple workers; for clarity of illustration, only one worker and two PSs are shown. Two schemes are then compared: controlling the parameter transmission order through flow scheduling, and controlling the parameter sending order at the application layer. Network scheduling is shown on the left and end-side scheduling on the right.
Assuming the low-priority parameters are located on PS0 and the high-priority parameters on PS1, once both sets of parameters have entered the network, end-side scheduling can no longer control their transmission order, so the two compete for network resources and the transmission takes longer.
With network scheduling, it can still be guaranteed that the high-priority parameters are transmitted first: the low-priority parameters do not take up the bandwidth and are transmitted only after the high-priority parameters have been sent. Network scheduling can therefore guarantee that urgent parameters are transmitted first across the whole network.
Therefore, in the present application, when two target parameters of different priorities enter the network, the high-priority parameter is transmitted first, and the low-priority parameter is transmitted only after the high-priority parameter has been sent. In this way, network scheduling guarantees that urgent parameters are transmitted first across the whole network.
The embodiment of the invention implements the connection-based flow scheduling mechanism on top of TensorFlow's RDMA communication, using two priority queues. Common benchmark models such as AlexNet, VGG-19, ResNet-50 and YOLO are measured on a distributed machine learning cluster of 8 servers, each with a K40c GPU, and compared against standard distributed TensorFlow training and distributed TensorFlow training improved with load balancing. All schemes use the same hyperparameter settings, and model convergence is not affected.
The comparison metric is the training throughput of the different schemes, i.e. the number of images trained per unit time.
Experimental results show that, in a 10G network environment, Geryon improves training throughput by up to 4.37 times compared with standard distributed TensorFlow training, and by 1.2 times compared with distributed TensorFlow training with the load-balancing improvement.
Further, comparing against an end-side communication scheduling scheme, the performance advantage of Geryon becomes more pronounced as the number of parameter slices increases. When the number of slices of the VGG-19 model reaches 680, the training throughput of Geryon is 1.25 times that of the end-side communication scheduling scheme.
Fig. 4 is a schematic structural diagram of a system for accelerating distributed machine learning according to an embodiment of the present invention, as shown in fig. 4, the system includes a connection module 401 and a training module 402, where:
the connection module 401 is configured to input a machine learning model into a distributed machine learning system, and for any two nodes in the distributed machine learning system, establish multiple connections between the any two nodes;
the training module 402 is configured to distribute parameters of the machine learning model over multiple connections for transmission to accelerate distributed machine learning training.
First, the connection module 401 inputs the machine learning model into the distributed machine learning system and establishes multiple connections between any two nodes in the distributed machine learning system.
The training module 402 then distributes the parameters of the machine learning model over the multiple connections for transmission and gives the connections different priorities, thereby accelerating distributed machine learning training.
The system embodiment provided in the embodiments of the present invention is for implementing the above method embodiments; for the specific process and details, reference is made to the above method embodiments, which are not repeated here.
Fig. 5 is a schematic physical structure diagram of an electronic device provided in an embodiment of the present invention. As shown in Fig. 5, the electronic device may include: a processor 501, a communication interface 502, a memory 503, and a bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with each other via the bus 504. The communication interface 502 may be used for information transfer of the electronic device. The processor 501 may call logic instructions in the memory 503 to perform a method comprising:
inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
In addition, the logic instructions in the memory 503 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-described method embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method provided in the foregoing embodiments, the method including, for example:
inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of accelerating distributed machine learning, comprising:
inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
2. The method for accelerating distributed machine learning according to claim 1, wherein the allocating parameters of the machine learning model to a plurality of connections for transmission comprises:
setting a priority for each connection in the distributed machine learning system;
for any priority connection, acquiring a target parameter corresponding to the connection of any priority, wherein the target parameter is the machine learning model parameter required to be transmitted by the connection of any priority;
and the connection with any priority transmits the target parameters.
3. The method of accelerating distributed machine learning according to claim 1, further comprising: and taking any priority as the priority of the target parameter.
4. The method of accelerating distributed machine learning according to claim 3, wherein for any two target parameters with different priorities, the target parameter with high priority is transmitted first when the two target parameters enter the network.
5. The method for accelerating distributed machine learning according to claim 2, wherein the acquiring the target parameter corresponding to the connection of any priority specifically includes:
calculating the distribution proportion of the parameters of the machine learning model among different priorities through a heuristic algorithm;
and acquiring a target parameter corresponding to the connection with any priority according to the distribution proportion.
6. A system for accelerating distributed machine learning, comprising:
the connection module is used for inputting a machine learning model into the distributed machine learning system, and for any two nodes in the distributed machine learning system, a plurality of connections are established between any two nodes;
and the training module is used for distributing the parameters of the machine learning model to a plurality of connections for transmission and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method for accelerating distributed machine learning according to any one of claims 1 to 5 are implemented when the program is executed by the processor.
8. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the method of accelerating distributed machine learning according to any one of claims 1 to 5.
CN201911155664.4A 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning Active CN111210020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911155664.4A CN111210020B (en) 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911155664.4A CN111210020B (en) 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning

Publications (2)

Publication Number Publication Date
CN111210020A true CN111210020A (en) 2020-05-29
CN111210020B CN111210020B (en) 2022-12-06

Family

ID=70789167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911155664.4A Active CN111210020B (en) 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning

Country Status (1)

Country Link
CN (1) CN111210020B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112333234A (en) * 2020-09-23 2021-02-05 清华大学 Distributed machine learning training method and device, electronic equipment and storage medium
CN113485805A (en) * 2021-07-01 2021-10-08 曙光信息产业(北京)有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113537507A (en) * 2020-09-02 2021-10-22 腾讯科技(深圳)有限公司 Machine learning system, method and electronic equipment
CN114764353A (en) * 2021-01-13 2022-07-19 戴尔产品有限公司 ML to ML orchestration system and method for Information Handling System (IHS) full system optimization
CN116258197A (en) * 2023-05-16 2023-06-13 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220949A1 (en) * 2016-01-29 2017-08-03 Yahoo! Inc. Method and system for distributed deep machine learning
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
US20190019104A1 (en) * 2017-07-12 2019-01-17 Sap Se Distributed Machine Learning On Heterogeneous Data Platforms
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220949A1 (en) * 2016-01-29 2017-08-03 Yahoo! Inc. Method and system for distributed deep machine learning
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
US20190171952A1 (en) * 2016-10-31 2019-06-06 Tencent Technology (Shenzhen) Company Limited Distributed machine learning method and system
US20190019104A1 (en) * 2017-07-12 2019-01-17 Sap Se Distributed Machine Learning On Heterogeneous Data Platforms
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DONGSHENG LI et al.: "HPDL: Towards a General Framework for High-performance Distributed Deep Learning", 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS) *
JINKUN GENG et al.: "Accelerating Distributed Machine Learning by Smart Parameter Server", Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019 *
SAYED HADI HASHEMI et al.: "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling", Proceedings of Machine Learning and Systems 1 (MLSys 2019) *
LI LI et al.: "Spark-KNN Parallel Pattern Recognition Method for Leakage Current Data", Journal of System Simulation *
WANG YUEQING et al.: "DLPF: A Parallel Deep Learning Programming Framework Based on Heterogeneous Architecture", Journal of Computer Research and Development *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537507A (en) * 2020-09-02 2021-10-22 腾讯科技(深圳)有限公司 Machine learning system, method and electronic equipment
CN112333234A (en) * 2020-09-23 2021-02-05 清华大学 Distributed machine learning training method and device, electronic equipment and storage medium
CN114764353A (en) * 2021-01-13 2022-07-19 戴尔产品有限公司 ML to ML orchestration system and method for Information Handling System (IHS) full system optimization
CN114764353B (en) * 2021-01-13 2023-09-29 戴尔产品有限公司 ML to ML orchestration system and method for Information Handling System (IHS) all system optimization
CN113485805A (en) * 2021-07-01 2021-10-08 曙光信息产业(北京)有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113485805B (en) * 2021-07-01 2024-02-06 中科曙光(南京)计算技术有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN116258197A (en) * 2023-05-16 2023-06-13 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling
CN116258197B (en) * 2023-05-16 2023-09-08 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling

Also Published As

Publication number Publication date
CN111210020B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN111210020B (en) Method and system for accelerating distributed machine learning
EP3399426B1 (en) Method and device for training model in distributed system
EP3380937B1 (en) Techniques for analytics-driven hybrid concurrency control in clouds
Zhang et al. Online adaptive interference-aware VNF deployment and migration for 5G network slice
CN105511954B (en) Message processing method and device
CN109697122B (en) Task processing method, device and computer storage medium
US20160094413A1 (en) Network Resource Governance in Multi-Tenant Datacenters
CN110570075B (en) Power business edge calculation task allocation method and device
CN112346833B (en) Task processing method and processor for privacy computation and heterogeneous processing system
CN112333234B (en) Distributed machine learning training method and device, electronic equipment and storage medium
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN112148454A (en) Edge computing method supporting serial and parallel and electronic equipment
CN110780985A (en) Parallel task scheduling method and device with limited time
CN107332814A (en) A kind of request message transmission method and device
CN108028806A (en) The method and apparatus that virtual resource is distributed in network function virtualization NFV networks
CN113157465B (en) Message sending method and device based on pointer linked list
WO2022062648A1 (en) Automatic driving simulation task scheduling method and apparatus, device, and readable medium
Dong et al. TINA: A fair inter-datacenter transmission mechanism with deadline guarantee
CN113849295A (en) Model training method and device and computer readable storage medium
CN115665258B (en) Priority perception deployment method of multi-target service function chain based on deep reinforcement learning
CN114780228B (en) Hybrid cloud resource creation method and system
CN110971451A (en) NFV resource allocation method
CN114461369A (en) Adaptive data scheduling system and method for complex application scene
CN107025099B (en) Asynchronous graph calculation implementation method and system based on double-queue model
Zhang et al. RICH: Strategy-proof and efficient coflow scheduling in non-cooperative environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant