CN115314397B - Network simulation method, system, device and storage medium for distributed training - Google Patents

Network simulation method, system, device and storage medium for distributed training

Info

Publication number
CN115314397B
CN115314397B (application number CN202210937411.8A)
Authority
CN
China
Prior art keywords
current
allreduce
module
communication
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210937411.8A
Other languages
Chinese (zh)
Other versions
CN115314397A (en)
Inventor
谭光明
朱泓睿
吴长亮
李文喆
元国军
王展
安学军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Western Research Institute Of China Science And Technology Computing Technology
Original Assignee
Western Research Institute Of China Science And Technology Computing Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Western Research Institute Of China Science And Technology Computing Technology
Priority to CN202210937411.8A
Publication of CN115314397A
Application granted
Publication of CN115314397B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of distributed training, and particularly discloses a network simulation method, system, device and storage medium for distributed training. The method comprises the steps of parameter reading, topology construction, algorithm selection, data selection and construction, running the algorithm in the simulator, judging whether the simulation has ended, ending the simulation program, outputting the final simulation time, and the like. By adopting the technical scheme of the invention, the distributed training process can be simulated and the training duration can be estimated.

Description

Network simulation method, system, device and storage medium for distributed training
Technical Field
The present invention relates to the field of distributed training technologies, and in particular, to a network simulation method, system, device, and storage medium for distributed training.
Background
As deep learning algorithms have recently achieved better results than conventional machine learning algorithms on a variety of tasks, deep learning has been widely applied in fields such as image recognition, audio recognition, and natural language processing. Research interest in deep learning and related interdisciplinary fields grows year by year, and many practical results have been achieved in industry, so its range of application keeps widening.
To reduce the cost of training time, more and more practitioners are adopting distributed training. Currently, highly parallel training is mainly realized with multiple machines, each carrying multiple accelerator cards. However, when training is extended from a single node or accelerator to multiple machines, network communication becomes the main bottleneck constraining the distributed speedup. Real tests, especially ultra-large-scale tests, are expensive, and constructing and trying new schemes on real hardware is inconvenient, which hinders research into the communication problems of distributed training and their subsequent optimization. Network simulators have therefore become an important potential tool for researchers working on computer networks.
Current network simulators are mainly based on event-driven simulation and are mainly used for studying network topology, traffic, protocols, and so on. Distributed deep learning differs from traditional network research mainly in the following respects:
1. The main task of traditional network research is to study network characteristics such as bandwidth utilization, average delay, and packet loss rate, whereas the ultimate goal of simulating distributed training is to estimate the training time of the task effectively.
2. Traditional network research injects traffic into the network, whereas the communication behavior in distributed training depends strongly on the computation and communication of the previous step (including the iteration, the forward-propagation computation, and the Allreduce), so the dependencies are high.
3. Distributed training adopts collective communication and similar methods, which differ from the point-to-point communication used traditionally.
Therefore, distributed training cannot simply be simulated with the methods of a traditional network simulator; additional problems need to be designed for and solved.
Disclosure of Invention
One of the purposes of the invention is to provide a network simulation method for distributed training, which can simulate the distributed training process and simulate the training time.
In order to solve the technical problems, the application provides the following technical scheme:
a network simulation method for distributed training, comprising the steps of:
step 100, acquiring a compiled main operation program, a preset configuration file and input data, and starting the operation of the simulator;
step 110, reading parameters in a configuration file when the simulator runs;
step 120, constructing a topology, and constructing a complete topology and a routing rule according to the parameters read in the step 110;
step 130, selecting algorithms, namely selecting a plurality of algorithms as basic algorithms for running a program according to the read parameters in step 110;
step 140, data selection and construction, namely acquiring the name and batch size information of the training neural network according to the parameters in step 110, and selecting a corresponding data set file locally or generating the data set file in real time; processing the data set file to generate a data format readable by the subsequent program;
step 150, the simulator runs the algorithm, simulating the behavior of the collective communication algorithm in turn with different data sizes and times, according to the information provided by steps 120-140;
step 160, judging whether the simulation has ended, and if so, jumping to step 170; otherwise, executing step 150 in a loop;
step 170, the simulation program ends, and the final simulation time is output.
Further, in the step 110, the parameters include an application name, a topology parameter, an algorithm name, a data name, a training neural network name, and a batch size.
Further, the step 150 of running an algorithm by the simulator specifically includes:
step 211, initializing parameters required by operation;
step 212, starting to operate after the parameter initialization is finished;
step 213, sending a signal for starting Allreduce;
step 220, receiving a message to be processed, wherein the message comprises an internal signal and an external signal; selecting a next process according to the signal type, jumping to step 230 if the signal is an internal signal, and jumping to step 240 if the signal is an external signal;
step 230, starting to process the next new allreduce;
step 231, resetting step of allreduce, and increasing index of allreduce;
step 240, processing the current allreduce, and judging whether the task is completed according to the step number of the current allreduce;
step 241, when all tasks of the current step are completed, starting to execute the next step, otherwise performing no operation; if an internal signal is being processed, jump to step 261; if an external signal is being processed, jump to step 251;
step 251, judging whether the target accelerator in the external signal is identical to the current node and whether the signal belongs to the current allreduce and step;
if it is identical to the current node and belongs to the current allreduce and step, the signal is legal;
if the allreduce index or allreduce step is larger than the current value, the packet is considered to have arrived early and is put into a buffer queue for later use;
if the allreduce index or allreduce step is smaller than the current value, an error is reported directly and the whole program exits;
step 252, when the external signal is not null and is judged legal in step 251, the parameters of the external signal are parsed, the receiving table of the current step is updated, and the process goes to step 262;
step 261, if the current step needs to send communication data to other objects, executing a function of the sending process;
step 262, detecting the completion status of allreduce, if the step number of allreduce is the rated step number and the step has completed all the sending and receiving processes, determining that allreduce has been completed, and skipping to step 272; otherwise jump to step 271;
step 271, when the current step has been completed, updating the step index, and jumping to step 231;
step 272, determining whether all allreduces have ended, if so, jumping to step 273, otherwise jumping to step 274;
step 273, ending the current whole program;
step 274, when the current allreduce is completed, the next allreduce starting process is performed;
step 275, calculating the data volume and communication time of the next communication according to whether the next iteration is started and the values of the data set;
step 276, the communication time calculated in step 275 is scheduled to transmit the allreduce start signal, and the process goes to step 220.
Further, the data amount and the communication time of the next communication calculated in the step 275 are specifically:
if the next communication belongs to the same iteration, selecting the next data line and acquiring its timestamp and traffic;
the communication time is equal to max (current simulator timestamp, ideal start time of the next communication);
if it does not belong to the same iteration: the communication time is equal to max (current simulator timestamp, ideal start time of the next iteration).
The second object of the present invention is to provide a network simulation device for distributed training, comprising a host, a computing accelerator and a network card;
the host comprises an in-host routing module for providing support for communication between the internal modules of the host;
the computing accelerator comprises a message application module and a reduce module, wherein the message application module is used for receiving and sending messages; the reduce module is used for simulating the GPU to perform reduce calculation;
the network card is used for connecting the host computers.
Further, the computing accelerator further comprises a PCIe port module, an Nvlink port module and a GPU internal routing module;
the PCIe port module is an access port for connecting the computing accelerator with an external PCIe bus;
the Nvlink port module is the access port through which the computing accelerator connects to other computing accelerators;
the GPU internal routing module is connected with the message application module, the PCIe port module, the Nvlink port module, the reduce module and the like, and is used for providing support for communication among the internal modules of the computing accelerator.
Further, the network card includes:
the broadcast port module is used for realizing a broadcast protocol in the network card;
the internal port module is used for communication between the network card and other external modules;
the network card routing module is used for realizing communication among different modules in the network card;
the splitting protocol module is used for simulating the packetization flow of the TCP/UDP/IB communication protocol;
the outlet port module is used for connecting with a host outside the host to realize communication between the hosts;
and the merging protocol module is used for merging the plurality of protocol packets into original data.
It is a third object of the present invention to provide a network simulation system for distributed training, using the above-described network simulation apparatus for distributed training.
A fourth object of the present invention is to provide a storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described network simulation method for distributed training.
Compared with the prior art, the invention has the advantages that:
the invention ensures the realization process of the simulator for simulating the distributed deep learning training scene, and simultaneously ensures the dependency relationship (inter-item, allreduce and step) of the training scene transceiver, thereby being capable of simulating the distributed training process and simulating the training duration. The performance of various scenes under large-scale distributed training can be simulated, performance research and technical pre-research can be conveniently carried out by researchers, the time cost and equipment cost of real training are reduced, and performance problems and bottlenecks can be conveniently found by performance regulators.
Drawings
FIG. 1 is a flow chart of a network simulation method for distributed training according to an embodiment;
FIG. 2 is a flowchart of the algorithm operation steps in a network simulation method for distributed training according to an embodiment;
FIG. 3 is a logic block diagram of a network simulation device for distributed training according to an embodiment.
Detailed Description
The following is a further detailed description of the embodiments:
Examples
As shown in fig. 1, the network simulation method for distributed training of the present embodiment includes the following steps:
step 100, acquiring a compiled main operation program, a preset configuration file, input data and the like, and starting the operation of the whole simulator. The input data includes neural network acquisition files and the like.
Step 110, parameters in the configuration file are read when the simulator runs; the parameters include the application name, topology parameters, algorithm name, data name, the bandwidth and delay definition of each link, the name of the training neural network, the batch size, and the like.
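For concreteness, the following is a minimal sketch of what a configuration file carrying these parameters might look like and how it could be read; the file format, section name, key names and values are illustrative assumptions, since the patent does not specify a concrete format.

```python
# A minimal, illustrative configuration in the spirit of step 110.
# All key names and values below are assumptions for illustration only.
from configparser import ConfigParser
from io import StringIO

EXAMPLE_CONFIG = """
[simulation]
application = allreduce_benchmark
topology    = 2DTorus
topo_x      = 4          ; nodes along the x coordinate
topo_y      = 4          ; nodes along the y coordinate
algorithm   = Ring
dataset     = resnet50_trace
network     = ResNet50
batch_size  = 64
bandwidth   = 100Gbps    ; per-link bandwidth
latency     = 1us        ; per-link delay
"""

def read_parameters(text: str) -> dict:
    """Parse the configuration text into a flat parameter dictionary."""
    parser = ConfigParser(inline_comment_prefixes=(";",))
    parser.read_file(StringIO(text))
    return dict(parser["simulation"])

if __name__ == "__main__":
    params = read_parameters(EXAMPLE_CONFIG)
    print(params["topology"], params["algorithm"], params["batch_size"])
```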
Step 120, topology construction: the complete topology and routing rules are constructed according to the parameters read in step 110. For example, the name of the topology to be constructed is obtained first, and then specific parameters such as its size, bandwidth and delay are obtained from the topology parameters. A 2D Torus topology, for instance, needs the number of nodes along the x and y coordinates.
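As an illustration of how such a topology might be constructed, the sketch below builds the neighbour table of a 2D Torus; the node naming and link representation are assumptions, not the patent's data structures.

```python
# A sketch of step 120 for a 2D Torus: each node is identified by its (x, y)
# coordinates and linked to its four wrap-around neighbours.
def build_2d_torus(nx: int, ny: int):
    """Return {node: [neighbour, ...]} for an nx-by-ny torus."""
    links = {}
    for x in range(nx):
        for y in range(ny):
            links[(x, y)] = [
                ((x + 1) % nx, y),  # +x neighbour (wraps around)
                ((x - 1) % nx, y),  # -x neighbour
                (x, (y + 1) % ny),  # +y neighbour
                (x, (y - 1) % ny),  # -y neighbour
            ]
    return links

# Example: a 4x4 torus has 16 nodes, each with exactly 4 neighbours.
topology = build_2d_torus(4, 4)
assert len(topology) == 16 and all(len(n) == 4 for n in topology.values())
```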
Step 130, algorithm selection: the basic algorithms for running the program are selected from a plurality of algorithms according to the algorithm names in the parameters read in step 110. In this embodiment the algorithms include Ring, RecursiveDoubling and the like.
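The sketch below illustrates the kind of per-step schedule a Ring allreduce produces; this is a generic description of the Ring algorithm, not necessarily the patent's exact implementation. With p participants there are 2*(p-1) steps, and in each step every rank sends one chunk to its successor and receives one from its predecessor.

```python
# Generic Ring allreduce schedule: reduce-scatter for the first p-1 steps,
# allgather for the remaining p-1 steps; the chunk sent by rank r at step s
# is (r - s) mod p in both phases.
def ring_allreduce_steps(rank: int, p: int):
    """Yield (step, chunk_to_send, send_to, recv_from) for one rank."""
    for step in range(2 * (p - 1)):
        chunk = (rank - step) % p        # chunk index circulating around the ring
        yield step, chunk, (rank + 1) % p, (rank - 1) % p

# Example: schedule for rank 0 of 4 participants (6 steps in total).
for step, chunk, dst, src in ring_allreduce_steps(0, 4):
    print(f"step {step}: send chunk {chunk} to {dst}, receive from {src}")
```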
Step 140, data selection and construction: the name of the training neural network, the batch size (the number of samples used in a single training step) and other information are obtained from the parameters in step 110, and a corresponding data set file is selected locally or generated in real time; the data set file is then processed into a data format readable by the subsequent program, such as an Array.
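A minimal sketch of such data construction, assuming the data set file is a plain-text trace whose lines carry an ideal start timestamp and a traffic volume for each allreduce; the field layout and units are assumptions, as the patent only states that the file is processed into an Array-like structure.

```python
# Turn a hypothetical trace file into a list of (timestamp, traffic) records.
def load_dataset(path: str):
    """Return a list of (timestamp, traffic_bytes) tuples."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # skip comments and blank lines
            ts, traffic = line.split()[:2]
            records.append((float(ts), int(traffic)))
    return records
```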
Step 150, the simulator runs the algorithm, simulating the behavior of the collective communication algorithm in turn with different data sizes and times, according to the information provided by steps 120-140.
Step 160, it is judged whether the simulation has ended; if so, jump to step 170; otherwise step 150 is executed in a loop.
Step 170, the simulation program ends, and the final simulation time is output.
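Steps 150-170 amount to an event-driven loop. The sketch below shows a minimal skeleton of such a driver, assuming a priority queue of timestamped events and taking the timestamp of the last processed event as the final simulation time; the event representation and handler interface are illustrative assumptions, not the patent's implementation.

```python
# Minimal event-driven driver for steps 150-170.
import heapq
from itertools import count

def run_simulation(initial_events, handle_event):
    """Pop events in time order until none remain; return the final time."""
    seq = count()                               # tie-breaker for equal timestamps
    queue = [(ts, next(seq), ev) for ts, ev in initial_events]
    heapq.heapify(queue)
    now = 0.0
    while queue:                                # step 160: loop until the simulation ends
        now, _, event = heapq.heappop(queue)
        for ts, new_event in handle_event(now, event):   # step 150: run the algorithm
            heapq.heappush(queue, (ts, next(seq), new_event))
    return now                                  # step 170: final simulation time
```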
As shown in fig. 2, the simulator operation algorithm in step 150 specifically includes:
Step 211, the parameters required for running are initialized, including the number of current nodes, the number of GPUs in each node, the current algorithm, the fusion size, and the like; in this embodiment the initialization function _init_parameters() is executed.
Step 212, running starts after parameter initialization is finished; the start-up function _start_run() is executed.
Step 213, a signal to start Allreduce (a type of collective communication and one of the main communication modes of data parallelism) is sent; specifically, the runtime is asked to execute the next Allreduce. Since every allreduce is handled by unified logic, a signal is sent indicating that the next (currently the first) allreduce can start executing.
Step 220, a message to be processed is received; the message may be an internal signal, an external signal, and the like. The next step is selected according to the signal type: jump to step 230 if it is an internal signal (typically a send signal) and to step 240 if it is an external signal (i.e. an externally arriving packet, typically a receive signal, referring to a communication message sent from outside to this node). In this embodiment a single unified function handleMessage() processes all kinds of signals.
Step 230, processing of the next new allreduce starts. In this embodiment the function that processes the next Allreduce is _handle_allreduce().
Step 231, the step of the allreduce is reset (a single collective communication such as an allreduce may be split into multiple steps, with certain dependencies between them) and the index of the allreduce is increased (the index is a unique sequence number, generally starting from 0 or 1 and increasing gradually); this is required because a new allreduce is about to be processed. In this embodiment the allreduce state is reset by the function _resetStep().
Step 240, the current allreduce is processed, and whether its tasks are completed is judged according to the step number of the current allreduce, including whether receive and send operations still need to be executed. The function _handle_step() of the current allreduce is executed in this embodiment.
Step 241, when all tasks of the current step have been completed, the current step is considered empty and can be skipped (i.e. execution of the next step begins); otherwise no operation is performed. That is, the current step either has no send tasks or has completed them all, and either has no receive tasks or has completed them all. Jump to step 261 if an internal signal is being processed, and to step 251 if an external signal is being processed; in this embodiment the skip-empty-step function _skip_step() is executed.
Step 251, it is determined whether dst in the external signal (the destination accelerator, i.e. the serial number of the GPU module that should receive the communication) matches the current node, whether the signal belongs to the current allreduce and step, and so on. If it matches the current node and belongs to the current allreduce and step, the signal is legal. If the allreduce index or allreduce step is greater than the current value, the packet is considered to have arrived early and is put into a buffer queue for later use. If it is smaller than the current value, or another error condition arises, a logic error in the program design is assumed, the error is reported and the whole program exits. In this embodiment the receive-validity check function _checkRecvValidPushBuff() is executed.
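A minimal sketch of this check, assuming each node tracks its current allreduce index and step and buffers packets that arrive early; the field names and the error handling are illustrative assumptions.

```python
# Classify an incoming external signal as in step 251.
def check_recv_valid(signal, node):
    """Return 'legal' or 'buffered', or raise on an error condition."""
    if signal["dst"] != node.rank:
        raise RuntimeError("packet delivered to the wrong accelerator")
    incoming = (signal["allreduce_index"], signal["allreduce_step"])
    current = (node.allreduce_index, node.allreduce_step)
    if incoming == current:
        return "legal"                      # belongs to the current allreduce and step
    if incoming > current:
        node.buffer.append(signal)          # early packet: keep it for later use
        return "buffered"
    # a packet from the past indicates a logic error in the program
    raise RuntimeError("stale allreduce index/step; aborting the simulation")
```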
Step 252, when the external signal is not null and step 251 judges it legal, it is regarded as a packet that the current step of the current allreduce needs to receive. After the parameters of the external signal are parsed, the receiving table of the current step is updated to indicate that this packet has been received, and the process jumps to step 262. The receive-process function _recvStep() is executed in this embodiment.
Step 261, if the current step needs to send communication data to other objects, the send-process function _sendStep() is executed.
Step 262, the completion status of the allreduce is checked. If the step number of the allreduce equals the rated number of steps and the step has completed all sending and receiving, the allreduce is judged complete and the process jumps to step 272; otherwise it jumps to step 271.
Step 271, the step index is updated when the current step has been completed. In this embodiment the function _inventeStep() that enters the next step of the Allreduce is executed, and the process goes to step 231.
step 272, it is determined whether all allreduces have ended, if so, the process goes to step 273, otherwise, the process goes to step 274.
Step 273, end the current overall procedure.
Step 274, when the current allreduce is completed, the starting process of the next allreduce is performed; in this embodiment the function _endStartNewAllreduce() that starts the next allreduce is executed.
Step 275, the data amount and communication time of the next communication are calculated according to whether the next iter (iteration) is started and the values of the data set.
Specifically, if the next communication belongs to the same iteration:
the next data line is selected, and its timestamp and traffic are acquired;
time = max (current simulator timestamp, ideal start time of the next communication);
if it does not belong to the same iteration:
time = max (current simulator timestamp, ideal start time of the next iteration).
Step 276, the allreduce start signal is scheduled to be sent at the communication time calculated in step 275, and the process goes to step 220.
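A minimal sketch of steps 274-276, assuming dataset records of (ideal start time, traffic) pairs as in the loader above, and that scheduling means emitting a timed start-allreduce event back into the message loop; all names are illustrative assumptions.

```python
# Compute the traffic and start time of the next allreduce (step 275).
def schedule_next_allreduce(now, records, next_index, same_iteration, next_iter_ideal_start):
    """Return (traffic, start_time) for the next allreduce."""
    ideal_start, traffic = records[next_index]           # next data line: (ideal start, traffic)
    if same_iteration:
        start_time = max(now, ideal_start)               # same-iteration rule
    else:
        start_time = max(now, next_iter_ideal_start)     # next-iteration rule
    return traffic, start_time

# Step 276: the caller would then push a "start allreduce" event carrying this
# traffic into the event queue at start_time and return to the message loop (step 220).
```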
Steps 110-170 provide the basic method and implementation process by which the simulator simulates a distributed deep learning training scene. Steps 211-276 preserve the dependency relationships (across iterations, allreduces and steps) of the packets sent and received in the training scene.
The network simulation method described above may be stored in a storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment by instructing the related hardware through a computer program, which may be stored in a storage medium; when the computer program is executed by a processor, the steps of the method embodiment are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth.
Based on the network simulation method for distributed training, the embodiment further provides a network simulation device for distributed training, as shown in fig. 3, including: host, compute accelerator, and network card NIC.
The Host is an independently running server or computer; in this embodiment a computer is used, which internally contains a GPU, a network card, a CPU and other components. The built-in routing module PCIERouter serves as the switching module for communication among all modules inside the Host and provides support for communication among those modules.
The computing accelerator, in this embodiment, uses a GPU, which may be plural, and the number is represented by num_gpu. The computing accelerator includes:
the message application module, which is the main module for message generation and destruction, is a packet receiving and sending program in this embodiment, and is configured to perform the receiving and sending of the message in step 150.
A GPU internal routing module, which is connected with the message application module, the PCIe port module, the Nvlink port module, the reduce module and the like, and provides support for communication among the modules inside the GPU.
PCIe port module PCIe. Refers to an access port where the GPU is connected to an external PCIe bus.
Nvlink port module Nvlink. Refers to the access port through which the GPU connects with other GPUs.
The reduce module app_cal. The computing module used to simulate the GPU performing the reduce computation.
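As an illustration of how these sub-modules might be wired together in a simulator, the sketch below models a GPU module whose internal router either delivers a message to the local reduce module or forwards it out of a port; the class and field names are assumptions, not the patent's implementation.

```python
# Illustrative object model of a computing accelerator (GPU) module.
class GpuModule:
    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id
        self.ports = {"pcie": None, "nvlink": None}   # filled in when the topology is built

    def route(self, message: dict):
        """GPU internal routing: deliver locally or push out of a port."""
        if message["dst"] == self.gpu_id:
            return self.reduce(message)               # local delivery to the reduce module
        port = "nvlink" if message.get("peer_on_same_host") else "pcie"
        return ("forward", port, message)

    def reduce(self, message: dict):
        """Simulated reduce computation on the received payload."""
        return ("reduced", self.gpu_id, sum(message["payload"]))

# Example usage:
gpu = GpuModule(0)
print(gpu.route({"dst": 0, "payload": [1, 2, 3]}))                       # ('reduced', 0, 6)
print(gpu.route({"dst": 1, "payload": [], "peer_on_same_host": True}))   # forwarded via nvlink
```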
The NIC, i.e. the network card, is the communication module connecting Host to Host and Host to Switch (a switch in the network topology); there may be several, and their number is denoted by num_nic. The NIC comprises:
Broadcast port module bcast. Used to implement the broadcast protocol inside the network card; it avoids the extra delay and wasted throughput of repeatedly distributing the same traffic.
An internal port module inner. Refers to the channel connecting the network card to the PCIe port module, used for communication between the network card and other modules.
Network card routing module NICROUTER. Used to realize the communication behavior among the different modules inside the network card.
Split protocol module Protocal. Used to simulate the packetization flow of communication protocols such as TCP/UDP/IB.
An output port module. Used to connect to a Host or Switch outside this Host, realizing the communication process between hosts.
Merge protocol module Bond. Realizes the reverse function of the split protocol module, combining multiple protocol packets back into the original data.
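A minimal sketch of the split/merge behaviour, assuming the split module cuts a message into MTU-sized packets tagged with a sequence number and the merge module reassembles them; the packet layout is an illustrative assumption and not a real TCP/UDP/IB implementation.

```python
# Split a message into packets and merge them back into the original payload.
def split_message(payload: bytes, msg_id: int, mtu: int = 1024):
    """Cut one message into (msg_id, seq, total, chunk) packets."""
    chunks = [payload[i:i + mtu] for i in range(0, len(payload), mtu)] or [b""]
    return [(msg_id, seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]

def merge_packets(packets):
    """Reassemble packets produced by split_message into the original payload."""
    packets = sorted(packets, key=lambda p: p[1])        # order by sequence number
    assert len(packets) == packets[0][2], "missing packets"
    return b"".join(chunk for _, _, _, chunk in packets)

# Round trip: merging the split packets recovers the original data.
assert merge_packets(split_message(b"A" * 3000, msg_id=7)) == b"A" * 3000
```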
The embodiment also provides a network simulation system for distributed training, and the network simulation device for distributed training is used.
The foregoing is merely an embodiment of the present invention, and the invention is not limited to the field of this embodiment. Specific structures and features well known in these schemes are not described in detail here: those skilled in the art know the prior art in the field before the application date or priority date, can apply the conventional experimental means of that time and, in light of the teaching of this application, can complete and implement this scheme with their own abilities, so some typical known structures or known methods should not become an obstacle for those skilled in the art to practice this application. It should be noted that modifications and improvements can be made by those skilled in the art without departing from the structure of the present invention; these should also be considered within the scope of the present invention and do not affect the effect of the implementation of the invention or the utility of the patent. The protection scope of this application shall be subject to the content of the claims, and the detailed description and the like in the specification may be used to interpret the content of the claims.

Claims (8)

1. A network simulation method for distributed training, comprising the steps of:
step 100, acquiring a compiled main operation program, a preset configuration file and input data, and starting the operation of the simulator;
step 110, reading parameters in a configuration file when the simulator runs;
step 120, constructing a topology, and constructing a complete topology and a routing rule according to the parameters read in the step 110;
step 130, selecting algorithms, namely selecting a plurality of algorithms as basic algorithms for running a program according to the read parameters in step 110;
step 140, data selection and construction, namely acquiring the name and batch size information of the training neural network according to the parameters in step 110, and selecting a corresponding data set file locally or generating the data set file in real time; generating a data format readable by a subsequent program according to the data set file processing;
step 150, a simulator runs an algorithm, and alternately simulates the behavior of the collective communication algorithm according to the information provided by the steps 120-140 and different data sizes and times, and specifically comprises the following steps:
step 211, initializing parameters required by operation; the parameters required by operation comprise the number of current nodes, the number of GPUs in the nodes, the current algorithm and the fusion size;
step 212, starting to operate after the parameter initialization is finished;
step 213, sending a signal for starting Allreduce;
step 220, receiving a message to be processed, wherein the message comprises an internal signal and an external signal; selecting a next process according to the signal type, jumping to step 230 if the signal is an internal signal, and jumping to step 240 if the signal is an external signal;
step 230, starting to process the next new allreduce;
step 231, resetting step of allreduce, and increasing index of allreduce;
step 240, processing the current allreduce, and judging whether the task is completed according to the step number of the current allreduce;
step 241, when all tasks of the current step are completed, starting to execute the next step, otherwise performing no operation; if an internal signal is being processed, jump to step 261; if an external signal is being processed, jump to step 251;
step 251, judging whether the target accelerator in the external signal is identical to the current node and whether the signal belongs to the current allreduce and step;
if it is identical to the current node and belongs to the current allreduce and step, the signal is legal;
if the allreduce index or allreduce step is larger than the current value, the packet is considered to have arrived early and is put into a buffer queue for later use;
if the allreduce index or allreduce step is smaller than the current value, an error is reported directly and the whole program exits;
step 252, when the external signal is not null and is judged legal in step 251, the parameters of the external signal are parsed, the receiving table of the current step is updated, and the process goes to step 262;
step 261, if the current step needs to send communication data to other objects, executing a function of the sending process;
step 262, detecting the completion status of allreduce, if the step number of allreduce is the rated step number and the step has completed all the sending and receiving processes, determining that allreduce has been completed, and skipping to step 272; otherwise jump to step 271;
step 271, when the current step has been completed, updating the step index, and jumping to step 231;
step 272, determining whether all allreduces have ended, if so, jumping to step 273, otherwise jumping to step 274;
step 273, ending the current whole program;
step 274, when the current allreduce is completed, the next allreduce starting process is performed;
step 275, calculating the data volume and communication time of the next communication according to whether the next iteration is started and the values of the data set;
step 276, scheduling the allreduce start signal to be sent at the communication time calculated in step 275, and jumping to step 220;
Step 160, judging whether the simulator is simulated to be ended, and if so, jumping to step 170; otherwise, loop execution step 150;
step 170, the simulation program ends, and the final simulation time is output.
2. The network simulation method for distributed training of claim 1, wherein: in step 110, the parameters include an application name, a topology parameter, an algorithm name, a data name, a training neural network name, and a batch size.
3. The network simulation method for distributed training of claim 1, wherein: the data amount and the communication time of the next communication calculated in the step 275 are specifically:
if the next communication belongs to the same iteration, selecting the next data line and acquiring its timestamp and traffic;
the communication time is equal to max (current simulator timestamp, ideal start time of the next communication);
if it does not belong to the same iteration: the communication time is equal to max (current simulator timestamp, ideal start time of the next iteration).
4. A network simulation device for distributed training, using the method of any of claims 1-3, characterized by comprising a host, a computing accelerator and a network card;
the host comprises an in-host routing module for providing support for communication between the internal modules of the host;
the computing accelerator comprises a message application module and a reduce module, wherein the message application module is used for receiving and sending messages; the reduce module is used for simulating the GPU to perform reduce calculation;
the network card is used for connecting the host computers.
5. The network simulation apparatus for distributed training of claim 4, wherein: the computing accelerator further comprises a PCIe port module, an Nvlink port module and a GPU internal routing module;
the PCIe port module is an access port for connecting the computing accelerator with an external PCIe bus;
the Nvlink port module is the access port through which the computing accelerator connects to other computing accelerators;
the GPU internal routing module is connected with the message application module, the PCIe port module, the Nvlink port module, the reduce module and the like, and is used for providing support for communication among the internal modules of the computing accelerator.
6. The network simulation apparatus for distributed training of claim 4, wherein: the network card comprises:
the broadcast port module is used for realizing a broadcast protocol in the network card;
the internal port module is used for communication between the network card and other external modules;
the network card routing module is used for realizing communication among different modules in the network card;
the splitting protocol module is used for simulating the packetization flow of the TCP/UDP/IB communication protocol;
the outlet port module is used for connecting with a host outside the host to realize communication between the hosts;
and the merging protocol module is used for merging the plurality of protocol packets into original data.
7. A network simulation system for distributed training, characterized by: use of a network simulation device for distributed training according to any of the claims 4-6.
8. A storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1-3.
CN202210937411.8A 2022-08-05 2022-08-05 Network simulation method, system, device and storage medium for distributed training Active CN115314397B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210937411.8A CN115314397B (en) 2022-08-05 2022-08-05 Network simulation method, system, device and storage medium for distributed training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210937411.8A CN115314397B (en) 2022-08-05 2022-08-05 Network simulation method, system, device and storage medium for distributed training

Publications (2)

Publication Number Publication Date
CN115314397A CN115314397A (en) 2022-11-08
CN115314397B (en) 2023-07-21

Family

ID=83860217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210937411.8A Active CN115314397B (en) 2022-08-05 2022-08-05 Network simulation method, system, device and storage medium for distributed training

Country Status (1)

Country Link
CN (1) CN115314397B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103237321A (en) * 2013-05-08 2013-08-07 重庆金美通信有限责任公司 Physical layer channel analog method for testing network protocol stacks
WO2022111042A1 (en) * 2020-11-28 2022-06-02 苏州浪潮智能科技有限公司 Multi-node distributed training method and apparatus, device and readable medium
CN114647515A (en) * 2022-04-12 2022-06-21 杭州电子科技大学 GPU cluster-oriented dynamic resource scheduling method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956021B (en) * 2016-04-22 2019-05-21 华中科技大学 A kind of automation task suitable for distributed machines study parallel method and its system
US11270201B2 (en) * 2017-12-29 2022-03-08 Intel Corporation Communication optimizations for distributed machine learning
CN111510381B (en) * 2020-04-23 2021-02-26 电子科技大学 Service function chain deployment method based on reinforcement learning in multi-domain network environment
CN113705801A (en) * 2020-05-22 2021-11-26 华为技术有限公司 Training device and method of neural network model and related equipment
CN114580664A (en) * 2022-03-03 2022-06-03 字节跳动(香港)有限公司 Training analysis method and device, storage medium and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103237321A (en) * 2013-05-08 2013-08-07 重庆金美通信有限责任公司 Physical layer channel analog method for testing network protocol stacks
WO2022111042A1 (en) * 2020-11-28 2022-06-02 苏州浪潮智能科技有限公司 Multi-node distributed training method and apparatus, device and readable medium
CN114647515A (en) * 2022-04-12 2022-06-21 杭州电子科技大学 GPU cluster-oriented dynamic resource scheduling method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Performance Optimization of Distributed Training Schemes for Deep Learning Networks; 张泽超; Master's Electronic Journal Publication Information; full text *

Also Published As

Publication number Publication date
CN115314397A (en) 2022-11-08

Similar Documents

Publication Publication Date Title
US6925431B1 (en) Method and system for predicting communication delays of detailed application workloads
Gawłowicz et al. ns3-gym: Extending openai gym for networking research
US7564809B1 (en) Event-synchronization protocol for parallel simulation of large-scale wireless networks
WO2022267854A1 (en) Method, system and apparatus for processing quantum computing task, and operating system
CN110381051A (en) A kind of method of packet parsing, system, equipment and computer readable storage medium
Dichev et al. Optimization of collective communication for heterogeneous hpc platforms
Silva et al. An investigation of latency prediction for NoC-based communication architectures using machine learning techniques
CN115314397B (en) Network simulation method, system, device and storage medium for distributed training
CN114564882A (en) Construction and application of edge deep learning simulator based on discrete events
CN116670660A (en) Simulation model generation method and device for network on chip, electronic equipment and computer readable storage medium
CN117311975A (en) Large model parallel training method, system and readable storage medium
CN115618532A (en) Network system simulation method and related device
CN113259482B (en) Many-to-many communication mode optimization method and device, storage medium and electronic equipment
CN114189454B (en) Evaluation method, framework, device and electronic equipment of network scheduling strategy
CN118101493B (en) Simulation optimizing method, device, equipment and medium for intelligent computation center network architecture
Sommer et al. Ikr simulation library
Volnes Distributed Stream Processing: Performance Evaluation and Enhanced Operator Migration
CN115981790B (en) Kubernetes-based containerized scheduling system for edge cloud cluster resource optimization
WO2022222944A1 (en) Method and apparatus for adaptating to quantum computing platform, and quantum computer operating system
Hintelmann et al. Applying techniques and tools for the performance engineering of SDL systems
Navaridas et al. Realistic evaluation of interconnection networks using synthetic traffic
Susukita et al. NSIM-ACE: an interconnection network simulator for evaluating remote direct memory access
Pani et al. A fast MPI-based parallel framework for cycle-accurate HDL multi-parametric simulations
Andújar Muñoz et al. Extending the VEF traces framework to model data center network workloads
Gergel Introduction to Parallel Programming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant