CN115437760A - Computing resource allocation method, electronic device, storage medium, and program product - Google Patents

Computing resource allocation method, electronic device, storage medium, and program product Download PDF

Info

Publication number
CN115437760A
CN115437760A (application CN202210888391.XA)
Authority
CN
China
Prior art keywords
communication
computing
task
tasks
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210888391.XA
Other languages
Chinese (zh)
Inventor
高华佐
丁劭华
王彪
许欣然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202210888391.XA priority Critical patent/CN115437760A/en
Publication of CN115437760A publication Critical patent/CN115437760A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The application discloses a computing resource allocation method, an electronic device, a storage medium, and a program product, relating to the field of artificial intelligence and, in particular, to model training. The method is used for training a neural network model and comprises the following steps: obtaining the computation time for a sub-model on a training node to execute a computation task and the communication time to execute a communication task; deriving a number of delay steps from the computation time and the communication time; obtaining a scheduling scheme for the plurality of computation tasks and communication tasks executed on the training nodes according to the number of delay steps, the size of the data samples, and the number of training nodes; generating a scheduling instruction sequence for the plurality of computation tasks and communication tasks according to the scheduling scheme; and controlling the training nodes, according to the scheduling instruction sequence, to train the sub-models deployed on them using the data samples. This improves the training efficiency of distributed training of the neural network model.

Description

Computing resource allocation method, electronic device, storage medium, and program product
Technical Field
The present application relates generally to the field of artificial intelligence and, more particularly, to model training, and specifically to a computing resource allocation method, an electronic device, a storage medium, and a program product.
Background
With the continuous development of deep learning, the number of parameters in neural network models has grown steadily. This growth requires multiple computing devices for distributed training: the storage and computing resources of multiple devices are used to train the model with model-parallel distributed training techniques.
Related-art distributed training mainly uses tensor model parallelism and pipeline model parallelism. Both divide the overall model computation into subtasks and assign the subtasks to different devices for execution. Tensor model parallelism splits a single operator across devices, while pipeline model parallelism assigns different layers of the model to different devices; communication operators exchange intermediate results so that the devices jointly complete the computation required for large-model training. Tensor model parallelism places relatively high demands on inter-device communication bandwidth and trains inefficiently when bandwidth is insufficient. Pipeline model parallelism transfers less data, but its communication ratio is high and computation cannot overlap with communication, so communication overhead remains high and training efficiency suffers.
Disclosure of Invention
In view of the foregoing drawbacks and deficiencies of the prior art, it is desirable to provide a computing resource allocation method, an electronic device, a storage medium, and a program product that improve the training efficiency of distributed training of a neural network model.
In a first aspect, the present application provides a computing resource allocation method for training a neural network model, where the neural network model comprises a plurality of sub-models deployed on a plurality of training nodes in a one-to-one correspondence, the method comprising:
obtaining the computation time for the sub-model on a training node to execute a computation task and the communication time to execute a communication task;
deriving a number of delay steps from the computation time and the communication time;
obtaining a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, where the size of the data sample comprises the total size of the data sample and the size of the sub-data into which the data sample is split;
generating a scheduling instruction sequence for the plurality of computation tasks and the plurality of communication tasks according to the scheduling scheme;
and controlling the training nodes, according to the scheduling instruction sequence, to train the sub-models deployed on them using the data samples.
In some examples, the deriving a number of delay steps from the computation time and the communication time includes:
comparing the computation time with the communication time;
if the communication time is less than the computation time, setting the number of delay steps to a first number of delay steps, and otherwise setting it to a second number of delay steps, where the first number of delay steps is less than the second.
In some examples, the first number of delay steps is 2 and the second number of delay steps is 3.
In some examples, the obtaining a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes includes:
obtaining the plurality of computation tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, and ordering the computation tasks;
setting a communication task for each of the plurality of computation tasks to obtain an ordering of the plurality of computation tasks and the plurality of communication tasks;
and obtaining the scheduling scheme for the plurality of computation tasks and the plurality of communication tasks according to that ordering.
In some examples, the generating a scheduling instruction sequence for the plurality of computation tasks and the plurality of communication tasks according to the scheduling scheme includes:
converting each of the plurality of computation tasks and communication tasks into a scheduling instruction according to their ordering in the scheduling scheme,
and obtaining the scheduling instruction sequence from the instructions so converted.
In some examples, the setting a communication task for each of the plurality of computation tasks to obtain the ordering of the plurality of computation tasks and the plurality of communication tasks includes:
judging whether the communication task has a dependency relationship with the computation task;
if the communication task and the computation task have no dependency relationship, arranging them in parallel;
and if they have a dependency relationship, arranging them in series.
In some examples, the judging whether the communication task has a dependency relationship with the computation task includes:
if the data to be communicated by the communication task is generated by executing the computation task, or the computation task is executed on the data received by the communication task, the communication task and the computation task have the dependency relationship; otherwise they do not.
In a second aspect, the present application provides a computing resource allocation apparatus for training a neural network model, where the neural network model comprises a plurality of sub-models deployed on a plurality of training nodes in a one-to-one correspondence, the apparatus comprising:
an obtaining module, configured to obtain the computation time for the sub-model on a training node to execute a computation task and the communication time to execute a communication task;
a delay step number calculation module, configured to derive a number of delay steps from the computation time and the communication time;
a scheduling scheme obtaining module, configured to obtain a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, where the size of the data sample comprises the total size of the data sample and the size of the sub-data into which the data sample is split;
a scheduling instruction generating module, configured to generate a scheduling instruction sequence for the plurality of computation tasks and the plurality of communication tasks according to the scheduling scheme;
and a scheduling instruction executing module, configured to control the training nodes, according to the scheduling instruction sequence, to train the sub-models deployed on them using the data samples.
In some examples, the delay step number calculation module is specifically configured to:
compare the computation time with the communication time;
if the communication time is less than the computation time, set the number of delay steps to a first number of delay steps, and otherwise set it to a second number of delay steps, where the first number of delay steps is less than the second.
In some examples, the scheduling scheme obtaining module is specifically configured to:
obtain the plurality of computation tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, and order the computation tasks;
set a communication task for each of the plurality of computation tasks to obtain an ordering of the plurality of computation tasks and the plurality of communication tasks;
and obtain the scheduling scheme for the plurality of computation tasks and the plurality of communication tasks according to that ordering.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the computing resource allocation method as described in the embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the computing resource allocation method as described in the present application.
In a fifth aspect, the present application provides a computer program product, on which a computer program is stored, and the computer program, when executed by a processor, implements the computing resource allocation method as described in the embodiments of the present application.
According to the computing resource allocation method, electronic device, storage medium, and program product of the present application, a number of delay steps is determined from the difference between the computation time for a sub-model on a training node to execute a computation task and the communication time to execute a communication task. A scheduling scheme for the plurality of computation tasks and communication tasks on each training node is then determined from the number of delay steps, the size of the data samples, and the number of training nodes; a scheduling instruction sequence is generated for these tasks according to the scheduling scheme; and finally the training nodes are controlled, according to the scheduling instruction sequence, to train the sub-models deployed on them using the data samples. Under this scheduling scheme, computation tasks and communication tasks that do not depend on each other can be executed in parallel during model training, which reduces communication overhead during distributed training and improves the training efficiency of distributed training of the neural network model.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic flowchart of a computing resource allocation method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a scheduling scheme with a delay step number of 2 for a computing resource allocation method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a scheduling scheme with a delay step number of 3 for a method for allocating computing resources according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the executor executing a command sequence in a computing resource allocation method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computing resource allocation apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The application provides a computing resource allocation method, an electronic device, a storage medium, and a program product that allow computation tasks and communication tasks without dependency relationships to be executed in parallel during model training, thereby reducing network communication overhead during distributed training and improving the training efficiency of distributed training of a neural network model.
In an implementation environment of the embodiments of the present application, a personal computer or similar device obtains the computation time for a sub-model on a training node to execute a computation task and the communication time to execute a communication task; derives a number of delay steps from the computation time and the communication time; obtains a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, where the size of the data sample comprises the total size of the data sample and the size of the sub-data into which it is split; generates a scheduling instruction sequence for these tasks according to the scheduling scheme; and controls the training nodes, according to the scheduling instruction sequence, to train the sub-models deployed on them using the data samples.
The embodiment of the application provides a computing resource allocation method, which is used for training a neural network model, wherein the neural network model comprises a plurality of submodels, and the submodels are deployed on a plurality of training nodes in a one-to-one correspondence manner.
In the following description:
stream: a data structure in a parallel computing framework that executes host-issued commands in order;
batch_size: the batch size of the training data of the neural network model (i.e., the total size of the data samples);
micro_batch_size: training data (i.e., data samples) with a large batch size are split into several small batches; during training, the gradient of the model parameters is computed separately for each small batch, and the gradients are then summed or averaged. The batch size of a small batch is micro_batch_size;
tensor: a tensor, i.e., a high-dimensional array;
forward (F): a forward computation task, the forward pass of the neural network model;
backward (B): a backward computation task, the backward pass of the neural network model;
send (S): a sending task that sends a tensor (i.e., the output of a training node) to another device (i.e., another training node);
receive (R): a receiving task that receives a tensor from another device;
pipeline stage: the complete deep neural network model is split into several sub-models that must be executed in sequence; each sub-model is one pipeline stage;
stage_id: the unique identifier of a pipeline stage;
command queue: a container storing a group of commands to be executed in sequence;
tensor storage: a container storing tensors as key-value pairs;
event: a synchronization marker used to monitor the execution progress of device tasks;
event storage: a container storing events as key-value pairs.
The computing resource allocation method may be applied to a computer device. It aims to solve the problem of poor training performance caused by high communication overhead in distributed training of neural network models. In the related art, when bandwidth between computing devices (i.e., training nodes) is low, pipeline parallelism performs better because it transfers less data. However, in related-art pipeline-parallel scheduling, data dependencies between adjacent computation tasks (forward and backward tasks) and communication tasks (send and receive tasks) force serial execution, so overall performance remains poor. The computing resource allocation method arranges the execution order of computation and communication tasks so that tasks without dependency relationships execute in parallel, effectively reducing communication overhead and thereby improving the training efficiency of distributed training of the neural network model. The method can effectively improve resource utilization and the training process when communication bandwidth is limited.
The computing resource allocation method arranges the execution interval between a forward computation task and the backward computation task that depends on it so that computation tasks and communication tasks can execute in parallel, improving the efficiency of distributed training. As shown in fig. 1, the method comprises the following steps:
101. Obtain the computation time for the sub-model on the training node to execute a computation task and the communication time to execute a communication task.
This step presupposes distributed training of the neural network: the neural network model comprises a plurality of sub-models deployed on a plurality of training nodes in a one-to-one correspondence. Specifically, when the neural network model is split and trained with pipeline-parallel distribution across N computing devices (i.e., N training nodes), the model is split evenly into N pipeline stages (i.e., N sub-models), which are deployed on the N training nodes in a one-to-one correspondence; the serial number of each pipeline stage is denoted stage_id, with ids 0 to N-1. During pipelined execution, forward computation requires the N stages to execute forward tasks in order of increasing stage_id: each stage sends its result to the next stage as input through a communication task (a send task), and the next stage receives the previous stage's result through a communication task (a receive task). Similarly, backward computation of the gradient executes in order of decreasing stage_id, with each stage sending the gradient to the previous stage as input through a communication task.
On this basis, the computation time for the sub-model on a training node to execute a computation task and the communication time to execute a communication task are obtained: for example, the time spent by a single pipeline stage to execute a single forward, backward, send, and receive (i.e., the computation time and the communication time) is measured on the actually deployed computing device.
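As an illustration of this measurement step, the sketch below times a single task on the deployed device. The callables run_forward and run_send and their arguments are hypothetical placeholders; the patent does not prescribe a profiling API.

```python
import time

def measure_task_time(task_fn, *args, warmup=3, repeats=10):
    """Return the average wall-clock time of one task execution.

    task_fn is any callable representing a single forward, backward,
    send, or receive task (hypothetical placeholders; the disclosure
    does not mandate a specific interface).
    """
    for _ in range(warmup):          # warm up caches / lazy initialization
        task_fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        task_fn(*args)
    return (time.perf_counter() - start) / repeats

# Example: computation time vs. communication time on one pipeline stage.
# compute_time = measure_task_time(run_forward, micro_batch)
# comm_time    = measure_task_time(run_send, output_tensor)
```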
102. Derive the number of delay steps from the computation time and the communication time.
In one embodiment of the invention, the computation time is compared with the communication time; if the communication time is less than the computation time, the number of delay steps is set to a first number of delay steps, and otherwise to a second number of delay steps, where the first is less than the second. In a specific example, the first number of delay steps is, but is not limited to, 2, and the second is, but is not limited to, 3.
Specifically, it is judged whether the time spent by a single send and receive operation (the communication time) is less than the computation time of a single forward computation task; if so, the number of delay steps k is, for example, 2, and otherwise k is, for example, 3.
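A minimal sketch of this rule, using the example values k = 2 and k = 3 from the embodiment:

```python
def delay_steps(compute_time: float, comm_time: float,
                first_delay: int = 2, second_delay: int = 3) -> int:
    """Choose the number of delay steps k.

    If a single send/receive is faster than a single forward pass,
    communication can be hidden with the smaller delay; otherwise a
    larger delay is used (the 2 and 3 are example values from the
    embodiment, not fixed requirements).
    """
    return first_delay if comm_time < compute_time else second_delay
```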
103. Obtain a scheduling scheme for the plurality of computation tasks and communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, where the size of the data sample comprises the total size of the data sample and the size of the sub-data into which it is split.
In a specific example, obtaining a scheduling scheme for the plurality of computation tasks and communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes includes: obtaining the plurality of computation tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, and ordering the computation tasks; setting a communication task for each of the plurality of computation tasks to obtain an ordering of the plurality of computation tasks and the plurality of communication tasks; and obtaining the scheduling scheme for the plurality of computation tasks and the plurality of communication tasks according to that ordering.
In this example, setting a communication task for each of the plurality of computation tasks to obtain the ordering of the plurality of computation tasks and the plurality of communication tasks includes: judging whether the communication task has a dependency relationship with the computation task; if they have no dependency relationship, arranging the communication task and the computation task in parallel; and if they have a dependency relationship, arranging them in series.
Judging whether the communication task has a dependency relationship with the computation task includes: if the data to be communicated by the communication task is generated by executing the computation task, or the computation task is executed on the data received by the communication task, the two tasks have a dependency relationship; otherwise they do not.
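Transcribed directly into code, the dependency test and the resulting serial/parallel arrangement might look like the following sketch; the .data, .outputs, and .inputs attributes are illustrative representations chosen here, not structures mandated by the disclosure.

```python
def has_dependency(comm_task, compute_task) -> bool:
    """True iff the communicated data is produced by the computation
    task, or the computation task consumes the data the communication
    task receives (the rule stated above)."""
    return (comm_task.data in compute_task.outputs or
            comm_task.data in compute_task.inputs)

def assign_stream(comm_task, compute_task, compute_stream, comm_stream):
    # Dependent tasks stay on the computation stream (serial execution);
    # independent tasks go to a separate stream (parallel execution).
    if has_dependency(comm_task, compute_task):
        return compute_stream
    return comm_stream
```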
Specifically, a scheduling scheme for the plurality of computation tasks and communication tasks executed on the training nodes is generated from the number of delay steps k (i.e., delay-k), the total number of pipeline stages N (i.e., the number of training nodes), batch_size (i.e., the total size of the data samples), and micro_batch_size (i.e., the size of the sub-data into which the data samples are split).
In one complete training iteration (computing the gradient of the model parameters for training data of batch size batch_size), each pipeline stage must execute batch_size/micro_batch_size forward computation tasks and the same number of backward computation tasks. Task scheduling must arrange these computation tasks and their corresponding communication tasks appropriately.
The scheduling scheme, i.e., the arrangement, is generated in the following sub-steps:
First, arrange the order of the forward and backward tasks. In a specific example, the order is arranged such that the number of computation tasks in the interval between a forward task and the backward task that depends on it is (N-1-X)K.
Then, arrange the corresponding send and/or receive tasks for each forward and backward task; computation and communication tasks without dependencies can be arranged in parallel on different streams, yielding the delay-k scheduling arrangement (i.e., the scheduling scheme). Schematics of the delay-2 (delay step number 2) and delay-3 (delay step number 3) arrangements are shown in figs. 2 and 3, respectively, where the horizontal axis is time, the vertical axis is the pipeline stages with different stage_ids, F denotes forward, B denotes backward, S denotes send, and R denotes receive. As figs. 2 and 3 show, in the steady state (the phase in which forward and backward tasks alternate), delay-2 and delay-3 hide communication overhead by executing the forward/backward tasks and the send/receive tasks on different streams, improving training efficiency.
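For illustration, the sketch below generates the F/B order for one pipeline stage in the spirit of figs. 2 and 3. The warm-up count k + (N - 1 - stage_id) is a simplifying assumption made here for the sketch; the patent's own interval formula is the (N-1-X)K expression above. Send/receive tasks would then be attached to each F/B task and placed on streams per the dependency rule sketched earlier.

```python
def stage_schedule(stage_id: int, n_stages: int, k: int,
                   batch_size: int, micro_batch_size: int):
    """Sketch: F/B task order for one pipeline stage under delay-k.

    Assumption for this sketch: the stage runs some warm-up forward
    tasks, then alternates one backward / one forward (the steady
    state of figs. 2 and 3), then drains the remaining backwards.
    """
    n_micro = batch_size // micro_batch_size   # tasks per iteration
    warmup = min(n_micro, k + (n_stages - 1 - stage_id))
    order = [("F", i) for i in range(warmup)]  # warm-up forwards
    f, b = warmup, 0
    while b < n_micro:                         # steady state + drain
        order.append(("B", b)); b += 1
        if f < n_micro:
            order.append(("F", f)); f += 1
    return order
```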
104. Generate a scheduling instruction sequence for the plurality of computation tasks and communication tasks according to the scheduling scheme.
In a specific example, each of the plurality of computation tasks and communication tasks is converted into a scheduling instruction according to their ordering in the scheduling scheme, and the scheduling instruction sequence is obtained from the instructions so converted.
Specifically, the delay-k scheduling arrangement is translated into a command sequence convenient for the executor to execute. All tasks in the arrangement are sorted in a topological order of their data dependencies, and each task is converted into a corresponding command, giving the command sequence required by the executor. The attributes of a command and their meanings are shown in Table 1.
TABLE 1
[Table 1 is reproduced in the original publication only as an image; it lists the command attributes referenced below (such as type, tag, wait, record, reader_stream, and comm_group) together with their meanings.]
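Based on the attributes this section references (type, tag, wait, record, reader_stream, comm_group), a command could be represented as in the sketch below; the field types and defaults are assumptions, since Table 1 itself survives only as an image.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    type: str                            # "forward" | "backward" | "send" | "receive"
    tag: str                             # key for the tensor in tensor storage
    wait: Optional[str] = None           # event key to wait on before executing
    record: Optional[str] = None         # event key to create after executing
    reader_stream: Optional[str] = None  # stream that will consume the output
    comm_group: Optional[str] = None     # batched-communication group id
```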
105. Control the training node, according to the scheduling instruction sequence, to train the sub-models deployed on it using the data samples.
That is, model training proceeds according to the command sequence generated from the delay-k scheduling scheme.
A schematic diagram of the executor running the command sequence is shown in fig. 4. The executor consists of a command queue, a tensor storage (storing tensors generated by command execution), and an event storage (storing events generated by command execution). The executor first preprocesses the command sequence and marks the first and last commands of each group that must be executed as one batch: among commands with the same comm_group attribute value, the first is marked as the start of the group and the last as the end of the group. After preprocessing, the executor traverses and executes each command in the command queue. The processing flow for one command is as follows:
and judging whether the command wait attribute value is null, if not, acquiring an event from the event _ storage by taking the wait attribute value as a key, and continuing to execute the subsequent flow after the event execution state is finished.
And judging whether the command is a group start command or not, and if so, calling a group _ start function to enter the batch processing communication operation area.
And calling a corresponding task execution function according to the command type attribute. type is forward, backward, receive, tenor generated by task execution is stored in tenor _ storage, and a storage key is a command tag attribute. And if the type is send, taking out and deleting the tenar corresponding to the tag attribute of the command from the tenar _ storage, and sending the tenar.
And judging whether the command is a group ending command or not, and if so, calling a group _ end function to exit the batch processing communication operation area.
And judging whether the attribute value of the command reader _ stream is null, and if not, moving the pointer generated by the command execution to the stream where the reader _ stream is located.
And judging whether the command record attribute is null, if the command record attribute is not null, creating a device event, storing the device event into the event _ storage, wherein a storage key is the value of the command record attribute.
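Putting the six steps together, a minimal per-command executor loop could look like the following sketch; every callable and the is_group_start/is_group_end flags (set during the preprocessing described above) are placeholders mirroring the description, not a definitive implementation.

```python
from typing import Callable, Dict

def run_command(cmd, tensor_storage: Dict, event_storage: Dict,
                task_fns: Dict[str, Callable], group_start: Callable,
                group_end: Callable, create_event: Callable,
                move_to_stream: Callable):
    """Sketch of the executor's per-command flow (steps 1-6 above)."""
    # 1. Wait on a prerequisite event, if any.
    if cmd.wait is not None:
        event_storage[cmd.wait].synchronize()
    # 2. Enter the batched-communication region at a group start.
    if cmd.is_group_start:
        group_start()
    # 3. Dispatch on the command type.
    if cmd.type in ("forward", "backward", "receive"):
        tensor_storage[cmd.tag] = task_fns[cmd.type](cmd)
    elif cmd.type == "send":
        task_fns["send"](tensor_storage.pop(cmd.tag))
    # 4. Leave the batched-communication region at a group end.
    if cmd.is_group_end:
        group_end()
    # 5. Hand the produced tensor over to the consuming stream.
    if cmd.reader_stream is not None and cmd.tag in tensor_storage:
        tensor_storage[cmd.tag] = move_to_stream(tensor_storage[cmd.tag],
                                                 cmd.reader_stream)
    # 6. Record a completion event, if requested.
    if cmd.record is not None:
        event_storage[cmd.record] = create_event()
```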
According to the computing resource allocation method provided by the embodiment of the invention, a number of delay steps is determined from the difference between the computation time for the sub-model on a training node to execute a computation task and the communication time to execute a communication task. The scheduling schemes for the plurality of computation tasks and communication tasks on each training node are then determined from the number of delay steps, the size of the data sample, and the number of training nodes; scheduling instruction sequences are generated for these tasks according to the scheduling schemes; and finally the training nodes are controlled, according to the scheduling instruction sequences, to train the sub-models deployed on them using the data samples. Under these scheduling schemes, computation tasks and communication tasks that do not depend on each other can execute in parallel during model training, which reduces communication overhead during distributed training and improves the training efficiency of distributed training of the neural network model.
FIG. 5 is a block diagram of a computing resource allocation apparatus according to an embodiment of the present application.
As shown in fig. 5, the computing resource allocation apparatus is used for training the neural network model and includes an obtaining module 510, a delay step number calculation module 520, a scheduling scheme obtaining module 530, a scheduling instruction generating module 540, and a scheduling instruction executing module 550, wherein:
the obtaining module 510 is configured to obtain the computation time for a sub-model on a training node to execute a computation task and the communication time to execute a communication task;
the delay step number calculation module 520 is configured to derive a number of delay steps from the computation time and the communication time;
the scheduling scheme obtaining module 530 is configured to obtain a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of a data sample, and the number of training nodes, where the size of the data sample comprises the total size of the data sample and the size of the sub-data into which the data sample is split;
the scheduling instruction generating module 540 is configured to generate a scheduling instruction sequence for the plurality of computation tasks and the plurality of communication tasks according to the scheduling scheme;
and the scheduling instruction executing module 550 is configured to control the training node, according to the scheduling instruction sequence, to train the sub-models deployed on it using the data samples.
In an embodiment of the present invention, the delay step number calculation module 520 is specifically configured to:
compare the computation time with the communication time;
if the communication time is less than the computation time, set the number of delay steps to a first number of delay steps, and otherwise set it to a second number of delay steps, where the first number of delay steps is less than the second.
In an embodiment of the present invention, the scheduling scheme obtaining module 530 is specifically configured to:
obtain the plurality of computation tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, and order the computation tasks;
set a communication task for each of the plurality of computation tasks to obtain an ordering of the plurality of computation tasks and the plurality of communication tasks;
and obtain the scheduling scheme for the plurality of computation tasks and the plurality of communication tasks according to that ordering.
According to the computing resource allocation apparatus provided by the embodiment of the invention, a number of delay steps is determined from the difference between the computation time for the sub-model on a training node to execute a computation task and the communication time to execute a communication task. The scheduling schemes for the plurality of computation tasks and communication tasks on each training node are then determined from the number of delay steps, the size of the data sample, and the number of training nodes; scheduling instruction sequences are generated for these tasks according to the scheduling schemes; and finally the training nodes are controlled, according to the scheduling instruction sequences, to train the sub-models deployed on them using the data samples. Under these scheduling schemes, computation tasks and communication tasks that do not depend on each other can execute in parallel during model training, which reduces communication overhead during distributed training and improves the training efficiency of distributed training of the neural network model.
It should be understood that the units recited in the computing resource allocation apparatus correspond to the respective steps of the computing resource allocation method described with reference to fig. 1. Thus, the operations and features described above for the method also apply to the apparatus and the units it comprises and are not repeated here. The apparatus may be implemented in advance in a browser or another security application of the computer device, or loaded into one by downloading or other means. Corresponding units of the apparatus may cooperate with units of the computer device to implement the solutions of the embodiments of the present application.
The division into several modules or units mentioned in the detailed description above is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.
It should be noted that, please refer to the details disclosed in the above embodiments of the present application for details not disclosed in the computing resource allocation apparatus in the embodiment of the present application, which are not described herein again.
Referring now to fig. 6, which shows a schematic block diagram of a computer device suitable for implementing embodiments of the present application: as shown in fig. 6, the computer system 1300 includes a central processing unit (CPU) 1301 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage portion 1308 into a random access memory (RAM) 1303. The RAM 1303 also stores the various programs and data necessary for the operation of the system. The CPU 1301, ROM 1302, and RAM 1303 are connected to one another via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
The following components are connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output portion 1307 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker; a storage portion 1308 including a hard disk and the like; and a communication portion 1309 including a network interface card such as a LAN card or a modem. The communication portion 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, is mounted on the drive 1310 as necessary, so that a computer program read from it can be installed into the storage portion 1308 as needed.
In particular, according to embodiments of the present application, the process described above with reference to the flowchart of fig. 1 may be implemented as a computer software program. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1309 and/or installed from the removable medium 1311. When executed by the central processing unit (CPU) 1301, the computer program performs the above-described functions defined in the system of the present application.
It should be noted that the computer-readable medium shown in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, fiber-optic cable, RF, or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code comprising one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor and may be described as: a processor includes a first receiving module, a second receiving module, and a sending module. The designation of a unit or module does not in itself constitute a limitation of that unit or module.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the computer device described in the above embodiments, or may exist separately without being assembled into the computer device. The computer-readable storage medium stores one or more programs which, when executed by one or more processors, perform the computing resource allocation methods described herein.
As another aspect, the present application also provides a computer program product, which may be included in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer program product stores one or more programs that, when executed by one or more processors, perform the computing resource allocation method described herein.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A computing resource allocation method for training a neural network model, the neural network model comprising a plurality of sub-models deployed on a plurality of training nodes in a one-to-one correspondence, the method comprising:
obtaining the computation time for the sub-model on the training node to execute a computation task and the communication time to execute a communication task;
deriving a number of delay steps from the computation time and the communication time;
obtaining a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of a data sample, and the number of training nodes, wherein the size of the data sample comprises the total size of the data sample and the size of the sub-data into which the data sample is split;
generating a scheduling instruction sequence for the plurality of computation tasks and the plurality of communication tasks according to the scheduling scheme;
and controlling the training node, according to the scheduling instruction sequence, to train the sub-models deployed on it using the data samples.
2. The method according to claim 1, wherein the deriving a number of delay steps from the computation time and the communication time comprises:
comparing the computation time with the communication time;
if the communication time is less than the computation time, setting the number of delay steps to a first number of delay steps, and otherwise setting it to a second number of delay steps, wherein the first number of delay steps is less than the second number of delay steps.
3. The method according to claim 2, wherein the first number of delay steps is 2 and the second number of delay steps is 3.
4. The method according to any one of claims 1-3, wherein the obtaining a scheduling scheme for a plurality of computation tasks and a plurality of communication tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes comprises:
obtaining the plurality of computation tasks executed on the training nodes according to the number of delay steps, the size of the data sample, and the number of training nodes, and ordering the computation tasks;
setting a communication task for each of the plurality of computation tasks to obtain an ordering of the plurality of computation tasks and the plurality of communication tasks;
and obtaining the scheduling scheme for the plurality of computation tasks and the plurality of communication tasks according to that ordering.
5. The method according to claim 4, wherein the generating a scheduling instruction sequence for the plurality of computation tasks and the plurality of communication tasks according to the scheduling scheme comprises:
converting each of the plurality of computation tasks and communication tasks into a scheduling instruction according to their ordering in the scheduling scheme,
and obtaining the scheduling instruction sequence from the instructions so converted.
6. The method according to claim 4, wherein the setting a communication task for each of the plurality of computation tasks to obtain the ordering of the plurality of computation tasks and the plurality of communication tasks comprises:
judging whether the communication task has a dependency relationship with the computation task;
if the communication task and the computation task have no dependency relationship, arranging them in parallel;
and if they have a dependency relationship, arranging them in series.
7. The method according to claim 6, wherein the judging whether the communication task has a dependency relationship with the computation task comprises:
if the data to be communicated by the communication task is generated by executing the computation task, or the computation task is executed on the data received by the communication task, the communication task and the computation task have the dependency relationship; otherwise they do not.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being configured to implement the method of allocating computational resources according to any one of claims 1 to 7 when executing the program.
9. A computer-readable storage medium, having stored thereon a computer program for implementing the method of computing resource allocation according to any one of claims 1-7.
10. A computer program product, characterized in that a computer program is stored thereon for implementing a method of allocating computing resources according to any one of claims 1-7.
CN202210888391.XA 2022-07-26 2022-07-26 Computing resource allocation method, electronic device, storage medium, and program product Pending CN115437760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210888391.XA CN115437760A (en) 2022-07-26 2022-07-26 Computing resource allocation method, electronic device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210888391.XA CN115437760A (en) 2022-07-26 2022-07-26 Computing resource allocation method, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN115437760A true CN115437760A (en) 2022-12-06

Family

ID=84241595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210888391.XA Pending CN115437760A (en) 2022-07-26 2022-07-26 Computing resource allocation method, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN115437760A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796284A (en) * 2023-02-08 2023-03-14 苏州浪潮智能科技有限公司 Inference method, inference device, storage medium and equipment based on TVM compiler
CN116956756A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Model deployment method, task processing method, device, equipment and storage medium
CN116956756B (en) * 2023-09-21 2024-02-09 浪潮电子信息产业股份有限公司 Model deployment method, task processing method, device, equipment and storage medium
CN117608866A (en) * 2024-01-24 2024-02-27 山东博商缘信息科技发展有限公司 Data collaborative processing method and system based on large model
CN117608866B (en) * 2024-01-24 2024-05-03 山东博商缘信息科技发展有限公司 Data collaborative processing method and system based on large model

Similar Documents

Publication Publication Date Title
CN109993299B (en) Data training method and device, storage medium and electronic device
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN111756812B (en) Energy consumption perception edge cloud cooperation dynamic unloading scheduling method
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
US10884795B2 (en) Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
WO2024060789A1 (en) Intelligent computing-oriented method, system and apparatus for scheduling distributed training tasks
Yu et al. Gillis: Serving large neural networks in serverless functions with automatic model partitioning
US11055139B2 (en) Smart accelerator allocation and reclamation for deep learning jobs in a computing cluster
JPH05290005A (en) Load uniformalizing method for parallel processing
CN115543639A (en) Optimization method for distributed execution of deep learning task and distributed system
CN112416585A (en) GPU resource management and intelligent scheduling method for deep learning
US20240111586A1 (en) Multi-policy intelligent scheduling method and apparatus oriented to heterogeneous computing power
CN115994567B (en) Asynchronous scheduling method for parallel computing tasks of deep neural network model
CN113472597B (en) Distributed convolutional neural network fine-grained parameter transmission scheduling method and device
US11579924B2 (en) Scheduling artificial intelligence model partitions based on reversed computation graph
Dong et al. Characterizing the microarchitectural implications of a convolutional neural network (cnn) execution on gpus
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
WO2024041400A1 (en) Model training task scheduling method and apparatus, and electronic device
Wang et al. Auto-MAP: A DQN framework for exploring distributed execution plans for DNN workloads
CN113076181B (en) Data processing flow optimization method, system and storage medium
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
Lößer et al. Bottlemod: Modeling data flows and tasks for fast bottleneck analysis
Li et al. Predicting throughput of distributed stochastic gradient descent

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination