WO2024041400A1 - Scheduling method and apparatus for model training tasks, and electronic device - Google Patents

Scheduling method and apparatus for model training tasks, and electronic device

Info

Publication number
WO2024041400A1
Authority
WO
WIPO (PCT)
Prior art keywords
model training
task
scheduling
resources
resource
Prior art date
Application number
PCT/CN2023/112568
Other languages
English (en)
French (fr)
Inventor
刘渊强
赵怡浩
彭杨华
朱亦博
Original Assignee
抖音视界有限公司
脸萌有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 抖音视界有限公司 and 脸萌有限公司
Publication of WO2024041400A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present disclosure relates to the field of machine learning technology, and in particular to a scheduling method, device and electronic equipment for model training tasks.
  • Deep learning models have huge differences in model size and type.
  • Various resources may become the bottleneck of a deep learning model training task, so that during training resource utilization is low and training efficiency is difficult to improve.
  • a method that can effectively improve model training efficiency is needed.
  • the present disclosure provides a scheduling method, device and electronic equipment for model training tasks.
  • a method for scheduling model training tasks includes:
  • the target task group includes multiple model training tasks to be processed
  • the task scheduling information includes the processing order of the multiple model training tasks
  • the multiple model training tasks are scheduled to use the multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
  • a scheduling device for model training tasks includes:
  • An acquisition module is used to determine a target task group; the target task group includes multiple model training tasks to be processed;
  • Determining module used to determine task scheduling information; the task scheduling information includes the processing order of the multiple model training tasks;
  • a scheduling module, configured to schedule, based on the task scheduling information, the multiple model training tasks to use multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method described in any one of the above-mentioned first aspects is implemented.
  • an electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method described in any one of the above-mentioned first aspects is implemented.
  • Embodiments of the present disclosure provide a method and device for scheduling model training tasks.
  • by scheduling multiple model training tasks in a task group to multiple model training resources of different types for parallel processing, different model training tasks can use different model training resources at the same time, which avoids competition for model training resources between different model training tasks, improves the utilization of model training resources, and improves the efficiency of model training.
  • Figure 1 is a schematic structural diagram of a model training system according to an exemplary embodiment of the present disclosure
  • Figure 2 is a flow chart of a method for scheduling model training tasks according to an exemplary embodiment of the present disclosure
  • Figure 3A is a flow chart of another method for scheduling model training tasks according to an exemplary embodiment of the present disclosure
  • Figure 3B is a schematic diagram of a scheduling scenario of a model training task according to an exemplary embodiment of the present disclosure
  • Figure 3C is a schematic diagram of a scheduling scenario of another model training task according to an exemplary embodiment of the present disclosure.
  • Figure 4 is a block diagram of a scheduling device for model training tasks according to an exemplary embodiment of the present disclosure
  • Figure 5 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure.
  • Figure 6 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.
  • Figure 7 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
  • first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be called second information, and similarly, the second information may also be called first information.
  • the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • Deep learning models vary greatly in model size and type, and multiple resources may become bottlenecks in deep learning model training tasks.
  • deep learning model training tasks are usually made to exclusively occupy various resources, or only the sharing of GPU resources is considered.
  • When deep learning training uses only GPU resources, or GPU resources are the training bottleneck, schemes that give tasks exclusive resources or that only consider GPU resource sharing can improve the speed of deep learning training (i.e., training throughput) to a certain extent.
  • However, in work that only considers GPU sharing, different model training tasks using the same GPU resources leads to resource contention, which increases resource usage and task completion time and thus reduces the efficiency of model training.
  • the present disclosure provides a scheduling method for model training tasks.
  • by scheduling multiple model training tasks in a task group to multiple model training resources of different types for parallel processing, different model training tasks can use different model training resources at the same time, thereby avoiding competition for model training resources between different model training tasks, improving the utilization of model training resources, and improving the efficiency of model training.
  • Figure 1 is a schematic diagram of the architecture of a model training system according to an exemplary embodiment.
  • the model training system may include a task analysis unit 101, a task scheduling unit 102 and a model training resource 103.
  • the model training resources 103 may include but are not limited to storage resources, CPU resources, GPU resources, network resources, etc.
  • the task analysis unit 101 obtains a task group including multiple model training tasks, and obtains the estimated duration of each model training resource used by each model training task in the task group. Then, the estimated duration of using each model training resource for each model training task in the task group is transmitted to the task scheduling unit 102 together with the task group.
  • the task scheduling unit 102 can sort the model training tasks in the task group to obtain multiple alternative scheduling modes. And based on the estimated duration of each model training resource used by each model training task, the best target scheduling mode is selected from the alternative scheduling modes. According to the target scheduling mode, the model training tasks are respectively scheduled to different model training resources in the training resources 103.
  • the task group includes task A and task B, and the target scheduling mode indicates that task A is sequenced before task B.
  • the task scheduling unit 102 first schedules task A to the storage resource in the model training resources 103.
  • After task A finishes using the storage resource, the model training resources 103 return the result obtained by processing task A to the task scheduling unit 102.
  • Based on that result, the task scheduling unit 102 then schedules task A to the CPU resource and, at the same time, schedules task B to the storage resource; while task A uses the CPU resource, task B uses the storage resource in parallel.
  • After the results of processing task A and processing task B have both been returned, the task scheduling unit 102 schedules task A to the GPU resource in the model training resources 103 based on the result obtained from processing task A, and schedules task B to the CPU resource in the model training resources 103 based on the result obtained from processing task B. Then, while task A uses the GPU resource, task B uses the CPU resource in parallel. Subsequent steps can be deduced in the same way and are not repeated here.
  • Figure 2 is a flowchart of a method for scheduling model training tasks according to an exemplary embodiment.
  • the execution subject of this method can be implemented as any device, platform, server or device cluster with computing and processing capabilities.
  • the method may include the following steps:
  • step 201 a target task group is determined.
  • a target task group may be obtained, which includes multiple model training tasks to be processed.
  • the model training tasks may be training tasks involving various deep learning models.
  • the model involved can be a convolutional neural network CNN, a deep reinforcement learning network DRN, or a deep interest network DIN, etc. It can be understood that this embodiment does not limit the specific type of the model.
  • multiple model training tasks can be randomly obtained from the task pool to form a target task group.
  • A preset algorithm may also be used to analyze and combine the model training tasks in the task pool, thereby obtaining a target task group including multiple model training tasks. It can be understood that the target task group may also be obtained in any other reasonable manner; this embodiment does not limit the specific way of obtaining the target task group.
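  • As a minimal illustration of the random-grouping option, the sketch below partitions a hypothetical task pool into target task groups no larger than the number of resource types; the helper name form_task_groups and the task names are assumptions made for this example, not part of the disclosure.

```python
import random

def form_task_groups(task_pool, num_resource_types, seed=None):
    """Randomly partition a pool of model training tasks into target task
    groups whose size never exceeds the number of resource types
    (storage, CPU, GPU, network, ...), as this embodiment requires."""
    rng = random.Random(seed)
    tasks = list(task_pool)
    rng.shuffle(tasks)
    # Slice the shuffled pool into consecutive groups of at most
    # `num_resource_types` tasks each.
    return [tasks[i:i + num_resource_types]
            for i in range(0, len(tasks), num_resource_types)]

if __name__ == "__main__":
    pool = ["cnn_job", "drn_job", "din_job", "bert_job", "gnn_job"]
    for group in form_task_groups(pool, num_resource_types=4, seed=0):
        print(group)
```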
  • step 202 multiple model training tasks in the target task group are scheduled to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time.
  • multiple model training tasks in the target task group can be simultaneously scheduled to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time.
  • multiple model training resources of different types may include but are not limited to storage resources, CPU resources, GPU resources, network resources, etc.
  • the number of model training tasks in the target task group should be less than or equal to the number of model training resources.
  • task scheduling information may be determined first.
  • the task scheduling information may include the processing order of multiple model training tasks in the target task group.
  • based on the task scheduling information, the multiple model training tasks may be respectively scheduled to the multiple model training resources, so that different model training tasks use different model training resources at the same time.
  • the model training process can be divided into multiple training stages, and each model training task is scheduled once in each training stage. At the beginning of each training phase, different model training tasks are scheduled to different model training resources. When the processing results of each model training task are returned, the current training phase is completed and the next training phase is entered. For the same model training resource, the model training task uses the model training resource in different training stages according to the processing sequence included in the task scheduling information.
  • the target task group includes task A, task B, and task C
  • the model training resources include resource 1, resource 2, and resource 3.
  • the task processing order included in the task scheduling information is task B, task A, and task C.
  • task B can be scheduled to resource 1 first.
  • task B is scheduled to resource 2 based on result B1
  • task A is scheduled to resource 1 at the same time.
  • After the result B2 obtained by task B using resource 2 and the result A1 obtained by task A using resource 1 are both returned, task B is scheduled to resource 3 based on result B2 and task A is scheduled to resource 2 based on result A1; at the same time, task C is scheduled to resource 1.
  • After the result B3 obtained by task B using resource 3, the result A2 obtained by task A using resource 2, and the result C1 obtained by task C using resource 1 are all returned, task B is scheduled to resource 1 based on result B3, task A is scheduled to resource 3 based on result A2, and task C is scheduled to resource 2 based on result C1.
  • Training then enters a cyclic, iterative process, in which scheduling each model training task once corresponds to one training stage.
  • For example, in training stage a, task B is scheduled to resource 1, task A is scheduled to resource 3, and task C is scheduled to resource 2.
  • After the results obtained by the three tasks in stage a are all returned, training stage a ends and training stage b is entered.
  • In training stage b, task B is scheduled to resource 2, task A is scheduled to resource 1, and task C is scheduled to resource 3. The subsequent process follows by analogy and is not repeated here.
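  • One way to realize this stage-by-stage rotation is sketched below; the assignment rule resources[(stage - rank) % n] is an assumption that merely reproduces the stage a and stage b example above (B to resource 1, A to resource 3, C to resource 2, then B to resource 2, A to resource 1, C to resource 3), and the sketch models only the steady-state cycle, not the initial ramp-up.

```python
def stage_assignments(task_order, resources, num_stages):
    """Yield, for each training stage, a mapping task -> resource such that
    in every stage all tasks use distinct resources and, over len(resources)
    consecutive stages, each task visits every resource exactly once."""
    n = len(resources)
    assert len(task_order) <= n, "group size must not exceed resource count"
    for stage in range(num_stages):
        yield {task: resources[(stage - rank) % n]
               for rank, task in enumerate(task_order)}

if __name__ == "__main__":
    order = ["task_B", "task_A", "task_C"]            # processing order
    res = ["resource_1", "resource_2", "resource_3"]
    for s, assignment in enumerate(stage_assignments(order, res, 3)):
        print(f"stage {s}: {assignment}")
```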
  • multiple model training tasks can be scheduled to multiple model training resources of different types through the same process, so as to reduce the additional overhead of model training task scheduling by merging execution environments.
  • Further optionally, the multiple model training resources include GPU resources.
  • Different model training tasks can use GPU resources through the context of the same unified computing device architecture CUDA. Since GPU resources are used in the same CUDA context, the overhead of switching CUDA contexts can be eliminated and execution efficiency is improved.
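  • The disclosure only states that different tasks share one CUDA context when they run in the same process; as a rough, non-authoritative illustration of that idea, the PyTorch sketch below runs two hypothetical training steps inside a single process, each on its own CUDA stream, so both operate within the process's single CUDA context. The models, shapes, and thread structure are assumptions for the example, not the disclosed implementation.

```python
import threading
import torch

def train_step(stream, model, batch):
    # Each task issues its GPU work on its own stream, but both streams
    # belong to the one CUDA context owned by this single process.
    with torch.cuda.stream(stream):
        loss = model(batch).sum()
        loss.backward()

if __name__ == "__main__" and torch.cuda.is_available():
    device = torch.device("cuda:0")
    model_a = torch.nn.Linear(128, 10).to(device)
    model_b = torch.nn.Linear(128, 10).to(device)
    batch = torch.randn(32, 128, device=device)
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
    threads = [
        threading.Thread(target=train_step, args=(stream_a, model_a, batch)),
        threading.Thread(target=train_step, args=(stream_b, model_b, batch)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    torch.cuda.synchronize()
```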
  • the present disclosure provides a scheduling method for model training tasks.
  • by scheduling multiple model training tasks in a task group to multiple model training resources of different types for parallel processing, different model training tasks can use different model training resources at the same time, thereby avoiding competition for model training resources between different model training tasks, improving the utilization of model training resources, and improving the efficiency of model training.
  • Figure 3A is a flow chart of another method for scheduling model training tasks according to an exemplary embodiment. This embodiment describes a process of determining task scheduling information, including the following steps:
  • step 301 multiple alternative scheduling modes are determined.
  • different scheduling modes correspond to different processing orders of model training tasks, and multiple alternative scheduling modes can be determined through enumeration.
  • the target task group includes task A, task B, and task C
  • the model training resources include resource 1, resource 2, and resource 3.
  • the scheduling modes M1 and M2 can be obtained through enumeration.
  • the processing sequence corresponding to the scheduling mode M1 is task A, task B, and task C
  • the processing sequence corresponding to the scheduling mode M2 is task A, task C, and task B.
  • It should be noted that, because the model training process is a cyclic, iterative process, the scheduling mode corresponding to the order A-B-C is the same as the scheduling modes corresponding to the orders B-C-A and C-A-B.
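  • A minimal enumeration sketch is shown below, assuming that orders which are rotations of one another count as the same scheduling mode, so the first task can be fixed; the helper name alternative_orders is hypothetical.

```python
from itertools import permutations

def alternative_orders(tasks):
    """Enumerate candidate processing orders. Because training iterates
    cyclically, rotations of the same order (e.g. ABC, BCA, CAB) describe
    the same scheduling mode, so the first task is held fixed."""
    first, rest = tasks[0], tasks[1:]
    return [(first, *perm) for perm in permutations(rest)]

if __name__ == "__main__":
    print(alternative_orders(["A", "B", "C"]))
    # [('A', 'B', 'C'), ('A', 'C', 'B')]  -> the modes M1 and M2 in the text
```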
  • step 302 the reference index corresponding to each alternative scheduling mode and related to the usage efficiency of model training resources is estimated. And, in step 303, select a target scheduling mode from multiple alternative scheduling modes according to the reference index, and determine task scheduling information based on the target scheduling mode.
  • Because the duration for which each model training task uses each model training resource differs, the usage efficiency of the model training resources also differs considerably under different scheduling modes. Figure 3B and Figure 3C are schematic diagrams of one iteration of model training tasks A, B, and C using model training resources 1, 2, and 3 under two scheduling modes.
  • the horizontal axis represents time
  • the length of the rectangle in the horizontal axis direction represents the length of time the model training task uses model training resources
  • the number in the rectangle represents the model training resources used by the model training task.
  • task A is scheduled to resource 1, and the duration of task A using resource 1 is (t2-t1).
  • Task B is scheduled to resource 2, and the duration of task B using resource 2 is (t2-t1)/2.
  • Task C is scheduled to resource 3, and the duration of task C using resource 3 is also (t2-t1)/2.
  • the nth training stage is entered.
  • Task A is scheduled to resource 2.
  • the duration of task A using resource 2 is (t3-t2)/2.
  • Task B is scheduled to resource 3, and the duration of task B using resource 3 is (t3-t2).
  • Task C is scheduled to resource 1, and the duration of task C using resource 1 is also (t3-t2)/2.
  • the subsequent process is analogous, and after t4, the next iteration process is entered.
  • task A is scheduled to the resource 1.
  • the duration that task A uses resource 1 is (t6-t5).
  • Task B is scheduled to resource 3, and the duration of task B using resource 3 is also (t6-t5).
  • Task C is scheduled to resource 2, and the duration of task C using resource 2 is also (t6-t5).
  • the nth training stage is entered.
  • Task A is scheduled to resource 2.
  • the duration of task A using resource 2 is (t7-t6)/2.
  • Task B is scheduled to resource 1, and the duration that task B uses resource 1 is also (t7-t6)/2.
  • Task C is scheduled to resource 3, and the time that task C uses resource 3 is also (t7-t6)/2.
  • the subsequent process is analogous, and after t8, the next iteration process is entered. Therefore, by comparing Figure 3B and Figure 3C, it can be seen that under the scheduling mode shown in Figure 3C, the utilization rate of model training resources is higher.
  • the reference index corresponding to each alternative scheduling mode can be estimated, and the reference index is related to the usage efficiency of model training resources. Then, according to the reference index, the scheduling mode with the highest usage efficiency of model training resources is selected from the alternative scheduling modes as the target scheduling mode.
  • the first estimated duration of using each model training resource for each model training task can be obtained.
  • the first estimated duration of each model training resource used by each model training task can be calculated directly through the preset algorithm.
  • Optionally, when conditions such as the model type, hyperparameters, and device configuration do not change much, the duration for which any model training task uses any model training resource also does not change much. Therefore, the durations for which some model training tasks use each model training resource under certain conditions can be stored in advance.
  • the first estimated duration of the model training task using the model training resource can be searched from a pre-stored database. If the first estimated duration is not recorded in the pre-stored data, the first estimated duration is obtained through analysis and calculation based on the model training resources and the model training task.
  • a pre-deployed model performance analysis tool can be used to calculate the first estimated time duration of the model training task using the model training resources.
  • The first estimated duration obtained through analysis and calculation can optionally be stored in the database, so that the first estimated duration for which the model training task uses the model training resource can later be obtained directly from the database. Because this embodiment pre-stores in the database the durations for which some model training tasks use each model training resource under certain conditions, the computing overhead of analyzing and calculating the first estimated duration is reduced when obtaining it.
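  • A minimal sketch of this lookup-then-profile caching is given below, assuming the profiler is supplied by the caller (for example, a wrapper around a model performance analysis tool); the class name DurationStore and the sample durations are illustrative assumptions rather than part of the disclosure.

```python
class DurationStore:
    """Cache of first estimated durations keyed by (task signature,
    resource type). Looks up a pre-stored value first and falls back to a
    caller-supplied profiling/estimation function, storing the result."""

    def __init__(self, profile_fn):
        self._profile_fn = profile_fn   # e.g. wraps a model profiling tool
        self._store = {}                # could equally be a database table

    def first_estimated_duration(self, task_signature, resource_type):
        key = (task_signature, resource_type)
        if key not in self._store:                  # not pre-stored yet
            self._store[key] = self._profile_fn(task_signature, resource_type)
        return self._store[key]

if __name__ == "__main__":
    # A stand-in profiler; a real deployment would analyse the model here.
    fake_profile = lambda task, res: {"storage": 1.0, "cpu": 2.0,
                                      "gpu": 4.0, "network": 1.5}[res]
    store = DurationStore(fake_profile)
    print(store.first_estimated_duration("cnn/batch64", "gpu"))   # profiled
    print(store.first_estimated_duration("cnn/batch64", "gpu"))   # cached
```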
  • the reference indicators corresponding to each alternative scheduling mode can be estimated based on the first estimated duration of each model training resource used by each model training task.
  • The reference index may be any of various indices related to the usage efficiency of the model training resources. Specifically, based on the first estimated duration for which each model training task uses each model training resource, a second estimated duration of one iteration corresponding to each alternative scheduling mode can be calculated, and the reference index corresponding to each alternative scheduling mode can be determined based on that second estimated duration.
  • For any model training task, one iteration corresponding to any alternative scheduling mode includes the stages in which that model training task uses each of the model training resources.
  • Figure 3B and Figure 3C each show an iterative process corresponding to different scheduling modes.
  • The second estimated duration of one iteration corresponding to each alternative scheduling mode can be obtained through simulation.
  • It can also be obtained by calculation. Specifically, for any alternative scheduling mode, the longest duration for which a model training resource is used in each training stage of one iteration corresponding to that scheduling mode can be summed over the stages to obtain the second estimated duration corresponding to that scheduling mode.
  • For example, referring to Figure 3B, in one iteration under that scheduling mode, task A uses resource 1 for the longest time in stage n-1, which is (t2-t1); task B uses resource 3 for the longest time in stage n, which is (t3-t2); and task C uses resource 2 for the longest time in stage n+1, which is (t4-t3). Therefore, (t2-t1), (t3-t2), and (t4-t3) are added to obtain the second estimated duration corresponding to this scheduling mode, which is (t4-t1).
  • Because the duration of one iteration is negatively correlated with the usage efficiency of the model training resources, the usage efficiency corresponding to each alternative scheduling mode can be determined based on the second estimated duration of one iteration corresponding to that mode.
  • The usage efficiency of the model training resources corresponding to any alternative scheduling mode can be obtained as follows: the sum of the first estimated durations for which each model training task uses each model training resource is divided by the second estimated duration of one iteration corresponding to that scheduling mode, and then divided by the number of model training resources.
  • the usage efficiency of model training resources corresponding to each alternative scheduling method can be used as a reference indicator corresponding to the alternative scheduling method.
  • For example, under the scheduling mode of Figure 3B, the second estimated duration of one iteration is (t4-t1) and the number of model training resources is 3.
  • The sum of the first estimated durations for which each model training task uses each model training resource is (t2-t1)+(t2-t1)/2+(t2-t1)/2+(t3-t2)/2+(t3-t2)+(t3-t2)/2+(t4-t3)/2+(t4-t3)/2+(t4-t3) = 2(t4-t1), so the usage efficiency of the model training resources corresponding to this scheduling mode is 2(t4-t1) / ((t4-t1) × 3) = 2/3.
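  • The sketch below reproduces this calculation under the assumption that, as in Figure 3B, (t2-t1) = (t3-t2) = (t4-t3) = 2 time units and that tasks rotate over the resources stage by stage; with those assumed numbers it recovers the 2/3 efficiency above and shows that reordering the tasks can raise the efficiency to 1, matching Figure 3C. The duration table and helper names are assumptions used only for illustration.

```python
from itertools import permutations

# First estimated durations read off Figure 3B, assuming each stage shown
# there lasts 2 time units (an assumption made only to reproduce 2/3).
DURATIONS = {
    "A": {"r1": 2.0, "r2": 1.0, "r3": 1.0},
    "B": {"r1": 1.0, "r2": 1.0, "r3": 2.0},
    "C": {"r1": 1.0, "r2": 2.0, "r3": 1.0},
}
RESOURCES = ["r1", "r2", "r3"]

def iteration_duration(order, durations, resources):
    """Second estimated duration: sum over stages of the longest resource
    use in that stage, with tasks rotating over resources stage by stage."""
    n = len(resources)
    return sum(
        max(durations[task][resources[(rank + stage) % n]]
            for rank, task in enumerate(order))
        for stage in range(n)
    )

def usage_efficiency(order, durations, resources):
    total_busy = sum(sum(per_res.values()) for per_res in durations.values())
    return total_busy / (iteration_duration(order, durations, resources)
                         * len(resources))

def target_mode(durations, resources):
    """Pick the processing order with the highest usage efficiency,
    fixing the first task because rotations are equivalent."""
    tasks = sorted(durations)
    candidates = [(tasks[0], *p) for p in permutations(tasks[1:])]
    return max(candidates,
               key=lambda o: usage_efficiency(o, durations, resources))

if __name__ == "__main__":
    print(usage_efficiency(("A", "B", "C"), DURATIONS, RESOURCES))  # ~0.667
    print(target_mode(DURATIONS, RESOURCES))  # ('A', 'C', 'B'), efficiency 1.0
```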
  • the second estimated duration of an iterative process corresponding to each alternative scheduling mode may also be directly used as the reference index corresponding to the alternative scheduling mode. Since the duration of an iterative process is negatively related to the usage efficiency of model training resources, the smaller the second estimated duration is, the higher the usage efficiency of model training resources is.
  • This embodiment determines multiple alternative scheduling modes, estimates the reference index corresponding to each scheduling mode, and selects a target scheduling mode from the multiple alternative scheduling modes based on the reference index in order to determine the task scheduling information. Since the reference index is related to the usage efficiency of the model training resources, this embodiment fully considers that usage efficiency when determining the task scheduling information and selects the scheduling mode that maximizes it to schedule the model training tasks, thereby further improving the utilization of the model training resources and the efficiency of model training.
  • the present disclosure also provides embodiments of a scheduling device for model training tasks.
  • Figure 4 is a block diagram of a scheduling device for a model training task according to an exemplary embodiment of the present disclosure.
  • the device may include: an acquisition module 401, a determination module 402 and a scheduling module 403.
  • the acquisition module 401 is used to determine a target task group, which includes multiple model training tasks to be processed.
  • Determining module 402 is used to determine task scheduling information.
  • the task scheduling information includes the processing order of multiple model training tasks.
  • the scheduling module 403 is used to schedule multiple model training tasks to use multiple model training resources in parallel based on task scheduling information, so that different model training tasks use different model training resources at the same time.
  • the scheduling module 403 is configured to: for any model training resource, schedule multiple model training tasks to use the model training resource according to the above-mentioned processing sequence included in the task scheduling information. Among them, multiple model training tasks are scheduled according to training stages, and each model training task is scheduled once in each training stage.
  • the determination module 402 may include: an alternative sub-module, an estimation sub-module and a selection sub-module (not shown in the figure).
  • the alternative sub-module is used to determine multiple alternative scheduling modes.
  • the estimation sub-module is used to estimate the reference indicators corresponding to each scheduling mode.
  • the reference indicators are related to the usage efficiency of model training resources.
  • the selection submodule is used to select a target scheduling mode from multiple alternative scheduling modes according to the reference index, and determine task scheduling information based on the target scheduling mode.
  • the selection sub-module is configured to: select the scheduling mode with the highest usage efficiency of model training resources from multiple alternative scheduling modes as the target scheduling mode according to the reference index.
  • the estimation sub-module is configured to: determine a first estimated duration of each model training task using each model training resource. According to the above-mentioned first estimated duration, the reference indicators corresponding to each alternative scheduling mode are estimated.
  • For any model training resource and any model training task, the estimation sub-module determines the first estimated duration for which the model training task uses the model training resource in the following manner: looking up, in pre-stored data, the first estimated duration for which the model training task uses the model training resource; and, if it is not found, calculating the first estimated duration based on the model training resource and the model training task.
  • For any alternative scheduling mode, the estimation sub-module estimates the reference index corresponding to the alternative scheduling mode in the following manner: calculating, based on the above-mentioned first estimated durations, a second estimated duration of one iteration corresponding to the alternative scheduling mode, and determining the reference index corresponding to the alternative scheduling mode based on the second estimated duration.
  • the number of model training tasks included in the target task group is less than or equal to the number of different types of model training resources.
  • multiple model training tasks are scheduled to multiple model training resources of different types through the same process.
  • multiple model training resources include GPU resources, and different model training tasks use GPU resources through the context of the same unified computing device architecture CUDA.
  • Since the device embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purposes of the embodiments of the present disclosure. Persons of ordinary skill in the art can understand and implement this without creative effort.
  • FIG. 5 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure.
  • the electronic device 910 includes a processor 911 and a memory 912, which can be used to implement a client or a server.
  • The memory 912 is used to store computer-executable instructions (e.g., one or more computer program modules) non-transitorily.
  • The processor 911 is configured to run the computer-executable instructions; when run by the processor 911, they can perform one or more steps of the scheduling method for model training tasks described above, thereby implementing that method.
  • Memory 912 and processor 911 may be interconnected by a bus system and/or other forms of connection mechanisms (not shown).
  • the processor 911 may be a central processing unit (CPU), a graphics processing unit (GPU), or other forms of processing units with data processing capabilities and/or program execution capabilities.
  • the central processing unit (CPU) may be of X86 or ARM architecture.
  • the processor 911 may be a general-purpose processor or a special-purpose processor and may control other components in the electronic device 910 to perform desired functions.
  • memory 912 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like.
  • One or more computer program modules may be stored on a computer-readable storage medium, and the processor 911 may run one or more computer program modules to implement various functions of the electronic device 910 .
  • Various application programs and various data, as well as various data used and/or generated by the application programs, etc. can also be stored in the computer-readable storage medium.
  • FIG. 6 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.
  • the electronic device 920 is, for example, suitable for implementing the scheduling method of model training tasks provided by embodiments of the present disclosure.
  • the electronic device 920 may be a terminal device or the like, and may be used to implement a client or a server.
  • the electronic device 920 may include, but is not limited to, a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (such as a vehicle-mounted navigation terminal), Mobile terminals such as wearable electronic devices and fixed terminals such as digital TVs, desktop computers, smart home devices, etc.
  • the electronic device 920 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 921, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 922 or a program loaded from a storage device 928 into a random access memory (RAM) 923.
  • In the RAM 923, various programs and data required for the operation of the electronic device 920 are also stored.
  • the processing device 921, ROM 922 and RAM 923 are connected to each other through a bus 924.
  • An input/output (I/O) interface 925 is also connected to bus 924.
  • the following devices may be connected to the I/O interface 925: an input device 926 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 927 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 928 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 929.
  • the communication device 929 may allow the electronic device 920 to communicate wirelessly or wiredly with other electronic devices to exchange data.
  • FIG. 6 illustrates electronic device 920 having various means, it should be understood that implementation or provision of all illustrated means is not required and electronic device 920 may alternatively implement or be provided with more or fewer means.
  • the above-mentioned scheduling method of model training tasks may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium.
  • the computer program includes program code for performing the above-mentioned scheduling method for model training tasks.
  • the computer program may be downloaded and installed from the network via communication device 929, or from storage device 928, or from ROM 922.
  • When the computer program is executed by the processing device 921, the functions defined in the scheduling method for model training tasks provided by the embodiments of the present disclosure can be implemented.
  • Figure 7 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
  • the storage medium 930 may be a non-transitory computer-readable storage medium for storing non-transitory computer-executable instructions 931 .
  • When the non-transitory computer-executable instructions 931 are executed by a processor, the scheduling method for model training tasks described in the embodiments of the present disclosure can be implemented; for example, one or more steps of that scheduling method can be performed.
  • the storage medium 930 may be applied in the above-mentioned electronic device.
  • the storage medium 930 may include a memory in the electronic device.
  • the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, any combination of the above storage media, or other suitable storage media.
  • the description of the storage medium 930 may refer to the description of the memory in the embodiment of the electronic device, and repeated descriptions will not be repeated.
  • the specific functions and technical effects of the storage medium 930 please refer to the above description of the scheduling method of the model training task, which will not be described again here.
  • a computer-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but is not limited to: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any combination thereof.
  • Computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein.
  • Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device .
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a scheduling method and apparatus for model training tasks, and an electronic device. One specific embodiment of the method includes: determining a target task group, the target task group including multiple model training tasks to be processed; determining task scheduling information, the task scheduling information including a processing order of the multiple model training tasks; and, based on the task scheduling information, scheduling the multiple model training tasks to use multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time. This embodiment avoids competition for model training resources between different model training tasks, improves the utilization of model training resources, and improves the efficiency of model training.

Description

Scheduling method and apparatus for model training tasks, and electronic device
Cross-Reference to Related Applications
This application claims priority to Chinese invention patent application No. 202211001696.0, entitled "Scheduling method and apparatus for model training tasks, and electronic device" and filed on August 20, 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to the field of machine learning technology, and in particular to a scheduling method and apparatus for model training tasks, and an electronic device.
Background
With the continuous development of artificial intelligence technology, deep learning has been widely applied in various fields, and training deep learning models has become an important task. Multiple kinds of resources are needed during model training. Deep learning models differ greatly in size and type, and any of these resources may become the bottleneck of a deep learning model training task, so that resource utilization is low during training and training efficiency is difficult to improve. At present, a method that can effectively improve model training efficiency is needed.
Summary
The present disclosure provides a scheduling method and apparatus for model training tasks, and an electronic device.
According to a first aspect, a scheduling method for model training tasks is provided, the method including:
determining a target task group, the target task group including multiple model training tasks to be processed;
determining task scheduling information, the task scheduling information including a processing order of the multiple model training tasks; and
scheduling, based on the task scheduling information, the multiple model training tasks to use the multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
According to a second aspect, a scheduling apparatus for model training tasks is provided, the apparatus including:
an acquisition module, configured to determine a target task group, the target task group including multiple model training tasks to be processed;
a determining module, configured to determine task scheduling information, the task scheduling information including a processing order of the multiple model training tasks; and
a scheduling module, configured to schedule, based on the task scheduling information, the multiple model training tasks to use the multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
According to a third aspect, a computer-readable storage medium is provided, the storage medium storing a computer program which, when executed by a processor, implements the method described in any one of the first aspect.
According to a fourth aspect, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the method described in any one of the first aspect when executing the program.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
The scheduling method and apparatus for model training tasks provided by the embodiments of the present disclosure schedule multiple model training tasks in a task group to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time, thereby avoiding competition for model training resources between different model training tasks, improving the utilization of model training resources, and improving the efficiency of model training.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
In order to describe the technical solutions of the embodiments of this specification more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic structural diagram of a model training system according to an exemplary embodiment of the present disclosure;
Figure 2 is a flowchart of a scheduling method for model training tasks according to an exemplary embodiment of the present disclosure;
Figure 3A is a flowchart of another scheduling method for model training tasks according to an exemplary embodiment of the present disclosure;
Figure 3B is a schematic diagram of a scheduling scenario of a model training task according to an exemplary embodiment of the present disclosure;
Figure 3C is a schematic diagram of another scheduling scenario of a model training task according to an exemplary embodiment of the present disclosure;
Figure 4 is a block diagram of a scheduling apparatus for model training tasks according to an exemplary embodiment of the present disclosure;
Figure 5 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure;
Figure 6 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure;
Figure 7 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some of the embodiments of this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this specification.
When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terms used in the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various kinds of information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
With the continuous development of artificial intelligence technology, deep learning has been widely applied in various fields, and training deep learning models has become an important task. Multiple kinds of resources are needed during model training. For example, during one iteration of model training, the following stages need to be completed in sequence: reading the training data (using storage resources); preprocessing the data and performing the simulation operations in reinforcement learning (using CPU resources); the forward and backward propagation passes (using GPU resources); and gradient synchronization between workers in distributed training (using network resources), among others.
Deep learning models differ greatly in size and type, and any of these resources may become the bottleneck of a deep learning model training task. At present, in the related art, deep learning model training tasks are usually made to occupy the various resources exclusively, or only the sharing of GPU resources is considered. When deep learning training uses only GPU resources, or GPU resources are the training bottleneck, exclusive-resource schemes or resource allocation schemes that only consider GPU sharing can improve training speed (i.e., training throughput) to a certain extent. However, in work that only considers GPU sharing, different model training tasks using the same GPU resources leads to resource contention, which increases resource usage and task completion time and thereby reduces the efficiency of model training.
The scheduling method for model training tasks provided by the present disclosure schedules multiple model training tasks in a task group to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time, thereby avoiding competition for model training resources between different model training tasks, improving the utilization of model training resources, and improving the efficiency of model training.
Referring to Figure 1, Figure 1 is a schematic structural diagram of a model training system according to an exemplary embodiment.
As shown in Figure 1, the model training system may include a task analysis unit 101, a task scheduling unit 102, and model training resources 103. The model training resources 103 may include, but are not limited to, storage resources, CPU resources, GPU resources, network resources, and the like. Specifically, the task analysis unit 101 first obtains a task group including multiple model training tasks, and obtains the estimated duration for which each model training task in the task group uses each model training resource. The estimated durations are then transmitted, together with the task group, to the task scheduling unit 102.
The task scheduling unit 102 can sort the model training tasks in the task group to obtain multiple alternative scheduling modes and, according to the estimated duration for which each model training task uses each model training resource, select the best target scheduling mode from the alternative scheduling modes. According to the target scheduling mode, the model training tasks are respectively scheduled to different model training resources in the model training resources 103.
For example, suppose the task group includes task A and task B, and the target scheduling mode indicates that task A is ordered before task B. After model training starts, the task scheduling unit 102 first schedules task A to the storage resource in the model training resources 103. After task A finishes using the storage resource, the model training resources 103 return the result obtained by processing task A to the task scheduling unit 102. Based on that result, the task scheduling unit 102 then schedules task A to the CPU resource and, at the same time, schedules task B to the storage resource. While task A uses the CPU resource, task B uses the storage resource in parallel. After the model training resources 103 have returned both the result of processing task A and the result of processing task B to the task scheduling unit 102, the task scheduling unit 102 schedules task A to the GPU resource in the model training resources 103 based on the result obtained from processing task A, and schedules task B to the CPU resource in the model training resources 103 based on the result obtained from processing task B. Then, while task A uses the GPU resource, task B uses the CPU resource in parallel. Subsequent steps can be deduced in the same way and are not repeated here.
The present disclosure is described in detail below with reference to specific embodiments.
Figure 2 is a flowchart of a scheduling method for model training tasks according to an exemplary embodiment. The method may be performed by any device, platform, server, or device cluster with computing and processing capabilities, and may include the following steps.
As shown in Figure 2, in step 201, a target task group is determined.
In this embodiment, a target task group may be obtained, the target task group including multiple model training tasks to be processed. The model training tasks may be training tasks involving various deep learning models. For example, the model involved may be a convolutional neural network (CNN), a deep reinforcement learning network (DRN), a deep interest network (DIN), or the like. It can be understood that this embodiment does not limit the specific type of the model.
In one implementation, multiple model training tasks may be obtained randomly from a task pool to form the target task group. In another implementation, a preset algorithm may be used to analyze and combine the model training tasks in the task pool, thereby obtaining a target task group including multiple model training tasks. It can be understood that the target task group may also be obtained in any other reasonable manner; this embodiment does not limit the specific way of obtaining the target task group.
In step 202, the multiple model training tasks in the target task group are scheduled to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time.
In this embodiment, the multiple model training tasks in the target task group may be simultaneously scheduled to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time. The multiple model training resources of different types may include, but are not limited to, storage resources, CPU resources, GPU resources, network resources, and the like. In addition, the number of model training tasks in the target task group should be less than or equal to the number of model training resources.
Optionally, in one implementation, task scheduling information may be determined first. The task scheduling information may include the processing order of the multiple model training tasks in the target task group. Based on the task scheduling information, the multiple model training tasks may be respectively scheduled to the multiple model training resources, so that different model training tasks use different model training resources at the same time.
Specifically, the model training process can be divided into multiple training stages, and each model training task is scheduled once in each training stage. At the beginning of each training stage, different model training tasks are scheduled to different model training resources. After the processing results of all the model training tasks have been returned, the current training stage is completed and the next training stage is entered. For the same model training resource, the model training tasks use that model training resource in different training stages according to the processing order included in the task scheduling information.
For example, suppose the target task group includes task A, task B, and task C, the model training resources include resource 1, resource 2, and resource 3, and the task processing order included in the task scheduling information is task B, task A, task C. When training starts, task B may first be scheduled to resource 1. After the result B1 obtained by task B using resource 1 is returned, task B is scheduled to resource 2 based on result B1 and, at the same time, task A is scheduled to resource 1. After the result B2 obtained by task B using resource 2 and the result A1 obtained by task A using resource 1 are both returned, task B is scheduled to resource 3 based on result B2 and task A is scheduled to resource 2 based on result A1; at the same time, task C is scheduled to resource 1. After the result B3 obtained by task B using resource 3, the result A2 obtained by task A using resource 2, and the result C1 obtained by task C using resource 1 are all returned, task B is scheduled to resource 1 based on result B3, task A is scheduled to resource 3 based on result A2, and task C is scheduled to resource 2 based on result C1.
Training then enters a cyclic, iterative process, in which scheduling each model training task once corresponds to one training stage. For example, in training stage a, task B is scheduled to resource 1, task A is scheduled to resource 3, and task C is scheduled to resource 2. After the result B1 obtained by task B using resource 1, the result A3 obtained by task A using resource 3, and the result C2 obtained by task C using resource 2 are all returned, training stage a ends and training stage b is entered. In training stage b, task B is scheduled to resource 2, task A is scheduled to resource 1, and task C is scheduled to resource 3. The subsequent process follows by analogy and is not repeated here.
Optionally, the multiple model training tasks may be scheduled to the multiple model training resources of different types by the same process, so that merging the execution environments reduces the extra overhead of model training task scheduling. Further optionally, the multiple model training resources include GPU resources, and different model training tasks may use the GPU resources through the context of the same unified computing device architecture (CUDA). Because the GPU resources are used within the same CUDA context, the overhead of switching CUDA contexts can be eliminated, improving execution efficiency.
The scheduling method for model training tasks provided by the present disclosure schedules multiple model training tasks in a task group to multiple model training resources of different types for parallel processing, so that different model training tasks use different model training resources at the same time, thereby avoiding competition for model training resources between different model training tasks, improving the utilization of model training resources, and improving the efficiency of model training.
Figure 3A is a flowchart of another scheduling method for model training tasks according to an exemplary embodiment. This embodiment describes the process of determining the task scheduling information, and includes the following steps.
As shown in Figure 3A, in step 301, multiple alternative scheduling modes are determined.
In this embodiment, different scheduling modes correspond to different processing orders of the model training tasks, and multiple alternative scheduling modes can be determined by enumeration. For example, suppose the target task group includes task A, task B, and task C, and the model training resources include resource 1, resource 2, and resource 3. Scheduling modes M1 and M2 can then be obtained by enumeration, where the processing order corresponding to scheduling mode M1 is task A, task B, task C, and the processing order corresponding to scheduling mode M2 is task A, task C, task B. It should be noted that, because the model training process is a cyclic, iterative process, the scheduling mode corresponding to the order A-B-C is the same as the scheduling modes corresponding to the orders B-C-A and C-A-B.
In step 302, a reference index corresponding to each alternative scheduling mode and related to the usage efficiency of the model training resources is estimated. In step 303, a target scheduling mode is selected from the multiple alternative scheduling modes according to the reference index, and the task scheduling information is determined based on the target scheduling mode.
Because the duration for which each model training task uses each model training resource differs, the inventors found that the usage efficiency of the model training resources also differs considerably under different scheduling modes. As shown in Figure 3B and Figure 3C, which are schematic diagrams of one iteration of model training tasks A, B, and C using model training resources 1, 2, and 3 under two scheduling modes, the horizontal axis represents time, the length of a rectangle along the horizontal axis represents the duration for which a model training task uses a model training resource, and the number in a rectangle indicates the model training resource used by that model training task.
As shown in Figure 3B, in one scheduling mode, after the (n-1)-th training stage is entered, task A is scheduled to resource 1, and the duration for which task A uses resource 1 is (t2-t1); task B is scheduled to resource 2, and the duration for which task B uses resource 2 is (t2-t1)/2; task C is scheduled to resource 3, and the duration for which task C uses resource 3 is also (t2-t1)/2. After task A, task B, and task C are all completed, the n-th training stage is entered: task A is scheduled to resource 2, and the duration for which task A uses resource 2 is (t3-t2)/2; task B is scheduled to resource 3, and the duration for which task B uses resource 3 is (t3-t2); task C is scheduled to resource 1, and the duration for which task C uses resource 1 is also (t3-t2)/2. The subsequent process follows by analogy, and after t4 the next iteration is entered.
As shown in Figure 3C, in another scheduling mode, after the (n-1)-th training stage is entered, task A is scheduled to resource 1, and the duration for which task A uses resource 1 is (t6-t5); task B is scheduled to resource 3, and the duration for which task B uses resource 3 is also (t6-t5); task C is scheduled to resource 2, and the duration for which task C uses resource 2 is also (t6-t5). After task A, task B, and task C are all completed, the n-th training stage is entered: task A is scheduled to resource 2, and the duration for which task A uses resource 2 is (t7-t6)/2; task B is scheduled to resource 1, and the duration for which task B uses resource 1 is also (t7-t6)/2; task C is scheduled to resource 3, and the duration for which task C uses resource 3 is also (t7-t6)/2. The subsequent process follows by analogy, and after t8 the next iteration is entered. By comparing Figure 3B and Figure 3C, it can be seen that the utilization of the model training resources is higher under the scheduling mode shown in Figure 3C.
Therefore, the reference index corresponding to each alternative scheduling mode can be estimated, the reference index being related to the usage efficiency of the model training resources. Then, according to the reference index, the scheduling mode with the highest usage efficiency of the model training resources is selected from the alternative scheduling modes as the target scheduling mode.
Specifically, a first estimated duration for which each model training task uses each model training resource can first be obtained. The first estimated duration may be calculated directly by a preset algorithm.
Optionally, when conditions such as the model type, hyperparameters, and device configuration do not change much, the duration for which any model training task uses any model training resource also does not change much. Therefore, the durations for which some model training tasks use each model training resource under certain conditions can be stored in advance. For any model training task, when obtaining the first estimated duration for which that task uses any model training resource, the first estimated duration may first be looked up in a pre-stored database. If the first estimated duration is not recorded in the pre-stored data, it is obtained by analysis and calculation based on the model training resource and the model training task.
For example, a pre-deployed model performance profiling tool may be used to calculate the first estimated duration for which the model training task uses the model training resource. Optionally, the first estimated duration obtained by analysis and calculation may be stored in the database, so that it can later be obtained directly from the database. Because this embodiment pre-stores in the database the durations for which some model training tasks use each model training resource under certain conditions, the computing overhead of analyzing and calculating the first estimated duration is reduced when obtaining it.
Next, the reference index corresponding to each alternative scheduling mode can be estimated based on the first estimated duration for which each model training task uses each model training resource. The reference index may be any of various indices related to the usage efficiency of the model training resources. Specifically, based on the first estimated durations, a second estimated duration of one iteration corresponding to each alternative scheduling mode can be calculated, and the reference index corresponding to each alternative scheduling mode can be determined based on the second estimated duration.
For any model training task, one iteration corresponding to any alternative scheduling mode includes the stages in which that model training task uses each of the model training resources. Referring to Figure 3B and Figure 3C, each figure shows one iteration corresponding to a different scheduling mode.
In one implementation, the second estimated duration of one iteration corresponding to each alternative scheduling mode can be obtained through simulation. In another implementation, it can also be obtained by calculation. Specifically, for any alternative scheduling mode, the longest duration for which a model training resource is used in each training stage of one iteration corresponding to that scheduling mode can be summed over the stages to obtain the second estimated duration corresponding to that scheduling mode.
For example, referring to Figure 3B, in one iteration under the scheduling mode of Figure 3B, task A uses resource 1 for the longest time in stage n-1, which is (t2-t1); task B uses resource 3 for the longest time in stage n, which is (t3-t2); and task C uses resource 2 for the longest time in stage n+1, which is (t4-t3). Therefore, (t2-t1), (t3-t2), and (t4-t3) are added to obtain the second estimated duration corresponding to this scheduling mode, which is (t4-t1).
Because the duration of one iteration is negatively correlated with the usage efficiency of the model training resources, the usage efficiency of the model training resources corresponding to each alternative scheduling mode can be determined based on the second estimated duration of one iteration corresponding to that mode. The usage efficiency of the model training resources corresponding to any alternative scheduling mode can be obtained as follows: the sum of the first estimated durations for which each model training task uses each model training resource is divided by the second estimated duration of one iteration corresponding to that scheduling mode, and then divided by the number of model training resources. The usage efficiency of the model training resources corresponding to each alternative scheduling mode can be used as the reference index corresponding to that scheduling mode.
For example, referring to Figure 3B, under the scheduling mode of Figure 3B the second estimated duration of one iteration is (t4-t1) and the number of model training resources is 3. The sum of the first estimated durations for which each model training task uses each model training resource is (t2-t1)+(t2-t1)/2+(t2-t1)/2+(t3-t2)/2+(t3-t2)+(t3-t2)/2+(t4-t3)/2+(t4-t3)/2+(t4-t3) = 2(t4-t1). Therefore, the usage efficiency of the model training resources corresponding to this scheduling mode is computed to be 2/3.
Optionally, the second estimated duration of one iteration corresponding to each alternative scheduling mode may also be used directly as the reference index corresponding to that scheduling mode. Because the duration of one iteration is negatively correlated with the usage efficiency of the model training resources, the smaller the second estimated duration, the higher the usage efficiency of the model training resources.
This embodiment determines multiple alternative scheduling modes, estimates the reference index corresponding to each scheduling mode, and selects a target scheduling mode from the multiple alternative scheduling modes based on the reference index in order to determine the task scheduling information. Because the reference index is related to the usage efficiency of the model training resources, this embodiment fully considers that usage efficiency when determining the task scheduling information, and selects the scheduling mode that maximizes it to schedule the model training tasks, thereby further improving the utilization of the model training resources and the efficiency of model training.
In addition, the inventors of the present disclosure found that different processing orders of the model training tasks lead to different resource usage efficiency over the whole training process, and therefore further considered obtaining multiple alternative scheduling modes by varying the processing order of the model training tasks and selecting from them the target scheduling mode with the highest usage efficiency of the model training resources in order to determine the task scheduling information. Those skilled in the art had not identified where the problem lay; therefore, through the discovery of the problem, the present disclosure also solves the technical problem of low resource usage efficiency during training.
It should be noted that although the operations of the methods of the embodiments of the present disclosure are described above in a specific order, this does not require or imply that these operations must be performed in that specific order, or that all of the illustrated operations must be performed, to achieve the desired results. On the contrary, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
Corresponding to the foregoing embodiments of the scheduling method for model training tasks, the present disclosure also provides embodiments of a scheduling apparatus for model training tasks.
As shown in Figure 4, Figure 4 is a block diagram of a scheduling apparatus for model training tasks according to an exemplary embodiment of the present disclosure. The apparatus may include an acquisition module 401, a determining module 402, and a scheduling module 403.
The acquisition module 401 is configured to determine a target task group, the target task group including multiple model training tasks to be processed.
The determining module 402 is configured to determine task scheduling information, the task scheduling information including a processing order of the multiple model training tasks.
The scheduling module 403 is configured to schedule, based on the task scheduling information, the multiple model training tasks to use multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
In some implementations, the scheduling module 403 is configured to: for any model training resource, schedule the multiple model training tasks to use that model training resource according to the processing order included in the task scheduling information, where the multiple model training tasks are scheduled by training stage and each model training task is scheduled once in each training stage.
In other implementations, the determining module 402 may include an alternative sub-module, an estimation sub-module, and a selection sub-module (not shown in the figure).
The alternative sub-module is configured to determine multiple alternative scheduling modes.
The estimation sub-module is configured to estimate a reference index corresponding to each scheduling mode, the reference index being related to the usage efficiency of the model training resources.
The selection sub-module is configured to select a target scheduling mode from the multiple alternative scheduling modes according to the reference index, and determine the task scheduling information based on the target scheduling mode.
In other implementations, the selection sub-module is configured to: according to the reference index, select, from the multiple alternative scheduling modes, the scheduling mode with the highest usage efficiency of the model training resources as the target scheduling mode.
In other implementations, the estimation sub-module is configured to: determine a first estimated duration for which each model training task uses each model training resource, and estimate the reference index corresponding to each alternative scheduling mode according to the first estimated durations.
In other implementations, for any model training resource and any model training task, the estimation sub-module determines the first estimated duration for which the model training task uses the model training resource in the following manner: looking up, in pre-stored data, the first estimated duration for which the model training task uses the model training resource; and, if it is not found, calculating the first estimated duration based on the model training resource and the model training task.
In other implementations, for any alternative scheduling mode, the estimation sub-module estimates the reference index corresponding to the alternative scheduling mode in the following manner: calculating, based on the first estimated durations, a second estimated duration of one iteration corresponding to the alternative scheduling mode, and determining the reference index corresponding to the alternative scheduling mode based on the second estimated duration.
In other implementations, the number of model training tasks included in the target task group is less than or equal to the number of model training resources of different types.
In other implementations, the multiple model training tasks are scheduled to the multiple model training resources of different types by the same process.
In other implementations, the multiple model training resources include GPU resources, and different model training tasks use the GPU resources through the context of the same unified computing device architecture (CUDA).
Since the apparatus embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the embodiments of the present disclosure. Persons of ordinary skill in the art can understand and implement this without creative effort.
Figure 5 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure. As shown in Figure 5, the electronic device 910 includes a processor 911 and a memory 912, and can be used to implement a client or a server. The memory 912 is used to store computer-executable instructions (e.g., one or more computer program modules) non-transitorily. The processor 911 is configured to run the computer-executable instructions; when run by the processor 911, the computer-executable instructions can perform one or more steps of the scheduling method for model training tasks described above, thereby implementing that method. The memory 912 and the processor 911 may be interconnected by a bus system and/or another form of connection mechanism (not shown).
For example, the processor 911 may be a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit with data processing capability and/or program execution capability. For example, the central processing unit (CPU) may be of an X86 or ARM architecture. The processor 911 may be a general-purpose processor or a special-purpose processor, and may control other components in the electronic device 910 to perform desired functions.
For example, the memory 912 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer program modules may be stored on the computer-readable storage medium, and the processor 911 may run the one or more computer program modules to implement various functions of the electronic device 910. Various applications and various data, as well as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
It should be noted that, in the embodiments of the present disclosure, for the specific functions and technical effects of the electronic device 910, reference may be made to the description of the scheduling method for model training tasks above, which is not repeated here.
Figure 6 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure. The electronic device 920 is, for example, suitable for implementing the scheduling method for model training tasks provided by the embodiments of the present disclosure. The electronic device 920 may be a terminal device or the like, and may be used to implement a client or a server. The electronic device 920 may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), and a wearable electronic device, as well as fixed terminals such as a digital TV, a desktop computer, and a smart home device. It should be noted that the electronic device 920 shown in Figure 6 is only an example and does not limit the functions and scope of use of the embodiments of the present disclosure.
As shown in Figure 6, the electronic device 920 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 921, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 922 or a program loaded from a storage device 928 into a random access memory (RAM) 923. In the RAM 923, various programs and data required for the operation of the electronic device 920 are also stored. The processing device 921, the ROM 922, and the RAM 923 are connected to one another via a bus 924. An input/output (I/O) interface 925 is also connected to the bus 924.
In general, the following devices may be connected to the I/O interface 925: an input device 926 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 927 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 928 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 929. The communication device 929 may allow the electronic device 920 to communicate wirelessly or by wire with other electronic devices to exchange data. Although Figure 6 shows the electronic device 920 having various devices, it should be understood that it is not required to implement or provide all of the illustrated devices, and the electronic device 920 may alternatively implement or provide more or fewer devices.
For example, according to the embodiments of the present disclosure, the scheduling method for model training tasks described above may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the scheduling method for model training tasks described above. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 929, installed from the storage device 928, or installed from the ROM 922. When the computer program is executed by the processing device 921, the functions defined in the scheduling method for model training tasks provided by the embodiments of the present disclosure can be implemented.
Figure 7 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure. For example, as shown in Figure 7, the storage medium 930 may be a non-transitory computer-readable storage medium for storing non-transitory computer-executable instructions 931. When the non-transitory computer-executable instructions 931 are executed by a processor, the scheduling method for model training tasks described in the embodiments of the present disclosure can be implemented; for example, one or more steps of the scheduling method for model training tasks described above can be performed.
For example, the storage medium 930 may be applied in the electronic device described above; for example, the storage medium 930 may include a memory in the electronic device. For example, the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, any combination of the above storage media, or other suitable storage media.
For example, for the description of the storage medium 930, reference may be made to the description of the memory in the embodiments of the electronic device, and repeated parts are not described again. For the specific functions and technical effects of the storage medium 930, reference may be made to the description of the scheduling method for model training tasks above, which is not repeated here.
It should be noted that, in the context of the present disclosure, a computer-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave that carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to a wire, an optical cable, RF (radio frequency), and the like, or any suitable combination of the above.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

  1. A scheduling method for model training tasks, the method comprising:
    determining a target task group, the target task group including multiple model training tasks to be processed;
    determining task scheduling information, the task scheduling information including a processing order of the multiple model training tasks; and
    scheduling, based on the task scheduling information, the multiple model training tasks to use the multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
  2. The method according to claim 1, wherein scheduling, based on the task scheduling information, the multiple model training tasks to use the multiple model training resources in parallel comprises:
    for any model training resource, scheduling the multiple model training tasks to use that model training resource according to the processing order included in the task scheduling information, wherein the multiple model training tasks are scheduled by training stage, and each model training task is scheduled once in each training stage.
  3. The method according to claim 1, wherein determining the task scheduling information comprises:
    determining multiple alternative scheduling modes;
    estimating a reference index corresponding to each scheduling mode, the reference index being related to the usage efficiency of the model training resources; and
    selecting a target scheduling mode from the multiple alternative scheduling modes according to the reference index, and determining the task scheduling information based on the target scheduling mode.
  4. The method according to claim 3, wherein selecting a target scheduling mode from the multiple alternative scheduling modes according to the reference index comprises:
    selecting, from the multiple alternative scheduling modes according to the reference index, the scheduling mode with the highest usage efficiency of the model training resources as the target scheduling mode.
  5. The method according to claim 3, wherein estimating a reference index corresponding to each scheduling mode comprises:
    determining a first estimated duration for which each model training task uses each model training resource; and
    estimating, according to the first estimated durations, the reference index corresponding to each of the alternative scheduling modes.
  6. The method according to claim 5, wherein, for any model training resource and any model training task, the first estimated duration for which the model training task uses the model training resource is determined in the following manner:
    looking up, in pre-stored data, the first estimated duration for which the model training task uses the model training resource; and
    if the first estimated duration for which the model training task uses the model training resource is not found, calculating the first estimated duration based on the model training resource and the model training task.
  7. The method according to claim 5, wherein, for any alternative scheduling mode, the reference index corresponding to the alternative scheduling mode is estimated in the following manner:
    calculating, based on the first estimated durations, a second estimated duration of one iteration corresponding to the alternative scheduling mode, and determining the reference index corresponding to the alternative scheduling mode based on the second estimated duration.
  8. The method according to claim 1, wherein the number of model training tasks included in the target task group is less than or equal to the number of the model training resources of different types.
  9. The method according to claim 1, wherein the multiple model training tasks are scheduled to the multiple model training resources of different types by the same process.
  10. The method according to claim 1, wherein the multiple model training resources include GPU resources, and different model training tasks use the GPU resources through the context of the same unified computing device architecture (CUDA).
  11. A scheduling apparatus for model training tasks, the apparatus comprising:
    an acquisition module, configured to determine a target task group, the target task group including multiple model training tasks to be processed;
    a determining module, configured to determine task scheduling information, the task scheduling information including a processing order of the multiple model training tasks; and
    a scheduling module, configured to schedule, based on the task scheduling information, the multiple model training tasks to use the multiple model training resources in parallel, so that different model training tasks use different model training resources at the same time.
  12. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed in a computer, the computer is caused to perform the method according to any one of claims 1-10.
  13. An electronic device comprising a memory and a processor, the memory storing executable code, wherein the processor, when executing the executable code, implements the method according to any one of claims 1-10.
PCT/CN2023/112568 2022-08-20 2023-08-11 Scheduling method and apparatus for model training tasks, and electronic device WO2024041400A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211001696.0A CN115220899A (zh) 2022-08-20 2022-08-20 Scheduling method and apparatus for model training tasks, and electronic device
CN202211001696.0 2022-08-20

Publications (1)

Publication Number Publication Date
WO2024041400A1 true WO2024041400A1 (zh) 2024-02-29

Family

ID=83615184

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/112568 WO2024041400A1 (zh) 2023-08-11 2022-08-20 Scheduling method and apparatus for model training tasks, and electronic device

Country Status (2)

Country Link
CN (1) CN115220899A (zh)
WO (1) WO2024041400A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115220899A (zh) 2022-08-20 2022-10-21 抖音视界有限公司 Scheduling method and apparatus for model training tasks, and electronic device
CN116521380A (zh) 2023-07-05 2023-08-01 之江实验室 Resource-adaptive collaborative model training acceleration method, apparatus, and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017127976A1 (zh) 2016-01-25 2017-08-03 华为技术有限公司 Training and scheduling method for an incremental learning cloud system, and related device
CN111768006A (zh) 2020-06-24 2020-10-13 北京金山云网络技术有限公司 Artificial intelligence model training method, apparatus, device, and storage medium
CN112000450A (zh) 2020-08-18 2020-11-27 中国银联股份有限公司 Neural network architecture search method and apparatus
CN114924851A (zh) 2022-05-14 2022-08-19 云知声智能科技股份有限公司 Training task scheduling method and apparatus, electronic device, and storage medium
CN115220899A (zh) 2022-08-20 2022-10-21 抖音视界有限公司 Scheduling method and apparatus for model training tasks, and electronic device


Also Published As

Publication number Publication date
CN115220899A (zh) 2022-10-21

Similar Documents

Publication Publication Date Title
WO2024041400A1 (zh) Scheduling method and apparatus for model training tasks, and electronic device
JP6983154B2 (ja) Processing computational graphs
JP2017138964A (ja) Apparatus, system, and computer-implemented method for processing instructions for accessing an N-dimensional tensor
CN111310904A (zh) Apparatus and method for performing convolutional neural network training
WO2019042200A1 (zh) Distributed system for performing machine learning and method thereof
CN109408214A (zh) Data parallel processing method and apparatus, electronic device, and readable medium
CN107679625B (zh) Distributed system for performing machine learning on data records and method thereof
CN110210501B (zh) Virtual object generation method, electronic device, and computer-readable storage medium
CN110825436B (zh) Computing method applied to an artificial intelligence chip, and artificial intelligence chip
CN115880132B (zh) Graphics processor, matrix multiplication task processing method and apparatus, and storage medium
US8711160B1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
CN115437760A (zh) Computing resource allocation method, electronic device, storage medium, and program product
CN114721835A (zh) Method, system, device, and medium for predicting energy consumption of edge data center servers
JP2020053013A (ja) Request processing method and apparatus
US11941528B2 (en) Neural network training in a distributed system
US11055100B2 (en) Processor, and method for processing information applied to processor
CN116821187A (zh) Database-based data processing method and apparatus, medium, and electronic device
CN110825502B (zh) Neural network processor and task scheduling method for a neural network processor
CN114816719B (zh) Training method and apparatus for a multi-task model
CN110825461A (zh) Data processing method and apparatus
US8739114B2 (en) Using infeasible nodes to select branching variables
CN116149978A (zh) Service interface testing method and apparatus, electronic device, and storage medium
CN115759260B (zh) Inference method and apparatus for a deep learning model, electronic device, and storage medium
CN113806033B (zh) Task execution method and apparatus for a task system, server, and medium
WO2023202352A1 (zh) Speech recognition method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23856490

Country of ref document: EP

Kind code of ref document: A1