CN113590301A - Task scheduling method and related device for deep learning service - Google Patents

Task scheduling method and related device for deep learning service

Info

Publication number
CN113590301A
CN113590301A (application number CN202111162810.3A)
Authority
CN
China
Prior art keywords
task
resource
numa
hardware resource
topological relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111162810.3A
Other languages
Chinese (zh)
Inventor
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111162810.3A priority Critical patent/CN113590301A/en
Publication of CN113590301A publication Critical patent/CN113590301A/en
Priority to PCT/CN2022/078419 priority patent/WO2023050712A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application discloses a task scheduling method for deep learning services, which comprises the following steps: extracting a resource topological relation of a server through a hardware resource topology plug-in to obtain a hardware resource topological relation; determining the task type of an acquired task to be scheduled; performing resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes; and performing task scheduling based on the hardware resource nodes, so that cross-resource task scheduling is avoided and the performance of using the hardware resources is improved. The application also discloses a task scheduling device, a server and a computer-readable storage medium for deep learning services, which have the above beneficial effects.

Description

Task scheduling method and related device for deep learning service
Technical Field
The present application relates to the field of computer technologies, and in particular, to a task scheduling method, a task scheduling apparatus, a server, and a computer-readable storage medium for deep learning services.
Background
With the continuous development of information technology, GPU (graphics processing unit) servers are widely used in the artificial intelligence industry to deploy deep learning service loads, using high-performance GPU and CPU (central processing unit) devices for computation, solid state disks or NVMe (Non-Volatile Memory Host Controller Interface Specification) devices for storage, and the like.
In the related art, to improve the performance of deep learning services, the Non-Uniform Memory Access (NUMA) mechanism gives each processor its own local memory, avoiding the performance loss incurred when multiple processors access the same memory. Accordingly, much current task management for deep learning training and inference tends to use GPU-CPU binding technology to allocate GPUs and CPUs to different task types, which reduces the performance loss caused by frequent switching between the GPU and the CPU. However, to reduce fragmentation, the system often allocates contiguous CPUs, so the NUMA policy conflicts with the system CPU allocation policy, and some applications are forced to access remote memory across resources, which reduces server performance.
Therefore, how to further improve the performance of using hardware resources is a key issue of attention for those skilled in the art.
Disclosure of Invention
The application aims to provide a task scheduling method, a task scheduling device, a server and a computer readable storage medium for deep learning services, so that cross-resource task scheduling is avoided, and the performance of using hardware resources is improved.
In order to solve the above technical problem, the present application provides a task scheduling method for deep learning service, including:
extracting a resource topological relation of the server through the hardware resource topological plug-in to obtain a hardware resource topological relation;
determining the task type of the acquired task to be scheduled;
performing resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes;
and performing task scheduling based on the plurality of hardware resource nodes.
Optionally, performing task scheduling based on the plurality of hardware resource nodes includes:
judging whether task resources corresponding to the tasks to be scheduled are stored in a plurality of numa nodes;
and if so, scheduling the task based on the numa node stored with the task resource.
Optionally, the determining whether task resources corresponding to the task to be scheduled are stored in the numa nodes includes:
determining task resource information of the task to be scheduled;
judging whether task resources corresponding to the task resource information are stored in the memories of the numa nodes; wherein the task resource comprises a data set or a computational model.
Optionally, the method further includes:
when the task resources corresponding to the tasks to be scheduled are not stored in the numa nodes, caching the task resources to any numa node;
and performing task scheduling based on any one numa node.
Optionally, the extracting the resource topological relation of the server through the hardware resource topological plug-in obtains the hardware resource topological relation, including:
loading a numa topology plug-in into the server;
inquiring through the inquiry instruction of the numa topology plug-in to obtain a GPU-CPU topology relation and a CPU-Memory topology relation;
taking the GPU-CPU topological relation and the CPU-Memory topological relation as a numa topological relation; and the hardware resource topological relation is the numa topological relation.
Optionally, determining the task type of the acquired task to be scheduled includes:
acquiring the task to be scheduled;
determining the task type of the task to be scheduled; wherein the task types include a training task and an inference task.
Optionally, the method further includes:
when a data cleaning command is received, determining a target numa node according to the data cleaning command;
and performing data cleaning on the memory of the target numa node.
Optionally, the method further includes:
recording the scheduling condition of each numa node to obtain scheduling times and data hit times;
calculating the use heat of each numa node based on the scheduling times and the data hit times to obtain a heat score;
and performing data cleaning on the memory of the numa node with the heat score smaller than the preset score according to a preset period.
The present application further provides a task scheduling device for deep learning service, including:
the resource topology acquisition module is used for extracting the resource topology relationship of the server through the hardware resource topology plug-in unit to obtain the hardware resource topology relationship;
the task type acquisition module is used for determining the task type of the acquired task to be scheduled;
the resource combination matching module is used for carrying out resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes;
and the task scheduling module is used for scheduling tasks based on the hardware resource nodes.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the task scheduling method as described above when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, realizes the steps of the task scheduling method as described above.
The application provides a task scheduling method for deep learning service, which comprises the following steps: extracting a resource topological relation of the server through the hardware resource topological plug-in to obtain a hardware resource topological relation; determining the task type of the acquired task to be scheduled; performing resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes; and performing task scheduling based on the plurality of hardware resource nodes.
The hardware resource topological relation is obtained, the hardware resource topological relation is then combined and matched based on the task type of the task to be scheduled to obtain a plurality of hardware resource nodes, and task scheduling is finally performed based on these hardware resource nodes rather than on contiguous CPUs, so that cross-resource access by the deep learning service during operation is avoided, the performance of hardware resources in the server is improved, and the utilization rate of the hardware resources is increased.
The present application further provides a task scheduling device, a server, and a computer-readable storage medium for deep learning services, which have the above beneficial effects and are not described in detail again herein.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a task scheduling method for deep learning services according to an embodiment of the present application;
fig. 2 is a flowchart of another task scheduling method for deep learning services according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a task scheduling device for deep learning services according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a task scheduling method, a task scheduling device, a server and a computer readable storage medium for deep learning service, so that cross-resource task scheduling is avoided, and the performance of using a numa mechanism is improved.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, to improve the performance of deep learning services, the NUMA mechanism gives each processor its own local memory, avoiding the performance loss incurred when multiple processors access the same memory. Accordingly, much current task management for deep learning training and inference tends to use GPU-CPU binding technology to allocate GPUs and CPUs to different task types, which reduces the performance loss caused by frequent switching between the GPU and the CPU. However, to reduce fragmentation, the system often allocates contiguous CPUs, so the NUMA policy conflicts with the system CPU allocation policy, and some applications are forced to access remote memory across resources, which reduces server performance.
Therefore, the task scheduling method for deep learning services provided by the application obtains the hardware resource topological relation, performs combination matching on the hardware resource topological relation based on the task type of the task to be scheduled to obtain a plurality of hardware resource nodes, and finally performs task scheduling based on these hardware resource nodes rather than on contiguous CPUs, so that cross-resource access by the deep learning service during operation is avoided, the performance of hardware resources in the server is improved, and the utilization rate of the hardware resources is increased.
The following describes a task scheduling method for deep learning services according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart of a task scheduling method for deep learning services according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, extracting a resource topological relation of a server through a hardware resource topological plug-in to obtain a hardware resource topological relation;
s102, determining the task type of the acquired task to be scheduled;
s103, performing resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes;
and S104, performing task scheduling based on the plurality of hardware resource nodes.
The hardware resource topological relation is a topological relation describing hardware resources in the server, and the most appropriate hardware resource combination can be selected for task scheduling based on the topological relation, so that the utilization efficiency of the hardware resources is improved.
As can be seen, in this embodiment, the hardware resource topological relation is obtained, the hardware resource topological relation is then combined and matched based on the task type of the task to be scheduled to obtain a plurality of hardware resource nodes, and task scheduling is finally performed based on these hardware resource nodes rather than on contiguous CPUs, so that cross-resource access by the deep learning service during operation is avoided, the performance of hardware resources in the server is improved, and the utilization rate of the hardware resources is increased.
Based on the above embodiments, in order to further improve performance and fit application scenarios, the present embodiment applies the method to a numa-architecture server.
Referring to fig. 2, fig. 2 is a flowchart of another task scheduling method for deep learning services according to an embodiment of the present application.
In the embodiment of the application, the method may include:
s101, extracting a resource topological relation of a server through a numa topological plug-in to obtain a numa topological relation;
therefore, the method aims to extract the resource topological relation of the server through the numa topological plug-in to obtain the numa topological relation. In the process of actually applying the server, a numa mechanism is often introduced into the hardware structure of the server, so that the memory utilization efficiency is improved through a numa node.
Therefore, the obtaining of the numa topological relation in this step is to determine the numa structure in the server, so as to determine each numa node in the server.
Further, the step may include:
step 1, loading a numa topology plug-in into a server;
step 2, inquiring through an inquiry instruction of the numa topology plug-in to obtain a GPU-CPU topology relation and a CPU-Memory topology relation;
and step 3, taking the GPU-CPU topological relation and the CPU-Memory topological relation as a numa topological relation.
It can be seen that this alternative mainly explains how to obtain the numa topological relation. In this alternative, the numa topology plug-in is loaded into the server, a query is performed through the query instruction of the numa topology plug-in to obtain the GPU-CPU topological relation and the CPU-Memory topological relation, and these two relations are used together as the numa topological relation. The numa topology plug-in thus acquires the GPU-CPU topological relation and the CPU-Memory topological relation by querying; based on these, the mapping of numa nodes in the server can be determined, thereby determining each numa resource combination in the server, that is, each numa node.
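For illustration, the topology extraction can be approximated in Python. This is a minimal sketch, not the plug-in itself: it assumes a Linux host with NVIDIA GPUs and reads the CPU-Memory topology from sysfs and each GPU's numa node from its PCI slot, as an alternative to parsing the text output of the query instructions named above.

```python
# Minimal sketch of the topology extraction (assumes Linux + NVIDIA GPUs;
# function names are illustrative, not the plug-in's actual interface).
import glob
import os
import subprocess


def cpu_memory_topology():
    """CPU-Memory topology: numa node id -> CPU list string, e.g. '0-15,32-47'."""
    topo = {}
    for node_dir in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(os.path.basename(node_dir)[4:])
        with open(os.path.join(node_dir, "cpulist")) as f:
            topo[node_id] = f.read().strip()
    return topo


def gpu_cpu_topology():
    """GPU-CPU topology: GPU index -> numa node of the GPU's PCI slot."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        text=True,
    )
    topo = {}
    for line in out.strip().splitlines():
        idx, bus_id = (s.strip() for s in line.split(","))
        # nvidia-smi prints an 8-digit PCI domain (00000000:3B:00.0);
        # sysfs uses a 4-digit, lower-case form (0000:3b:00.0).
        domain, rest = bus_id.split(":", 1)
        sysfs_id = f"{domain[-4:]}:{rest}".lower()
        with open(f"/sys/bus/pci/devices/{sysfs_id}/numa_node") as f:
            topo[int(idx)] = int(f.read().strip())  # -1 means no NUMA affinity
    return topo
```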
S102, determining the task type of the acquired task to be scheduled;
on the basis of S101, this step is intended to determine the task type of the acquired task to be scheduled.
This step is intended to determine the task type of the task to be scheduled, wherein different resource usage amounts exist between different task types. Therefore, the task type is determined in this step in order to determine the amount of resources of what size is needed to change the task to be scheduled.
Further, the step may include:
step 1, acquiring a task to be scheduled;
step 2, determining the task type of the task to be scheduled; the task types comprise a training task and an inference task.
It can be seen that the present alternative is mainly illustrative of how to determine the task type. In the alternative scheme, a task to be scheduled is obtained, and the task type of the task to be scheduled is determined; the task types comprise a training task and an inference task. Therefore, the deep learning service implemented in the optional technical scheme mainly comprises a training task and an inference task.
S103, performing numa resource combination matching on the numa topological relation according to the task type to obtain a plurality of numa nodes;
on the basis of S102, the numa topological relation is subjected to numa resource combination matching according to the task type, and a plurality of numa nodes are obtained.
The resource type and resource size required by the task to be scheduled are determined based on the task type acquired in the previous step, and the numa nodes matching that resource combination are then searched for in the numa topological relation.
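As a worked illustration of this matching step, the sketch below assumes a fixed resource demand per task type and simple per-node free counters; the names and demand figures are assumptions for the sketch, not values from the disclosure.

```python
# Illustrative matching of task-type demand against numa nodes.
from dataclasses import dataclass


@dataclass
class NumaNode:
    node_id: int
    free_gpus: int
    free_cpus: int


DEMAND = {  # assumed demand per task type
    "training": {"gpus": 4, "cpus": 16},
    "inference": {"gpus": 1, "cpus": 4},
}


def match_nodes(task_type, nodes):
    """Return every numa node able to host the task without crossing nodes."""
    need = DEMAND[task_type]
    return [n for n in nodes
            if n.free_gpus >= need["gpus"] and n.free_cpus >= need["cpus"]]


# Example: only node 0 can take a training task here.
nodes = [NumaNode(0, 4, 32), NumaNode(1, 2, 32)]
print([n.node_id for n in match_nodes("training", nodes)])  # -> [0]
```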
S104, judging whether task resources corresponding to the tasks to be scheduled are stored in the numa nodes;
on the basis of S103, this step is intended to determine whether task resources corresponding to the tasks to be scheduled are stored in the numa nodes. Therefore, in this step, it is determined whether a task resource that needs to be used by the task to be scheduled exists in the numa nodes, that is, whether there is available cache data.
Further, the step may include:
step 1, determining task resource information of a task to be scheduled;
step 2, judging whether task resources corresponding to the task resource information are stored in memories of the numa nodes; wherein the task resource comprises a data set or a computational model.
It can be seen that the present alternative solution mainly describes how to determine whether the task resource of the task to be scheduled exists in the numa node. I.e., to determine whether there are task resources in the numa node that can be used directly. The alternative scheme is characterized in that task resource information of a task to be scheduled is determined, and whether task resources corresponding to the task resource information are stored in memories of a plurality of numa nodes is judged; wherein the task resource comprises a data set or a computational model. That is, a determination is made as to whether there is a data set or computational model in the numa node that needs to be used.
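The hit test then reduces to a lookup against a per-node record of cached resources, as in this sketch (the registry layout and resource ids are assumptions):

```python
# Sketch of the cache-hit test: a registry maps each numa node id to the
# resource ids (data sets / models) currently held in its memory.
node_cache = {0: {"imagenet-train", "resnet50-weights"}, 1: set()}


def find_cached_node(resource_id, candidate_nodes):
    """Return the first candidate node already holding the task resource, else None."""
    for node_id in candidate_nodes:
        if resource_id in node_cache.get(node_id, set()):
            return node_id
    return None
```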
And S105, if yes, performing task scheduling based on the numa node stored with the task resource.
On the basis of S104, when the task resources are stored in the numa nodes, this step performs task scheduling based on the numa node in which the task resources are stored.
The task scheduling method may be any one of the task scheduling methods provided in the prior art, and is not specifically limited herein.
Further, this embodiment may further include:
step 1, when none of the plurality of numa nodes stores the task resources corresponding to the task to be scheduled, caching the task resources to any one numa node;
and 2, performing task scheduling based on any numa node.
It can be seen that this alternative mainly explains how to perform task scheduling when the task resources are not present. In this alternative, when the task resources corresponding to the task to be scheduled are not stored in any of the plurality of numa nodes, the task resources are cached to an arbitrary numa node, and task scheduling is performed based on that numa node. That is, the task resources to be used are first cached in some numa node, and the task is then scheduled on that node, as the sketch below illustrates. The cached content includes the data set and/or the computational model of the task resource.
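One way to realize the node-local caching, assuming the data set is a file on disk, is to warm it into the page cache under numactl memory binding; the warm-up command itself is an assumption of this sketch, though --cpunodebind and --membind are standard numactl options.

```python
# Sketch of the miss path: warm a data-set file into the chosen node's page
# cache by reading it under numactl binding, then record the cache entry.
import subprocess


def cache_to_node(node_id, dataset_path, node_cache):
    # Reading under --membind keeps the allocated page-cache pages on node_id.
    subprocess.run(
        ["numactl", f"--cpunodebind={node_id}", f"--membind={node_id}",
         "dd", f"if={dataset_path}", "of=/dev/null", "bs=1M"],
        check=True,
    )
    node_cache.setdefault(node_id, set()).add(dataset_path)
```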
Further, this embodiment may further include:
step 1, when a data cleaning command is received, determining a target numa node according to the data cleaning command;
and 2, performing data cleaning on the memory of the target numa node.
It can be seen that this alternative mainly explains how data cleaning can be performed. In this alternative, when a data cleaning command is received, the target numa node is determined according to the data cleaning command, and data cleaning is performed on the memory of the target numa node. That is, the corresponding numa node is cleaned based on the data cleaning command sent by the user, so as to release memory resources in each numa node and improve resource utilization efficiency.
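A hedged sketch of this manual cleaning path follows. Stock Linux exposes only a host-wide page-cache drop through /proc/sys/vm/drop_caches, so per-node targeting is approximated here by pairing the global drop with per-node bookkeeping; a real implementation could instead free node-bound cache files.

```python
# Sketch of manual cleaning (requires root). Linux offers only a global
# page-cache drop; the per-node effect is tracked via the cache registry.
import subprocess


def clean_node(target_node_id, node_cache):
    subprocess.run(["sync"], check=True)          # flush dirty pages first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")                            # drop page cache + slab
    node_cache[target_node_id] = set()            # forget the node's records
```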
Further, this embodiment may further include:
step 1, recording scheduling conditions of each numa node to obtain scheduling times and data hit times;
step 2, calculating the use heat of each numa node based on the scheduling times and the data hit times to obtain a heat score;
and 3, performing data cleaning on the memory of the numa node with the heat score smaller than the preset score according to a preset period.
It can be seen that this alternative mainly explains how data cleaning can be performed automatically. In this alternative, the scheduling condition of each numa node is recorded to obtain scheduling times and data hit times, the usage heat of each numa node is calculated based on these counts to obtain a heat score, and the memory of any numa node whose heat score is smaller than the preset score is cleaned according to a preset period. That is, the data of a numa node is cleaned based on the recorded scheduling times and data hit times, where the scheduling times are the number of times tasks were scheduled on the numa node, and the data hit times are the number of times task resources already present in the numa node were reused.
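The bookkeeping behind this can be as small as two counters per node, as sketched below (the field names are assumptions):

```python
# Sketch of the per-node scheduling record: t_i (schedules) and h_i (hits).
from collections import defaultdict

sched_count = defaultdict(int)  # t_i: tasks scheduled on node i
hit_count = defaultdict(int)    # h_i: schedules that reused cached data on node i


def record_schedule(node_id, cache_hit):
    sched_count[node_id] += 1
    if cache_hit:
        hit_count[node_id] += 1
```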
In summary, in this embodiment, the hardware resource topological relation is obtained, the hardware resource topological relation is then combined and matched based on the task type of the task to be scheduled to obtain a plurality of hardware resource nodes, and task scheduling is finally performed based on these hardware resource nodes rather than on contiguous CPUs, so that cross-resource access by the deep learning service during operation is avoided, the performance of hardware resources in the server is improved, and the utilization rate of the hardware resources is increased.
The following further describes a task scheduling method for deep learning services provided by the present application by a specific embodiment.
In this embodiment, the default hardware plug-in and scheduling strategy of Kubernetes are optimized according to the GPU-CPU topology and numa node information on a GPU server, so that the numa node with the best GPU, CPU and Memory combination is used; meanwhile, relevant training data or the inference model is pre-cached according to the memory usage within the numa node to improve service running performance.
In this embodiment, the GPU-CPU and numa information on the host is obtained and reported to Kubernetes. First, a numa-topo-plugin (numa topology plug-in) is customized according to the device resource management and device plug-in interface specification of Kubernetes; when the plug-in is initialized and run, it reports the GPU-CPU topology and numa node information on the server to the scheduling module of Kubernetes. The GPU-CPU topological relation is queried through nvidia-smi (query instruction), and the CPU-Memory topological relation of the numa nodes is obtained by querying through numactl (query instruction).
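The reporting step can be approximated outside a real device plug-in by publishing the queried topology as a node annotation; the annotation key and payload layout below are assumptions made for illustration, not the plug-in's actual reporting interface.

```python
# Simplified stand-in for the plug-in's reporting step: serialize the two
# topologies and attach them to the Kubernetes node as an annotation.
import json
import subprocess


def report_topology(node_name, gpu_cpu_topo, cpu_mem_topo):
    payload = json.dumps({"gpu_numa": gpu_cpu_topo, "numa_cpus": cpu_mem_topo})
    subprocess.run(
        ["kubectl", "annotate", "node", node_name, "--overwrite",
         f"numa-topo-plugin/topology={payload}"],
        check=True,
    )
```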
Based on this, the process of performing task scheduling in this embodiment includes:
step 1, acquiring the task type (training tasks and inference tasks, each using a combination of GPUs and CPUs);
step 2, searching for combinations satisfying the GPU and CPU counts of step 1 according to the GPU-CPU and CPU-Memory topological relations stored in the scheduling module, and building a list of combinations meeting the count requirement for subsequent use;
step 3, for the combinations satisfying the resource count requirement, judging whether the data set or model is already held in the buff/cache of the numa node's memory (a hit);
step 4, if the result of step 3 is a hit, scheduling the task; otherwise, selecting a group of combinations from the list meeting the count requirement, caching the service data into the buff/cache, and performing task scheduling after the caching is finished;
step 5, recording the numa nodes used for scheduling, mainly marking each node's usage count and data caching count for use in the buff/cache cleaning calculation, the recorded data comprising t_i, the number of schedules on numa node i that met the computing resource requirement, and h_i, the number of times data was hit and retained in the buff/cache of numa node i;
step 6, if the service requires that the data in the buff/cache be cleared after the run finishes, calling a system command to empty the memory in the numa node after the task ends, and cleaning the corresponding record h_i from step 5; otherwise, reserving the data for subsequent use;
step 7, if data cached in the buff/cache is not used or cleaned for a long time, the performance of the whole machine is affected, so regular cleaning ensures whole-machine performance, based on the records from step 5: t_i, the number of schedules meeting the computing resource requirements (GPU, CPU), and h_i, the number of times data was hit and held in the buff/cache.
Wherein, the cleaning rule includes:
first calculating the score of each numa node according to a preset period and then sorting; the buff/cache of nodes whose scores remain low for a certain time (default is one week) is cleaned, and the recorded data is updated.
The score of numa node i may, for example, be calculated as
score_i = (w_t * t_i + w_h * h_i) / sum_{j=1..k} (w_t * t_j + w_h * h_j), with w_h > w_t,
where i is the numa node number and k is the total number of numa nodes on the host. The cached-data record h_i is given the larger weight because the time cost of caching a data set is high and the gain from actually reusing cached data in the buff/cache is large, so the data cache of frequently used numa nodes is retained.
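Under a score of the form above, the periodic cleaner can be sketched as follows; the weight values, the threshold and the eviction policy details are assumptions consistent with the text, not disclosed parameters.

```python
# Sketch of the periodic cleaning rule: normalized heat scores with cache
# hits weighted higher than raw schedules (w_h > w_t); low-score nodes have
# their buff/cache evicted and their records reset.
W_T, W_H = 1.0, 2.0  # assumed weights, with W_H > W_T per the text


def heat_scores(sched_count, hit_count, node_ids):
    """sched_count/hit_count: per-node counters (e.g. the defaultdicts above)."""
    raw = {i: W_T * sched_count[i] + W_H * hit_count[i] for i in node_ids}
    total = sum(raw.values()) or 1.0  # avoid division by zero
    return {i: v / total for i, v in raw.items()}


def periodic_clean(node_cache, sched_count, hit_count, threshold=0.1):
    """Run once per preset period; evict nodes scoring below the preset score."""
    scores = heat_scores(sched_count, hit_count, list(node_cache))
    for node_id, score in sorted(scores.items(), key=lambda kv: kv[1]):
        if score < threshold:
            node_cache[node_id] = set()   # clean the node's cached data
            sched_count[node_id] = 0      # update the recorded data
            hit_count[node_id] = 0
```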
As can be seen, in this embodiment, the hardware resource topological relation is obtained, the hardware resource topological relation is then combined and matched based on the task type of the task to be scheduled to obtain a plurality of hardware resource nodes, and task scheduling is finally performed based on these hardware resource nodes rather than on contiguous CPUs, so that cross-resource access by the deep learning service during operation is avoided, the performance of hardware resources in the server is improved, and the utilization rate of the hardware resources is increased.
In the following, the task scheduling device provided in the embodiment of the present application is introduced, and the task scheduling device described below and the task scheduling method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a task scheduling device for deep learning services according to an embodiment of the present disclosure.
In this embodiment, the apparatus may include:
the resource topology obtaining module 100 is configured to extract a resource topology relationship from the server through the numa topology plug-in to obtain a numa topology relationship;
a task type obtaining module 200, configured to determine a task type of an obtained task to be scheduled;
the resource combination matching module 300 is used for conducting numa resource combination matching on the numa topological relation according to the task type to obtain a plurality of numa nodes;
a task resource determining module 400, configured to determine whether a task resource corresponding to a task to be scheduled is stored in a plurality of numa nodes;
and the task scheduling module 500 is configured to, when the task resource corresponding to the task to be scheduled is stored, perform task scheduling based on the numa node in which the task resource is stored.
Optionally, the resource topology obtaining module 100 is specifically configured to load a numa topology plug-in into a server; inquiring through an inquiry instruction of the numa topology plug-in to obtain a GPU-CPU topology relation and a CPU-Memory topology relation; and taking the GPU-CPU topological relation and the CPU-Memory topological relation as the numa topological relation.
Optionally, the task type obtaining module 200 is specifically configured to obtain a task to be scheduled; determining the task type of a task to be scheduled; the task types comprise a training task and an inference task.
Optionally, the task resource determining module 400 is specifically configured to determine task resource information of a task to be scheduled; judging whether task resources corresponding to the task resource information are stored in memories of the numa nodes; wherein the task resource comprises a data set or a computational model.
Optionally, the apparatus may further include:
the data caching module is used for caching the task resources to any numa node when the task resources corresponding to the tasks to be scheduled are not stored in the plurality of numa nodes; and performing task scheduling based on any numa node.
Optionally, the apparatus may further include:
the first cleaning module is used for determining a target numa node according to a data cleaning command when the data cleaning command is received; and performing data cleaning on the memory of the target numa node.
Optionally, the apparatus may further include:
the first cleaning module is used for recording the scheduling condition of each numa node to obtain the scheduling times and the data hit times; calculating the use heat of each numa node based on the scheduling times and the data hit times to obtain a heat value; and cleaning the data of the memory of the numa node with the heat score smaller than the preset score according to a preset period.
An embodiment of the present application further provides a server, including:
a memory for storing a computer program;
a processor for implementing the steps of the task scheduling method as described in the above embodiments when executing the computer program.
The embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the task scheduling method according to the above embodiments are implemented.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The task scheduling method, the task scheduling apparatus, the server and the computer-readable storage medium for the deep learning service provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (11)

1. A task scheduling method for deep learning service is characterized by comprising the following steps:
extracting a resource topological relation of the server through the hardware resource topological plug-in to obtain a hardware resource topological relation;
determining the task type of the acquired task to be scheduled;
performing resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes;
and performing task scheduling based on the plurality of hardware resource nodes.
2. The task scheduling method according to claim 1, wherein task scheduling based on the plurality of hardware resource nodes comprises:
judging whether task resources corresponding to the tasks to be scheduled are stored in a plurality of numa nodes;
and if so, scheduling the task based on the numa node stored with the task resource.
3. The task scheduling method according to claim 2, wherein the step of determining whether a plurality of numa nodes store task resources corresponding to the task to be scheduled includes:
determining task resource information of the task to be scheduled;
judging whether task resources corresponding to the task resource information are stored in the memories of the numa nodes; wherein the task resource comprises a data set or a computational model.
4. The task scheduling method according to claim 2, further comprising:
when the task resources corresponding to the tasks to be scheduled are not stored in the numa nodes, caching the task resources to any numa node;
and performing task scheduling based on any one numa node.
5. The task scheduling method according to claim 1, wherein extracting the resource topological relation of the server through the hardware resource topological plug-in to obtain the hardware resource topological relation comprises:
loading a numa topology plug-in into the server;
inquiring through the inquiry instruction of the numa topology plug-in to obtain a GPU-CPU topology relation and a CPU-Memory topology relation;
taking the GPU-CPU topological relation and the CPU-Memory topological relation as a numa topological relation; and the hardware resource topological relation is the numa topological relation.
6. The task scheduling method according to claim 1, wherein determining the task type of the acquired task to be scheduled comprises:
acquiring the task to be scheduled;
determining the task type of the task to be scheduled; wherein the task types include a training task and an inference task.
7. The task scheduling method according to claim 1, further comprising:
when a data cleaning command is received, determining a target numa node according to the data cleaning command;
and performing data cleaning on the memory of the target numa node.
8. The task scheduling method according to claim 1, further comprising:
recording the scheduling condition of each numa node to obtain scheduling times and data hit times;
calculating the use heat of each numa node based on the scheduling times and the data hit times to obtain a heat value;
and performing data cleaning on the memory of the numa node with the heat score smaller than the preset score according to a preset period.
9. A task scheduling apparatus for deep learning services, comprising:
the resource topology acquisition module is used for extracting the resource topology relationship of the server through the hardware resource topology plug-in unit to obtain the hardware resource topology relationship;
the task type acquisition module is used for determining the task type of the acquired task to be scheduled;
the resource combination matching module is used for carrying out resource combination matching on the hardware resource topological relation according to the task type to obtain a plurality of hardware resource nodes;
and the task scheduling module is used for scheduling tasks based on the hardware resource nodes.
10. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the task scheduling method according to any one of claims 1 to 8 when executing said computer program.
11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the task scheduling method according to any one of claims 1 to 8.
CN202111162810.3A 2021-09-30 2021-09-30 Task scheduling method and related device for deep learning service Pending CN113590301A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111162810.3A CN113590301A (en) 2021-09-30 2021-09-30 Task scheduling method and related device for deep learning service
PCT/CN2022/078419 WO2023050712A1 (en) 2021-09-30 2022-02-28 Task scheduling method for deep learning service, and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111162810.3A CN113590301A (en) 2021-09-30 2021-09-30 Task scheduling method and related device for deep learning service

Publications (1)

Publication Number Publication Date
CN113590301A true CN113590301A (en) 2021-11-02

Family

ID=78242798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111162810.3A Pending CN113590301A (en) 2021-09-30 2021-09-30 Task scheduling method and related device for deep learning service

Country Status (2)

Country Link
CN (1) CN113590301A (en)
WO (1) WO2023050712A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098238A (en) * 2022-07-07 2022-09-23 北京鼎成智造科技有限公司 Application program task scheduling method and device
WO2023050712A1 (en) * 2021-09-30 2023-04-06 苏州浪潮智能科技有限公司 Task scheduling method for deep learning service, and related apparatus
WO2023116910A1 (en) * 2021-12-24 2023-06-29 华为云计算技术有限公司 Computing resource and cache resource scheduling method and apparatus, and system
WO2024045784A1 (en) * 2022-08-29 2024-03-07 华为技术有限公司 Job scheduling method, scheduler, and related device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117032937B (en) * 2023-09-28 2024-01-09 之江实验室 Task scheduling method based on GPU, electronic device and storage medium
CN117193992B (en) * 2023-11-08 2024-02-02 浙江大华技术股份有限公司 Model training method, task scheduling device and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819540A (en) * 2009-02-27 2010-09-01 国际商业机器公司 Method and system for scheduling task in cluster
CN102027447A (en) * 2008-05-16 2011-04-20 微软公司 Local collections of tasks in a scheduler
CN107193636A (en) * 2017-05-25 2017-09-22 深信服科技股份有限公司 Virtual task simulation method and device in sandbox environment under a kind of NUMA architecture
CN107193649A (en) * 2017-05-25 2017-09-22 深信服科技股份有限公司 A kind of method for scheduling task and device based on NUMA system
CN110914805A (en) * 2017-07-12 2020-03-24 华为技术有限公司 Computing system for hierarchical task scheduling
CN111880911A (en) * 2020-06-19 2020-11-03 浪潮电子信息产业股份有限公司 Task load scheduling method, device and equipment and readable storage medium
CN113377520A (en) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 Resource scheduling method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10452538B2 (en) * 2015-01-21 2019-10-22 Red Hat, Inc. Determining task scores reflective of memory access statistics in NUMA systems
US10325343B1 (en) * 2017-08-04 2019-06-18 EMC IP Holding Company LLC Topology aware grouping and provisioning of GPU resources in GPU-as-a-Service platform
CN110647999A (en) * 2019-08-23 2020-01-03 苏州浪潮智能科技有限公司 Method and device for improving deep learning training speed based on topological structure
CN111756802B (en) * 2020-05-26 2021-09-03 深圳大学 Method and system for scheduling data stream tasks on NUMA platform
CN113238848A (en) * 2021-05-27 2021-08-10 上海商汤科技开发有限公司 Task scheduling method and device, computer equipment and storage medium
CN113590301A (en) * 2021-09-30 2021-11-02 苏州浪潮智能科技有限公司 Task scheduling method and related device for deep learning service

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027447A (en) * 2008-05-16 2011-04-20 微软公司 Local collections of tasks in a scheduler
CN101819540A (en) * 2009-02-27 2010-09-01 国际商业机器公司 Method and system for scheduling task in cluster
CN107193636A (en) * 2017-05-25 2017-09-22 深信服科技股份有限公司 Virtual task simulation method and device in sandbox environment under a kind of NUMA architecture
CN107193649A (en) * 2017-05-25 2017-09-22 深信服科技股份有限公司 A kind of method for scheduling task and device based on NUMA system
CN110914805A (en) * 2017-07-12 2020-03-24 华为技术有限公司 Computing system for hierarchical task scheduling
CN111880911A (en) * 2020-06-19 2020-11-03 浪潮电子信息产业股份有限公司 Task load scheduling method, device and equipment and readable storage medium
CN113377520A (en) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 Resource scheduling method, device, equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANDI DREBES et al.: "Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management", 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) *
LIU SU (刘粟) et al.: "Topology-based task scheduling strategy in a Storm environment" (Storm环境下基于拓扑结构的任务调度策略), Journal of Computer Applications (《计算机应用》) *
ZHAO CHENG (赵成): "Embedded System Application Fundamentals: SKYEYE Simulation and Practice Based on S3C2410A" (《嵌入式系统应用基础 基于S3C2410A的SKYEYE仿真与实践》), 28 February 2012 *
LI YUANCHUN (黎元春): "The Latest Windows Server 2003 User Guide" (《最新Windows Server 2003使用指南》), 30 September 2003 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023050712A1 (en) * 2021-09-30 2023-04-06 苏州浪潮智能科技有限公司 Task scheduling method for deep learning service, and related apparatus
WO2023116910A1 (en) * 2021-12-24 2023-06-29 华为云计算技术有限公司 Computing resource and cache resource scheduling method and apparatus, and system
CN115098238A (en) * 2022-07-07 2022-09-23 北京鼎成智造科技有限公司 Application program task scheduling method and device
CN115098238B (en) * 2022-07-07 2023-05-05 北京鼎成智造科技有限公司 Application program task scheduling method and device
WO2024045784A1 (en) * 2022-08-29 2024-03-07 华为技术有限公司 Job scheduling method, scheduler, and related device

Also Published As

Publication number Publication date
WO2023050712A1 (en) 2023-04-06

Similar Documents

Publication Publication Date Title
CN113590301A (en) Task scheduling method and related device for deep learning service
CN109788046B (en) Multi-strategy edge computing resource scheduling method based on improved bee colony algorithm
CN113377540A (en) Cluster resource scheduling method and device, electronic equipment and storage medium
JP2007140710A (en) Task allocation method and task allocation device
CN111143039B (en) Scheduling method and device of virtual machine and computer storage medium
CN112416585A (en) GPU resource management and intelligent scheduling method for deep learning
CN110795226B (en) Method for processing task using computer system, electronic device and storage medium
CN106815254A (en) A kind of data processing method and device
CN113946431B (en) Resource scheduling method, system, medium and computing device
US20200341819A1 (en) Information processing apparatus and distributed processing system
CN112559147B (en) Dynamic matching method, system and equipment based on GPU (graphics processing Unit) occupied resource characteristics
CN114327811A (en) Task scheduling method, device and equipment and readable storage medium
CN112084017A (en) Memory management method and device, electronic equipment and storage medium
CN105574008A (en) Task scheduling method and equipment applied to distributed file system
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN108769244B (en) Storage task information acquisition method and related device
CN107229519B (en) Task scheduling method and device
CN112688980B (en) Resource distribution method and device, and computer equipment
CN108228323A (en) Hadoop method for scheduling task and device based on data locality
CN114978951B (en) Cloud platform load balancing method
CN114003378B (en) Container cluster load balancing method, device, equipment and storage medium
CN116820730B (en) Task scheduling method, device and storage medium of multi-engine computing system
CN113296961B (en) GPU-based dynamic memory allocation method and device and memory linked list
CN117389747B (en) Data sharing method of distributed database, electronic equipment and storage medium
CN109800076B (en) Storage scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211102