CN114579269A - Task scheduling method and device


Info

Publication number
CN114579269A
CN114579269A
Authority
CN
China
Prior art keywords
task
target data
historical
data
new
Prior art date
Legal status
Pending
Application number
CN202210118739.7A
Other languages
Chinese (zh)
Inventor
徐之浩
车漾
张凯
顾荣
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210118739.7A
Publication of CN114579269A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

One or more embodiments of the present specification provide a task scheduling method and apparatus, where the method includes: determining target data accessed by a received new task and the cache state of the target data; determining the insertion position of the new task in a queue to be scheduled based on the cache state; and inserting the new task into the queue to be scheduled according to the insertion position.

Description

Task scheduling method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a task scheduling method and apparatus.
Background
In a computing device that adopts an architecture separating computation from storage, data access speed is limited; deploying a cache system in the computing device is one solution to this problem. A cache system deployed in the computing device can provide low-latency, high-bandwidth data access services for the tasks submitted to it. However, a new task that runs for the first time still suffers from a low cache hit rate. In addition, as the data sets required by task computation grow larger and larger, the limited capacity of the cache system restricts the range of data that can be cached, further reducing the cache hit rate.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a task scheduling method and apparatus.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
according to a first aspect of one or more embodiments of the present specification, there is provided a task scheduling method, including:
determining target data accessed by the received new task and a cache state corresponding to the target data;
determining the insertion position of the new task in a queue to be scheduled based on the cache state;
and inserting the new task into the queue to be scheduled according to the insertion position.
According to a second aspect of one or more embodiments of the present specification, there is provided a task scheduling apparatus including:
the target data determining unit is used for determining target data accessed by the received new task and the cache state of the target data;
the inserting position determining unit is used for determining the inserting position of the new task in the queue to be scheduled based on the cache state;
and the first new task inserting unit is used for inserting the new task into the queue to be scheduled according to the insertion position.
According to a third aspect of one or more embodiments of the present description, there is provided a computer readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to the first aspect.
According to a fourth aspect of one or more embodiments of the present description, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the program.
According to a fifth aspect of one or more embodiments of the present specification, there is provided a deep learning training job scheduling method, including:
determining target data accessed by the received new job and the cache state of the target data;
determining the insertion position of the new job in a queue to be scheduled based on the cache state;
and inserting the new job into the queue to be scheduled according to the insertion position.
In the technical solutions provided in this specification, the tasks in the queue to be scheduled are reordered by querying how the data each task requires is cached in the computing device, so that the execution order of the tasks is arranged reasonably. Given a limited overall cache capacity, this improves the cache hit rate and the cache reuse rate, and thus the efficiency of data access.
Drawings
FIG. 1 is a schematic diagram of an architecture of a task scheduling system provided in an exemplary embodiment of the present specification;
FIG. 2 is a flowchart of a task scheduling method provided by an exemplary embodiment of the present specification;
fig. 3 is a schematic structural diagram of a scheduler provided in an exemplary embodiment of the present specification;
FIG. 4 is a flowchart of a deep learning training job scheduling method provided by an exemplary embodiment of the present description;
fig. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure;
FIG. 6 is a diagram of a task scheduler provided in an exemplary embodiment of the present specification;
fig. 7 is a schematic diagram of another task scheduling device provided in an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
In order to improve the hit rate of the cache system deployed in a computing device, the present specification provides a task scheduling method that reorders new tasks in the queue to be scheduled according to how the data they require is cached in the computing device, arranges the execution order of tasks reasonably, and improves the cache hit rate and the cache reuse rate under a limited overall cache capacity.
Fig. 1 is a schematic diagram of an architecture of a task scheduling system shown in this specification. As shown in fig. 1, the system may include a computing unit 11, a caching unit 12, a storage unit 13, a scheduling unit 14, a network 15, and an electronic device 16.
The computing unit 11 may be a physical server including an independent host, or a virtual server carried by a host cluster; the same holds for the caching unit 12, the storage unit 13, and the scheduling unit 14. These units may reside in the same physical device or in different physical devices, or some of them may reside in one physical device while the others reside in one or more other physical devices. Alternatively, at least one of the four units may be a virtual server carried by a host cluster while the rest are physical servers in the same or different physical devices, with data transmitted between the virtual servers and the physical servers through the network 15. Of course, all four units may be virtual servers carried by a host cluster, transmitting data among themselves through the network 15, and they may belong to the same computing cluster or to different computing clusters. During operation, the computing unit 11 may be configured with a computing device, implemented in software and/or hardware, to run new tasks submitted by the scheduling unit 14; the caching unit 12 may be configured with a task caching device, implemented in software and/or hardware, to retrieve from the storage unit the data a task needs to read and pre-cache it, providing a highly efficient data access service for the computing unit 11; the storage unit 13 may be configured with a data storage device, implemented in software and/or hardware, to store the data required for computation; and the scheduling unit 14 may be configured with a task scheduling device, implemented in software and/or hardware, to schedule the new tasks to be submitted, so that the computing unit 11 executes the tasks in the order arranged by the scheduling unit 14.
The electronic device 16 is a device used by a user, and may be, for example, a mobile phone, a tablet device, a notebook computer, a PDA (Personal Digital Assistant), or a wearable device (such as smart glasses or a smart watch), which is not limited by one or more embodiments of the present disclosure. During operation, the user submits tasks through the electronic device; the scheduling unit 14 receives the submitted tasks and schedules them reasonably in the queue to be scheduled based on the solutions of this specification, and the computing unit 11 executes the tasks in the queue in order.
The network 15, used for interaction between the electronic device 16 and the scheduling unit 14, may include various types of wired or wireless networks. In one embodiment, the network 15 may include the Public Switched Telephone Network (PSTN) and the Internet.
Fig. 2 is a flowchart of a task scheduling method provided in an exemplary embodiment of the present specification, where the task scheduling method may include the following steps:
s201, determining the target data accessed by the received new task and the cache state of the target data.
In an exemplary embodiment of the present specification, the task scheduling method described above may be applied to a scheduler in a Kubernetes computing cluster.
Kubernetes is an open-source container operation and maintenance automation platform through which a computing cluster composed of multiple computers can be managed and controlled simply and efficiently. In a Kubernetes computing cluster, each computer is a computing node, the minimum unit of computing resources in the cluster; multiple computing nodes together form a computing cluster with powerful computing capability.
As a computing cluster management platform, Kubernetes can intelligently assign tasks to specific computing nodes. When the computing nodes in a cluster change, for example when any node is added or deleted, Kubernetes, as an automated management platform, automatically reassigns the tasks to computing nodes without manual intervention. Because a task is not bound to a specific computing node and may be transferred to another node at any time as nodes are added or deleted, the task's data cannot be saved in the file system of a particular node in the cluster. For persistent data storage, Kubernetes provides data storage facilities such as Persistent Volumes (PV), which are mounted in the computing cluster, are responsible for persistently storing data, and are not bound to any specific node, thereby avoiding data loss caused by the addition and deletion of computing nodes. This architecture of separating computation from storage limits the data access efficiency of tasks running in a Kubernetes computing cluster.
To improve data access efficiency, a cache system can be deployed in the Kubernetes computing cluster. Even so, in a Kubernetes computing cluster with a cache system, a new task still suffers from a low cache hit rate when it runs for the first time. In addition, as the data sets required by complex tasks grow larger and larger, the limited cache capacity of the Kubernetes computing cluster further reduces the cache hit rate.
In an exemplary embodiment of the present specification, as shown in fig. 3, step S201 is performed by the cache manager 312 (Cache Manager) of a scheduler in the Kubernetes computing cluster. Specifically, this step may be performed by the status collector 3121 (Status Collector) in the cache manager 312.
In an exemplary embodiment of the present specification, the target data directory accessed by the received new task may be obtained through a historical task information set: a data directory having a mapping relation with the new task is searched for in the historical task information set and used as the target data directory, wherein the historical task information set stores mapping relations between historical tasks and the historical data directories they accessed. Because these mappings are stored, when a historical task identical to the new task is found in the set, the historical data directory mapped to that historical task can be used to predict the target data directory the new task will access; that historical data directory is taken as the predicted target data directory, and the target data under it is taken as the data the new task is predicted to read while running.
In an exemplary embodiment of the present specification, the search for the data directory having a mapping relation with the new task can be driven by the meta-information contained in the new task. The search process may include: acquiring the meta-information contained in the new task; retrieving in the historical task information set based on that meta-information, wherein each mapping relation stored in the set is established between the historical meta-information of a historical task and the historical data directory it accessed; and, if historical meta-information consistent with the meta-information of the new task is found in the set, determining the historical data directory corresponding to that historical meta-information as the target data directory of the new task.
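To make the lookup concrete, the following Go sketch models the historical task information set as an in-memory list of records. All identifiers (MetaInfo, HistoryRecord, HistoryStore, LookupTargetDir) are illustrative assumptions rather than names from the patent, and a real deployment would likely persist the set in a dedicated component.

```go
package scheduler

// MetaInfo identifies a task, for example by its container image and start
// command (two of the meta-information items named in this specification).
type MetaInfo struct {
	Image        string
	StartCommand string
}

// HistoryRecord maps the meta-information of a finished historical task to
// the historical data directory it accessed.
type HistoryRecord struct {
	Meta    MetaInfo
	DataDir string
}

// HistoryStore is a minimal in-memory historical task information set.
type HistoryStore struct {
	records []HistoryRecord
}

// LookupTargetDir retrieves, based on the new task's meta-information, the
// historical data directory recorded for a consistent historical task; on a
// hit it is returned as the predicted target data directory.
func (h *HistoryStore) LookupTargetDir(meta MetaInfo) (string, bool) {
	for _, r := range h.records {
		if r.Meta == meta {
			return r.DataDir, true
		}
	}
	return "", false // no mapping: the target data directory cannot be determined
}
```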
In an exemplary embodiment of the present specification, the meta-information may be a single item or a combination of multiple items, such as the image of the new task, its start command, and the like, which is not specifically limited in this specification.
In an exemplary embodiment of the present specification, the historical task information set may be stored in a separate component of the scheduler.
In an exemplary embodiment of the present specification, the historical data directories stored in the historical task information set are those whose access frequency is not less than a preset frequency threshold. When a new task needs to access data in a subdirectory of its data source directory, the scheduler must make a fine-grained, directory-level prediction of the data the task will access. To this end, after a task is started, the cache manager queries the access frequency of each data directory in the data caching system throughout the task's life cycle, that is, its whole running process; when the access frequency of a data directory rises to the preset value, the task is judged to read the data in that directory. After the task finishes, the cache manager may insert the task's meta-information and the data directories it read into the historical task information set as a history record, so that the data directories to be accessed by tasks submitted in the future can be predicted and the data those directories point to can be cached in advance, improving the efficiency of data access and, in turn, of computation. Therefore, the historical data directories stored in the historical task information set have all reached the preset access frequency. The preset value may be determined according to the actual access frequencies of data directories or set manually, which is not limited in this specification.
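Continuing the same illustrative sketch, the lifecycle tracking described above might look as follows; how the cache manager observes directory accesses and when Finish is invoked are assumptions of this sketch, not details fixed by the patent.

```go
// AccessTracker counts per-directory accesses over one task's life cycle.
type AccessTracker struct {
	threshold int            // preset frequency threshold
	counts    map[string]int // accesses observed per data directory
}

func NewAccessTracker(threshold int) *AccessTracker {
	return &AccessTracker{threshold: threshold, counts: make(map[string]int)}
}

// Observe is called each time the task is seen reading from a data directory.
func (t *AccessTracker) Observe(dir string) {
	t.counts[dir]++
}

// Finish runs after the task ends: every directory whose access frequency
// reached the preset threshold is judged to have been read by the task and
// is inserted into the historical task information set, together with the
// task's meta-information, as a history record.
func (t *AccessTracker) Finish(meta MetaInfo, store *HistoryStore) {
	for dir, n := range t.counts {
		if n >= t.threshold {
			store.records = append(store.records, HistoryRecord{Meta: meta, DataDir: dir})
		}
	}
}
```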
In an exemplary embodiment of the present specification, the target data directory may also be obtained from data source information included in the new task. When the data source directory is included in the data source information, the data source directory may be determined as the target data directory.
After determining the target data directory accessed by the received new task, the cache state corresponding to the target data in that directory is determined according to the target data directory.
In an exemplary embodiment of the present specification, determining the cache state corresponding to the target data in the target data directory may be performed by a cache manager, and specifically, may be determined by a state collector in the cache manager.
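One possible shape for this query, continuing the same sketch, under the assumption that the data caching system can report cached versus total bytes per directory (the patent does not specify its API):

```go
// CacheState is the cache state of a task's target data.
type CacheState int

const (
	NotFullyCached CacheState = iota // target data not (or only partially) cached
	FullyCached                      // target data completely cached
)

// CacheSystem abstracts the data caching system; the assumed method reports
// how many bytes under a directory are cached versus the directory's total size.
type CacheSystem interface {
	CachedBytes(dir string) (cached, total int64)
}

// StatusCollector plays the role of the state collector in the cache manager.
type StatusCollector struct {
	cache CacheSystem
}

// CacheStateOf queries the data caching system for the target data directory
// and reduces the answer to the two states used by the scheduling decision.
func (s *StatusCollector) CacheStateOf(targetDir string) CacheState {
	cached, total := s.cache.CachedBytes(targetDir)
	if total > 0 && cached >= total {
		return FullyCached
	}
	return NotFullyCached
}
```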
Of course, the technical solution provided in this specification is not limited to the scheduler of a Kubernetes computing cluster, and may be implemented in any scheduler that satisfies the operating conditions. For example, in an exemplary embodiment of the present specification, the task scheduling method is applied to the scheduler of a computing device whose corresponding storage device resides in a different physical device, the two transmitting data to each other through a network; the computing device is configured with a data caching system that can cache data from the storage device in advance, so that the computing device can read locally cached data directly during computation, improving computation efficiency. Apart from the specific forms of the storage device and the computing device and the location of the data caching system, the scheduling process in this embodiment follows the scheduling process in a Kubernetes computing cluster, which is not repeated here.
In an exemplary embodiment of the present specification, the task scheduling method provided by the present specification may also be applied to a scheduling unit, where the scheduling unit is located in the same physical device as the computing unit and a storage unit responsible for storing data, and the physical device is further configured with a cache unit responsible for caching data that needs to be read by the computing unit from the storage unit. The specific scheduling process in this embodiment may also refer to a specific scheduling process in a Kubernetes computing cluster.
In fact, the task scheduling method provided in this specification can be applied to any scheduler in a computing device that is deployed with a data caching system and adopts an architecture separating computation from storage. Whether the computing unit, the storage unit, the scheduler, and the data caching system of the computing device are implemented by a computing cluster or a physical device, and whether they reside in the same computing cluster or physical device or in different ones, is not specifically limited in this application.
S202, determining the insertion position of the new task in the queue to be scheduled based on the cache state.
The insertion position of the new task in the queue to be scheduled is determined from the target data directory determined in step S201 and the cache state of the target data that directory points to. The cache state of the target data falls into two types: the target data is not completely cached, or the target data is completely cached; the two states correspond to different insertion positions.
If the target data in the target data directory of the new task is not completely cached, determining that the insertion position of the new task is the adjacent position of at least one task with the same target data directory as the new task in the queue to be scheduled; and if the target data in the target data directory of the new task is completely cached, determining that the insertion position of the new task is the head of the queue to be scheduled.
For example, as shown in fig. 3, there are 5 tasks (jobs) in the queue 311 to be scheduled, in the following order: Job 4, Job 5, Job 6, Job 7, Job 8. During scheduling, the scheduler takes tasks from the head of the queue to be scheduled and runs them. In the queue 311, assume that Job 5, Job 6, and Job 7 access the same data directory as the new task. For example, Job 5, Job 6, and Job 7 are deep learning training jobs submitted by a user multiple times during hyper-parameter tuning; although each job is modified to some extent compared with the original job, they all access the same data set. In this case, when the new task is also a deep learning training job derived from hyper-parameter tuning of the same original job, so that it accesses the same target data directory as Job 5, Job 6, and Job 7, the insertion position of the new task is determined to be a position adjacent to any of Job 5, Job 6, and Job 7; that is, the insertion position may be between Job 4 and Job 5, between Job 5 and Job 6, between Job 6 and Job 7, or between Job 7 and Job 8.
When the target data of a new task is not completely cached, tasks accessing the same target data directory are arranged together. Since the tasks in the queue 311 to be scheduled run in sequence, when a group of tasks accessing the same target data directory runs, the data caching system 322 only needs to cache the target data once for every task in the group to satisfy its running conditions; the same target data never needs to be cached repeatedly, achieving efficient reuse of the cache.
If the target data of the new task is completely cached, the computing device already satisfies all the running conditions of the new task: the target data pointed to by the new task's target data directory is already in the data caching system 322. The insertion position of the new task is therefore determined to be the head of the queue 311 to be scheduled, and once the previous task finishes, the scheduler 31 can directly take the new task from the head of the queue and run it without caching anything again.
After the insertion position of the new task is determined, the new task may be inserted into the queue to be scheduled 311 according to the insertion position.
S203, inserting the new task into the queue to be scheduled according to the insertion position.
For example, as shown in fig. 3, if the insertion position of the new task is determined to be between Job 5 and Job 6, the new queue to be scheduled generated after inserting the new task into the queue 311 is: Job 4, Job 5, new task, Job 6, Job 7, Job 8.
In an exemplary embodiment of the present specification, if the target data directory of the new task cannot be determined, the new task is inserted at the tail of the queue 311 to be scheduled. For example, if the data source information does not contain a target data directory, or no historical data directory having a mapping relation with the new task can be found in the historical task information set, then the target data of the new task may not be cached in the data caching system 322; in other words, the data caching system 322 holds no reusable cache resources for it, so the computation of the new task must be preceded by data pre-caching. The new task is therefore inserted at the tail of the queue to be scheduled, where it waits for the scheduler to take tasks out in queue order for pre-caching and running.
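The three placement rules combine into a single decision, sketched below with the same illustrative types. Note that falling back to the tail when no like task is found in the queue is an added assumption for completeness; the patent only prescribes the tail for the case where the target data directory cannot be determined.

```go
// Task is a queued task; TargetDir is empty when the target data directory
// could not be determined.
type Task struct {
	Name      string
	TargetDir string
}

// InsertPosition returns the index in the queue to be scheduled at which the
// new task should be inserted.
func InsertPosition(queue []Task, newTask Task, state CacheState) int {
	if newTask.TargetDir == "" {
		return len(queue) // target data unknown: insert at the tail
	}
	if state == FullyCached {
		return 0 // target data completely cached: insert at the head
	}
	// Not completely cached: place the new task adjacent to a task that
	// accesses the same target data directory, so one pre-caching serves
	// the whole group.
	for i, t := range queue {
		if t.TargetDir == newTask.TargetDir {
			return i + 1 // immediately after the matching task
		}
	}
	return len(queue) // assumed fallback: no like task, insert at the tail
}

// Insert splices the new task into the queue at the computed position.
func Insert(queue []Task, newTask Task, state CacheState) []Task {
	pos := InsertPosition(queue, newTask, state)
	queue = append(queue, Task{})    // grow the slice by one slot
	copy(queue[pos+1:], queue[pos:]) // shift the suffix right
	queue[pos] = newTask
	return queue
}
```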
In an exemplary embodiment of the present specification, the task scheduling method described above may be applied to scheduling of a deep learning training job.
In recent years, technologies related to deep learning have developed rapidly in many fields. Containerization and the container orchestration technology represented by Kubernetes simplify application deployment and operation and maintenance, and provide abstraction and sharing of computing resources in large-scale clusters. These capabilities match the requirements of development stages such as deep learning training and model deployment, making it a reasonable choice to run deep learning training programs on Kubernetes.
Deep learning mainly comprises two processes: training and inference. The training process is a typical data-driven job; it essentially learns expert experience from data to solve computer cognition problems. To improve the accuracy of deep learning methods, training jobs often need to access large amounts of data for greater generalization capability. However, cloud computing currently adopts an architecture that separates computation from storage, which limits the data access efficiency of deep learning jobs running in a Kubernetes cluster; the computing resources allocated to a job waste a large amount of time waiting for data IO, resulting in low utilization of overall computing resources. A common practice is to deploy a data caching system in the Kubernetes cluster. Even so, when a deep learning training job first runs in a Kubernetes computing cluster, cache misses still occur. In addition, as the data sets required for deep learning training grow larger and larger, the limited cache capacity of the Kubernetes computing cluster further reduces the cache hit rate.
In order to solve the problem of a low cache hit rate, the technical solution of this specification is applied to the scheduling of deep learning training jobs. In this exemplary embodiment, the computing device is a Kubernetes computing cluster, and the scheduler adapted to it is shown in fig. 3: the scheduler 31 includes a queue 311 to be scheduled, a cache manager 312, and a job history information set 313, where the cache manager 312 includes a status collector 3121 and a data prefetcher 3122. The Kubernetes computing cluster (computing device 32) includes a run queue 321 and a data caching system 322 deployed in the computing device. As shown in fig. 4, the specific deep learning training job scheduling process is as follows:
s401, determining target data accessed by the received new job and the cache state of the target data.
When a user submits a new job to the scheduler 31, the scheduler 31 searches the historical job information set 313 for a data directory having a mapping relation with the new job and uses it as the target data directory of the new job, wherein the historical job information set 313 stores the mapping relations between historical jobs and the historical data directories they accessed. Alternatively, if the new job includes data source information, the target data directory may be acquired from the data source information.
The status collector 3121 in the cache manager 312 is used to determine the target data directory of the new job and to confirm the cache state of the target data corresponding to that directory. Specifically, the status collector 3121 obtains the cache state of the target data by accessing the data caching system 322 in the Kubernetes computing cluster.
As shown in fig. 3, consider a new Job 1 whose target data directory is the same as that of Job 4 and Job 5. The scheduler 31 searches the historical job information set 313 for the data directory having a mapping relation with new Job 1 and uses it as the target data directory of new Job 1, and the status collector 3121 acquires the cache state of the corresponding target data, finding that the target data of new Job 1 is not completely cached.
s402, determining the insertion position of the new job in the queue to be scheduled based on the cache state.
Based on the cache state determined in step S401, the insertion position of new Job 1 is adjacent to Job 4 or Job 5, which have the same target data directory as new Job 1; it may be before Job 4, between Job 4 and Job 5, or after Job 5.
S403, inserting the new job into the queue to be scheduled according to the insertion position.
New Job 1 is inserted at any one of these insertion positions, completing its scheduling.
In an exemplary embodiment of the present specification, assume that new Job 1 is inserted behind Job 5. When there are available computing resources in the Kubernetes computing cluster, the deep learning training jobs in the run queue 321 are computed. Suppose, as shown in fig. 3, that Job 2 in the run queue 321 has just finished, so that the Kubernetes computing cluster has spare computing resources; the scheduler 31 then takes Job 3 from the queue 311 to be scheduled and places it in the run queue 321 to run. Next, the deep learning training job to be scheduled, namely Job 4, is selected: the cache manager 312 in the scheduler 31 queries whether the data Job 4 needs to read is cached in the data caching system 322 of the Kubernetes computing cluster. If the data Job 4 needs is completely cached, Job 4 is in a ready-to-run state and the scheduling ends; it only remains for the scheduler 31 to move Job 4 from the queue to be scheduled into the run queue 321 for running. If the data Job 4 needs is not completely cached, the data prefetcher 3122 in the cache manager 312 prefetches that data into the data caching system 322; after caching completes, Job 4 waits for the scheduler 31 to move it from the queue to be scheduled into the run queue 321. When Job 3 in the run queue 321 finishes, Job 4, whose data has been prefetched into the data caching system 322, is taken out to run in the run queue 321. At this point, because Job 4, Job 5, and the new Job 1 inserted after Job 5 share the same target data directory, the data Job 4 reads is the target data; since the target data is already cached in the data caching system 322, after Job 4 runs there is no need to cache the data Job 5 needs again, and the cached data corresponding to Job 4 in the data caching system 322 is directly reused. Scheduling new Job 1 after Job 5 proceeds in the same way as for Job 5.
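The dispatch flow just described (query the cache state of the job at the head of the queue, prefetch if its data is not completely cached, then hand the job to the run queue) might be sketched as follows, continuing the earlier types; the blocking Prefetch call and the channel standing in for the run queue 321 are simplifying assumptions of this sketch.

```go
// DataPrefetcher plays the role of the data prefetcher 3122; Prefetch is
// assumed to pull the data under dir into the data caching system and
// return only once caching is complete.
type DataPrefetcher struct {
	cache CacheSystem
}

func (p *DataPrefetcher) Prefetch(dir string) {
	// Implementation elided: walk dir in the storage system and load its
	// contents into the data caching system.
}

// DispatchNext is invoked when the cluster has free computing resources: it
// takes the job at the head of the queue to be scheduled, ensures its target
// data is cached, and hands it to the run queue.
func DispatchNext(queue []Task, sc *StatusCollector, pf *DataPrefetcher, runQueue chan<- Task) []Task {
	if len(queue) == 0 {
		return queue
	}
	next := queue[0]
	if next.TargetDir != "" && sc.CacheStateOf(next.TargetDir) != FullyCached {
		// Data not completely cached: prefetch before running, so the job
		// does not wait on remote IO once it starts.
		pf.Prefetch(next.TargetDir)
	}
	runQueue <- next // the computing cluster executes jobs from the run queue
	return queue[1:]
}
```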
In an exemplary embodiment of the present specification, when a user submits a new Job 2 whose target data is completely cached, the scheduler 31 searches the historical job information set 313 for the data directory having a mapping relation with the new job and uses it as the target data directory. The status collector 3121 in the cache manager 312 determines the target data directory of new Job 2 and confirms the cache state of the corresponding target data; specifically, the status collector 3121 obtains the cache state by accessing the data caching system 322 in the Kubernetes computing cluster. Since the target data of new Job 2 is completely cached, the insertion position of new Job 2 is determined to be the head of the queue 311 to be scheduled; assuming the queue 311 is as shown in fig. 3, the insertion position of new Job 2 is before Job 4. Because its target data is already completely cached, new Job 2 does not need the data prefetcher 3122 to cache anything into the data caching system and can be taken directly into the run queue 321 to run. With new Job 2 inserted at the head of the queue 311, that is, before Job 4, it can be taken out to the run queue 321 for computation as soon as the Kubernetes computing cluster has free computing resources.
In an exemplary embodiment of this specification, when a user submits a new Job 3 for which no corresponding target data directory can be found in the historical job information set, the data new Job 3 needs to read is presumed not to be cached in the data caching system 322. New Job 3 is therefore inserted at the tail of the queue 311 to be scheduled, that is, after Job 8. After the jobs ahead of new Job 3 have been taken out to the run queue 321 to run, the data prefetcher 3122 caches the data new Job 3 needs to read into the data caching system 322, completing the preparation before new Job 3 is taken out to run.
Fig. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present specification. Referring to fig. 5, at the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile memory 510; it may of course also include hardware required for other functions. The processor 502 reads the corresponding computer program from the non-volatile memory 510 into the memory 508 and runs it, forming a task scheduling apparatus at the logical level. Beyond a software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or logic devices.
Corresponding to the embodiment of the method, the specification also provides a task scheduling device.
Referring to fig. 6, a task scheduling apparatus may include:
a target data determining unit 601, configured to determine target data accessed by the received new task and a cache state of the target data;
an insertion position determining unit 602, configured to determine an insertion position of the new task in a queue to be scheduled based on the cache state;
a first new task inserting unit 603, configured to insert the new task into the queue to be scheduled according to the insertion position.
Optionally, the target data determining unit 601 may be specifically configured to:
and searching a data directory having a mapping relation with the new task in a historical task information set to serve as a target data directory, wherein data in the target data directory are the target data, and the historical task information set stores the mapping relation between the historical task and the historical data directory accessed by the historical task.
Optionally, the target data determining unit 601 may be further specifically configured to:
acquiring meta-information contained in the new task;
retrieving in the historical task information set based on meta-information included in the new task, wherein a mapping relation stored in the historical task information set is established based on historical meta-information of a historical task and a historical data directory accessed by the historical meta-information;
and if the historical meta-information which is consistent with the meta-information of the new task is searched in the historical task information set, determining the historical data directory corresponding to the historical meta-information as the target data directory.
Optionally, the access frequency of the historical data directories stored in the historical task information set used by the target data determining unit 601 is not less than a preset frequency threshold.
Optionally, the new task includes data source information, and the target data determining unit 601 may be specifically configured to:
and acquiring a target data directory from the data source information, wherein the data in the target data directory is the target data.
Optionally, the insertion position determining unit 602 may be specifically configured to:
if the target data are not completely cached, determining that the insertion position is an adjacent position of at least one task with the same target data as the new task in the queue to be scheduled;
and if the target data is completely cached, determining that the insertion position is the head of the queue to be scheduled.
Optionally, the task scheduling device may further include:
a second new task inserting unit 604, configured to insert the new task into the tail of the queue to be scheduled if the target data cannot be determined.
Referring to fig. 7, the present specification further provides a scheduling apparatus for deep learning training jobs, which may include:
a new job target data determination unit 701 configured to determine target data accessed by a received new job and a cache state of the target data;
a new job insertion position determining unit 702, configured to determine, based on the cache state, an insertion position of the new job in a queue to be scheduled;
a first new job inserting unit 703, configured to insert the new job into the queue to be scheduled according to the insertion position.
Optionally, the new job target data determination unit 701 may specifically be configured to:
and searching a data directory having a mapping relation with the new job in a historical job information set to serve as a target data directory, wherein data under the target data directory is the target data, and the historical job information set stores the mapping relation between the historical job and the historical data directory accessed by the historical job.
Optionally, the new job target data determination unit 701 may further specifically be configured to:
acquiring meta-information contained in the new job;
retrieving in the historical job information set based on meta-information included in the new job, wherein a mapping relation stored in the historical job information set is established based on historical meta-information of a historical job and a historical data directory accessed by the historical meta-information;
and if history meta information which is consistent with the meta information of the new job is searched in the history job information set, determining a history data directory corresponding to the history meta information as the target data directory.
Optionally, the access frequency of the historical data directories stored in the historical job information set used by the new job target data determination unit 701 is not less than a preset frequency threshold.
Optionally, the new job includes data source information, and the new job target data determination unit 701 may specifically be configured to:
and acquiring a target data directory from the data source information, wherein the data in the target data directory is the target data.
Optionally, the new job insertion position determining unit 702 may be specifically configured to:
if the target data are not completely cached, determining that the insertion position is the adjacent position of at least one job with the same target data as the new job in the queue to be scheduled;
and if the target data is completely cached, determining that the insertion position is the head of the queue to be scheduled.
Optionally, the deep learning training job scheduling apparatus may further include:
a second new job inserting unit 704, configured to insert the new job into the tail of the queue to be scheduled if the target data cannot be determined.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
In one or more embodiments of the present specification, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (11)

1. A method for task scheduling, comprising:
determining target data accessed by the received new task and the cache state of the target data;
determining the insertion position of the new task in a queue to be scheduled based on the cache state;
and inserting the new task into the queue to be scheduled according to the insertion position.
2. The method of claim 1, wherein determining the target data accessed by the received new task comprises:
and searching a data directory having a mapping relation with the new task in a historical task information set to serve as a target data directory, wherein data in the target data directory are the target data, and the historical task information set stores the mapping relation between the historical task and the historical data directory accessed by the historical task.
3. The method of claim 2, wherein the searching the data directory having a mapping relationship with the new task in the historical task information set as the target data directory comprises:
acquiring meta-information contained in the new task;
retrieving in the historical task information set based on the meta-information contained in the new task, wherein the mapping relation stored in the historical task information set is established based on the historical meta-information of the historical task and a historical data directory accessed by the historical meta-information;
and if the historical meta-information which is consistent with the meta-information of the new task is searched in the historical task information set, determining the historical data directory corresponding to the historical meta-information as the target data directory.
4. The method of claim 2, wherein the access frequency of a historical data directory stored in the historical task information set is not less than a preset frequency threshold.
5. The method of claim 1, wherein the new task includes data source information, and wherein determining target data accessed by the received new task comprises:
and acquiring a target data directory from the data source information, wherein the data in the target data directory is the target data.
6. The method of claim 1, wherein said determining an insertion position of the new task in a queue to be scheduled based on the cache state comprises:
if the target data are not completely cached, determining that the insertion position is an adjacent position of at least one task with the same target data as the new task in the queue to be scheduled;
and if the target data is completely cached, determining that the insertion position is the head of the queue to be scheduled.
7. The method of claim 1, further comprising:
and if the target data cannot be determined, inserting the new task into the tail part of the queue to be scheduled.
8. A deep learning training job scheduling method is characterized by comprising the following steps:
determining target data accessed by the received new job and the cache state of the target data;
determining the insertion position of the new job in a queue to be scheduled based on the cache state;
and inserting the new job into the queue to be scheduled according to the insertion position.
9. A task scheduling apparatus, comprising:
the target data determining unit is used for determining target data accessed by the received new task and the cache state of the target data;
the inserting position determining unit is used for determining the inserting position of the new task in the queue to be scheduled based on the cache state;
and the first new task inserting unit is used for inserting the new task into the queue to be scheduled according to the insertion position.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-8 are implemented when the processor executes the program.
CN202210118739.7A 2022-02-08 2022-02-08 Task scheduling method and device Pending CN114579269A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210118739.7A CN114579269A Task scheduling method and device


Publications (1)

Publication Number Publication Date
CN114579269A 2022-06-03

Family

ID=81770664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210118739.7A Pending CN114579269A (en) 2022-02-08 2022-02-08 Task scheduling method and device

Country Status (1)

Country Link
CN (1) CN114579269A


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130047162A1 (en) * 2011-08-19 2013-02-21 Canon Kabushiki Kaisha Efficient cache reuse through application determined scheduling
CN103198025A (en) * 2012-01-04 2013-07-10 国际商业机器公司 Method and system form near neighbor data cache sharing
CN105302840A (en) * 2014-07-31 2016-02-03 阿里巴巴集团控股有限公司 Cache management method and device
US20160179682A1 (en) * 2014-12-18 2016-06-23 Bluedata Software, Inc. Allocating cache memory on a per data object basis
CN110990302A (en) * 2019-11-22 2020-04-10 北京云宽志业网络技术有限公司 Data caching method and device, electronic equipment and storage medium
CN110908612A (en) * 2019-11-27 2020-03-24 腾讯科技(深圳)有限公司 Cache management method, device, equipment and storage medium
CN113760640A (en) * 2020-11-13 2021-12-07 北京沃东天骏信息技术有限公司 Monitoring log processing method, device, equipment and storage medium
CN113312278A (en) * 2021-07-29 2021-08-27 常州楠菲微电子有限公司 Device and method for statically allocating shared multi-queue cache
CN113988306A (en) * 2021-09-28 2022-01-28 阿里巴巴(中国)有限公司 Sample data processing method, device, equipment and storage medium
CN113986981A (en) * 2021-11-11 2022-01-28 湖南快乐阳光互动娱乐传媒有限公司 Data synchronization method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANG-TIEN TRAN et al., "Hit Ratio and Latency Optimization for Caching Systems: A Survey", 2021 International Conference on Information Networking (ICOIN), 2 February 2021, pages 577-581 *
ZHANG Hongjun et al., "A Hybrid-Access Cache Index Framework Adapted to GPU" (in Chinese), Journal of Software, vol. 31, no. 10, 14 October 2020, pages 3038-3055 *
FAN Yanfang et al., "Optimization of Processor Full-Digital Simulation Based on a Multi-Level Queue Cache Eviction Algorithm" (in Chinese), Computer Measurement & Control, vol. 26, no. 06, 25 June 2018, pages 180-183 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination