US20140208072A1 - User-level manager to handle multi-processing on many-core coprocessor-based systems - Google Patents
- Publication number
- US20140208072A1 (application US 13/858,034)
- Authority
- US
- United States
- Prior art keywords
- offload
- coprocessor
- cores
- core
- processes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5033—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
Definitions
- the present application relates to multi-core processing.
- the Intel Xeon Phi is a recently introduced x86-based 60-core, 240-thread coprocessor that is increasingly being deployed in servers and clusters. It is easier to program than other manycore processors, and runs the Linux operating system.
- the operating system provides services such as virtual memory and context switching, and enables multiple processes to run concurrently and share the coprocessor. Multi-processing on the manycore is also necessary to fully utilize the hardware resources of the Xeon Phi.
- It is remarkably easy to offload processing to the Xeon Phi: it supports a popular ISA (x86), a popular OS (Linux), and a popular programming model (OpenMP). Unfortunately, quick and easy portability rarely results in an implementation that executes faster on the Xeon Phi.
- a programmer employs pragmas to identify code regions to be “offloaded” to the coprocessor, for which the compiler automatically generates coprocessor instructions along with glue code to transfer data. This is referred to as the “offload programming model” where the main trunk of the code runs on the host processor while regions identified by the pragmas are offloaded to run on the Xeon Phi coprocessor.
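The offload programming model described above can be sketched as follows. This is an illustrative example, not code from the patent; the pragma below is honored by Intel's compiler, which generates coprocessor instructions and data-transfer glue for the loop, while other compilers simply ignore the unknown pragma and run the loop on the host.

```c
#include <stddef.h>

/* Sketch of the offload programming model: the main trunk runs on the
   host, and the pragma marks this loop for offload to the coprocessor.
   The in/out clauses describe the data-transfer glue the compiler emits. */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    #pragma offload target(mic) in(a, b : length(n)) out(c : length(n))
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```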
- Such application tuning can work well for individual applications when they “own” the coprocessor, but not in a multi-processing environment.
- the Xeon Phi coprocessor runs Linux, which makes it easy for multiple processes to share the coprocessor.
- Such a use case of a manycore like the Xeon Phi where multiple processes compete for coprocessor resources is likely not only in cluster and cloud deployments, but also in servers since good hardware utilization is essential.
- directives introduced by the programmer specifically to enhance manycore coprocessor performance can be counter-productive. For instance, programmers must select the number of threads, map them to cores, ensure memory is not over-subscribed, and manage their workload across multiple coprocessors. But these can degrade performance in a multi-processing environment when one programmer is unaware of other programmers' intentions.
- Programming models such as SWARM provide an API to represent workloads as tasks, compile them to specific processors including the Xeon Phi, and use a runtime to manage the tasks on distributed heterogeneous nodes.
- Compilers such as CAPS can produce host and Xeon Phi code starting from OpenACC directives. Libraries with parallelized, high-performance numerical code for the Xeon Phi have also been developed.
- Cluster management middleware can schedule workloads on the Xeon Phi.
- Virtualization approaches such as ScaleMP provide a hypervisor that can “virtualize” the Xeon Phi and host into a single entity visible to programmers.
- An operating system runtime (MPSS) is available on top of the Xeon Phi micro kernel (OS) to perform primitive scheduling of offloads.
- a method is disclosed to manage a multi-processor system with one or more multiple-core coprocessors by intercepting coprocessor offload infrastructure application program interface (API) calls; scheduling user processes to run on one of the coprocessors; scheduling offloads within user processes to run on one of the coprocessors; and affinitizing offloads to predetermined cores within one of the coprocessors by selecting and allocating cores to an offload, and obtaining a thread-to-core mapping from a user.
- a server populated with one or more multiple core Xeon Phi coprocessors includes a manager to control user processes containing offload blocks by intercepting COI API calls and schedules user processes to run on one of the Xeon Phi coprocessors; schedules offloads within user processes to run on one of the Xeon Phi coprocessors; and affinitizes offloads to specific cores within one of the Xeon Phi coprocessors by selecting and allocating cores to an offload, and obtaining the thread-to-core mapping from the user.
- the system provides a middleware on top of the Xeon Phi micro kernel and the Intel runtime.
- the middleware handles multi-processing on Xeon Phi coprocessor-based servers by automatically avoiding thread and memory oversubscription and load balancing processes across the cores of the Xeon Phi and across several Xeon Phi coprocessors.
- the system is completely transparent to the users and requires no changes to the underlying software such as the MPSS and the Linux kernel running on the coprocessor. It uses a scheduling technique to schedule processes and Xeon Phi offload regions within processes simultaneously. It also uses algorithms to set thread affinity and load balance processes across coprocessors.
- the system achieves faster operation when multiple processes share a many integrated core coprocessor system.
- Faster operation includes end-to-end turn-around-time per process (latency), as well as the number of processes completed per unit time (throughput).
- the system protects against thread and memory over-subscription, which can result in severe performance loss and crashes.
- a coprocessor manages cores such that offloads of different processes run on separate sets of cores, and offloads in the same process use the same cores (thus respecting data affinity).
- the system balances the load of multiple processes across multiple Xeon Phi coprocessors.
- the manager provides a transparent user-level middleware that includes a suite of run-time techniques explicitly designed to enhance performance portability in the presence of multi-processing.
- FIG. 1 shows an exemplary process manager for a multiprocessor system.
- FIG. 2 shows an exemplary software stack with host and co-processor components of the multi-processor software stack.
- FIG. 3 shows an exemplary flow of the process manager of FIG. 1 .
- FIG. 4 shows an exemplary architecture of the process manager of FIG. 1 .
- FIG. 5 shows an exemplary scheduling procedure for the system of FIG. 1 .
- FIG. 6 shows an exemplary method for aging-based first-fit procedure for process selection.
- FIG. 1 shows a high-level view of a process manager 20 called the COSMIC system.
- COSMIC manages offloads from several user processes 10 , 12 and 14 .
- Each user process contains several offload blocks that are executed sequentially. The process has a single memory requirement covering all its offloads, while the offloads themselves have their own thread requirements. Thus, before execution, every process requests memory from the process manager 20 , and every offload requests threads from the process manager 20 .
- COSMIC arbitrates the requests by taking into consideration the different available coprocessors, the available cores within each device and the available memory. It then schedules and allocates resources for the offloads in such a way that thread and memory oversubscription are avoided, and the devices as well as the cores within them are load balanced.
- COSMIC manages processes and coprocessor resources in order to avoid thread and memory oversubscription and to load balance processes across coprocessor cores and devices.
- one implementation requires that the programmer specify the maximum memory required on the Xeon Phi for each process, similar to job submission requirements in cluster schedulers. In typical cases, different offloads of the same process often share data in order to reduce data movement between the host and the Xeon Phi; thus, as long as the process exists, it will use memory on the card. Unlike cluster schedulers, however, this embodiment does not require that the process specify cores, devices or other resources, but infers them automatically from the number of threads requested by each offload. Unlike memory, which is reserved for the life of a process, threads (and cores) are given to an offload when it starts executing and released when the offload completes for use by other offloads.
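The two resource lifetimes described above (memory held for the life of a process, threads leased per offload) can be sketched as simple accounting. All names and numbers below are illustrative assumptions, not COSMIC's actual data structures.

```c
#include <stdbool.h>

/* Hypothetical accounting sketch: memory is reserved at process admission
   and held until the process exits, while threads are leased per offload
   and returned when the offload completes. */
typedef struct {
    long free_mem_mb;   /* device memory still unreserved              */
    int  free_threads;  /* hardware threads not leased to any offload  */
} device_t;

/* Reserve memory once, at process admission; held for the process's life. */
bool admit_process(device_t *d, long mem_mb) {
    if (mem_mb > d->free_mem_mb) return false;
    d->free_mem_mb -= mem_mb;
    return true;
}
void exit_process(device_t *d, long mem_mb) { d->free_mem_mb += mem_mb; }

/* Lease threads only for the duration of one offload. */
bool start_offload(device_t *d, int threads) {
    if (threads > d->free_threads) return false;
    d->free_threads -= threads;
    return true;
}
void end_offload(device_t *d, int threads) { d->free_threads += threads; }
```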
- COSMIC has several parameters that may be set by the server administrator or user that can affect its policies and behavior.
- An administrator can configure several parameters of COSMIC to affect its scheduling decisions; these parameters are detailed later.
- FIG. 2 shows a block diagram of the Xeon Phi software stack, and where COSMIC fits in.
- the left half of the figure shows the stack on the host processor, while the right half shows the stack running on the coprocessor.
- the top half represents user space, and the bottom half represents kernel space.
- the host processor runs a Linux kernel 122 with a PCI and card driver 124 - 126 to communicate with the card.
- a Symmetric Communication Interface (SCIF) driver 120 is provided for inter-node communications.
- a node can be a Xeon Phi device or the host processor.
- SCIF 120 provides a set of APIs for communication and abstracts the details of communicating over the PCIe bus.
- the Coprocessor Offload Infrastructure (COI) 112 is a higher-level framework providing a set of APIs to simplify development of applications using the offload model.
- COI provides APIs for loading and launching device code, asynchronous execution and data transfer between the host and Xeon Phi.
- the coprocessor portion of the software stack consists of a modified Linux kernel 156 , the PCI driver 154 and the standard Linux proc file system 152 that can be used to query device state (for example, the load average).
- the coprocessor portion also has a SCIF driver 158 for communicating over the PCI bus with the host and other nodes.
- the COI 112 communicates with a COSMIC host component 110 that communicates with user processes 100 - 104 .
- the host component 110 interacts with a COSMIC coprocessor component 160 that handles offloaded portions of user processes 162 .
- COSMIC host middleware component has a global view of all processes and offloads emanating from the host, and knowledge of the states of all coprocessor devices.
- COSMIC is architected to be lightweight and completely transparent to users of the Xeon Phi system. As shown in FIG. 2 , COSMIC exists in the user space, but interacts with both user processes and other kernel-level components. It controls offload scheduling and dispatch by intercepting COI API calls that are used to communicate with the device. This is a key mechanism of COSMIC that enables it to transparently gain control of how offloads are managed. We first briefly describe by way of example how offload blocks are expressed using the COI API.
- the Xeon Phi compiler converts all offload blocks that are marked by pragmas into COI calls.
- the user's program with offload pragmas is compiled using Intel's icc or a gcc cross-compiler for the Xeon Phi.
- the compiler produces a host binary, and Xeon Phi binaries for all the offload portions.
- the offload portions are first translated into a series of COI API calls. The figure shows the important calls for a simple example: first COIEngineGetCount and COIEngineGetHandle get a handle to the coprocessor specified in the pragma. Then COIProcessCreateFromFile creates a process from the binary corresponding to the offload portions.
- COIProcessGetFunctionHandles acquires the handles to these functions.
- COIPipelineCreate creates a “COI pipeline” which consists of 3 stages: one to send data to the coprocessor, one to perform the computation and one to get data back from the coprocessor. Then COIBufferCreate creates buffers necessary for inputs and outputs to the offload. In this example, three COI buffers corresponding to the arrays a, b and c are created.
- COIBufferCopy transfers data to the coprocessor, and COIPipelineRunFunction executes the function corresponding to the offload block. Finally, another COIBufferCopy gets results (i.e., array c) back from the Xeon Phi.
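The COI call sequence above can be followed in a stubbed, host-only simulation. The function names below echo the COI calls they stand in for, but the signatures are simplified placeholders, NOT the real COI prototypes; the "device" memory is an ordinary host array.

```c
#include <string.h>
#include <stddef.h>

typedef struct { float *data; size_t len; } buffer_t;  /* stand-in for a COI buffer */

static void buffer_create(buffer_t *buf, float *backing, size_t len) {
    buf->data = backing;                          /* cf. COIBufferCreate */
    buf->len  = len;
}
static void buffer_copy_in(buffer_t *dst, const float *src, size_t len) {
    memcpy(dst->data, src, len * sizeof *src);    /* cf. COIBufferCopy (host -> device) */
}
/* The offload body that the compiler would place in the Xeon Phi binary. */
static void offload_add(const buffer_t *a, const buffer_t *b, buffer_t *c) {
    for (size_t i = 0; i < c->len; i++)
        c->data[i] = a->data[i] + b->data[i];
}

/* End-to-end sequence mirroring the documented pipeline stages:
   send data, compute, get data back (cf. COIPipelineRunFunction). */
void run_offload_example(const float *a_in, const float *b_in,
                         float *c_out, size_t n) {
    float a_dev[16], b_dev[16], c_dev[16];        /* pretend device memory */
    buffer_t a, b, c;
    buffer_create(&a, a_dev, n);
    buffer_create(&b, b_dev, n);
    buffer_create(&c, c_dev, n);
    buffer_copy_in(&a, a_in, n);
    buffer_copy_in(&b, b_in, n);
    offload_add(&a, &b, &c);
    memcpy(c_out, c_dev, n * sizeof *c_out);      /* cf. COIBufferCopy (device -> host) */
}
```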
- COSMIC can transparently control offload scheduling and dispatch.
- COSMIC is architected as two components implemented as separate processes: the COSMIC client and the COSMIC server.
- the COSMIC client is the front-end, while the COSMIC server is the back-end consisting of the scheduler and the monitor.
- the monitor comprises a host portion and a card-side portion, as depicted in FIG. 2 . Inter-process interfaces are clearly defined: each process communicates with the other two using explicit messages.
- the COSMIC client is responsible for intercepting COI calls and communicating with the scheduler in the COSMIC server to request access to a coprocessor. It accomplishes this using library interposition. Every user process links with the Intel COI shared library that contains definitions for all COI API functions.
- COSMIC intercepts and redefines every COI API function: the redefined COI functions perform COSMIC-specific tasks, such as communicating with the COSMIC scheduler, and then finally call the actual COI function.
- COSMIC creates its own shared library that is pre-loaded into the application (using either LD_PRELOAD or by redefining LD_LIBRARY_PATH). The pre-loading ensures that COSMIC's library is used first to resolve any COI API function.
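The wrap-then-forward pattern of library interposition can be shown in miniature. COSMIC's real mechanism is a pre-loaded shared library whose redefined symbols forward via dlsym(RTLD_NEXT, ...); this self-contained sketch uses a function pointer in place of the dynamic-linker lookup, and all names are illustrative, not the real COI API.

```c
/* In-process sketch of interposition: the application calls the
   redefined entry point, which does bookkeeping (in COSMIC, talking to
   the scheduler) and then forwards to the "real" implementation. */
static int intercepted_calls = 0;

static int real_run_function(int device_id) {   /* stands in for the genuine COI call */
    return device_id * 10;                      /* pretend result */
}

typedef int (*run_fn)(int);
/* What dlsym(RTLD_NEXT, "run_function") would return under LD_PRELOAD. */
static run_fn next_run_function = real_run_function;

/* The redefined symbol the application actually resolves to. */
int run_function(int device_id) {
    intercepted_calls++;                  /* COSMIC-specific work happens here */
    return next_run_function(device_id);  /* then forward to the real function */
}

int interception_count(void) { return intercepted_calls; }
```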
- the client sends the following different messages to the scheduler in the COSMIC server:
- NewProcess: When an offload is first encountered for a process, the client sends a NewProcess message to the scheduler indicating that the scheduler should account for a new process in its book-keeping. Every new process is annotated with its memory requirement provided by the user.
- NewOffload: For every offload, the client sends a NewOffload message to the scheduler indicating the process to which the offload belongs and the number of threads it is requesting. It also indicates the size of the buffers that need to be transferred to the coprocessor for this offload.
- OffloadComplete: When an offload completes, the client sends an OffloadComplete message to the scheduler so that it can account for the newly freed resources such as coprocessor cores and threads.
- ProcessComplete: When a process completes, the client sends a ProcessComplete message to the scheduler to account for the freed memory used by the process.
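The four client-to-scheduler messages above can be sketched as a tagged record. The field names and layout are assumptions for illustration, not COSMIC's actual wire format.

```c
/* Sketch of the client-to-scheduler message types listed above. */
typedef enum {
    MSG_NEW_PROCESS,       /* carries the process's memory requirement      */
    MSG_NEW_OFFLOAD,       /* carries thread request and buffer size        */
    MSG_OFFLOAD_COMPLETE,  /* lets the scheduler free cores/threads         */
    MSG_PROCESS_COMPLETE   /* lets the scheduler free the reserved memory   */
} msg_kind;

typedef struct {
    msg_kind kind;
    int      pid;           /* owning process                               */
    long     mem_mb;        /* NewProcess: memory requirement               */
    int      threads;       /* NewOffload: threads requested                */
    long     buffer_bytes;  /* NewOffload: data to ship to the coprocessor  */
} cosmic_msg;

cosmic_msg make_new_offload(int pid, int threads, long buffer_bytes) {
    cosmic_msg m = { MSG_NEW_OFFLOAD, pid, 0, threads, buffer_bytes };
    return m;
}
```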
- the COSMIC scheduler is the key actor in the COSMIC system and manages multiple user processes with offloads and several coprocessor devices by arbitrating access to coprocessor resources. It runs completely on the host and has global visibility into every coprocessor in the system. In scheduling offloads and allocating resources, it ensures no thread and memory oversubscription and load balances coprocessor cores and devices to most efficiently use them.
- COSMIC scheduler concurrently schedules processes and offloads within the processes.
- Each process has a memory requirement, while each offload has a thread requirement.
- Various coprocessors in the system may have different memory and thread availabilities.
- the goal of the scheduler is to schedule processes and offloads by mapping processes to Xeon Phi coprocessors and offloads to specific cores on the coprocessors.
- the scheduler also ensures fairness, i.e., makes sure all processes and offloads eventually get access to coprocessor resources.
- the scheduler is event-based, i.e., a scheduling cycle is triggered by a new event.
- a new event can be the arrival of a new process, the arrival of a new offload in an existing process, the dispatching of an offload to a Xeon Phi device, the completion of an offload or the completion of a process.
- a queue of pending processes is maintained: each arriving new process is added to the tail of the pending process queue.
- a process is eventually scheduled to one Xeon Phi coprocessor.
- the scheduler also maintains a queue of pending offloads for each Xeon Phi coprocessor in the system. Each new offload is added to the tail of the offload queue belonging to the Xeon Phi coprocessor on which its process has been scheduled.
- FIG. 3 shows the workflow through the Xeon Phi software stack when multiple processes are issued. They are all fed to the Xeon Phi MPSS runtime, which often serializes them in order to avoid crashing the coprocessor.
- the manager 20 avoids this by intercepting COI calls at 202 , taking control of the processes and offloads. Specifically, in 210 the manager 20 performs process scheduling and offload scheduling, and affinitizes offloads to specific cores on the coprocessor. Once this is done, it issues the processes and offloads to the MPSS at 204 and continues with the Linux operating system 206 .
- FIG. 4 shows an exemplary architecture of the process manager of FIG. 1 .
- COSMIC is architected as two components implemented as separate processes: clients 310 - 312 that communicate with a library 316 , scheduler 320 and monitor 326 , the latter comprising a host portion and a card-side portion. Inter-process interfaces are clearly defined: each process communicates with the other two using explicit messages.
- FIG. 5 shows an exemplary scheduling procedure.
- a pending process is selected and scheduled to a coprocessor that has enough free memory.
- offload queues corresponding to each Xeon Phi are examined, and the scheduler dispatches an offload to each coprocessor if it has enough free threads. Both processes and offloads are selected based on an aging-based first-fit heuristic.
- let P be the process at the head of the pending process queue ( 402 ).
- the scheduler maintains a circular list of the Xeon Phi coprocessors in the system.
- let D be the next coprocessor in the list ( 404 ).
- the scheduler checks to see if the memory required by P fits in the available memory of D ( 406 ). If it does, P is removed from the queue and dispatched to D ( 408 ). If not, the next coprocessor in the circular list is examined ( 410 ). If P does not fit in any coprocessor, its age is incremented, and the next pending process is examined ( 412 ). When a process' age reaches a threshold, all scheduling is blocked until that process is scheduled ( 414 ). This ensures fairness since all processes will get a chance at being scheduled.
- Scheduling an offload is similar to scheduling a process, with one difference. Instead of memory, an offload has a thread requirement; COSMIC checks if the threads requested by an offload are available on the coprocessor on which the offload's owner process has been scheduled. If so, the offload is dispatched. If not, it increments the offload's age, and examines the next offload in the queue.
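The aging-based first-fit procedure above (cf. FIG. 6) can be sketched as follows. The data structures and the threshold value are assumptions for illustration; the real scheduler's bookkeeping is richer.

```c
#include <stdbool.h>

/* Aging-based first-fit process selection: dispatch the pending process
   to the first coprocessor in a circular list with enough free memory;
   if none fits, bump the process's age. Once an age reaches the
   threshold, all scheduling blocks until that process is placed,
   which guarantees fairness. */
enum { NDEV = 4, AGE_THRESHOLD = 5 };

typedef struct { long free_mem_mb; } phi_dev;
typedef struct { long mem_mb; int age; } pending_proc;

/* Returns the index of the device the process was dispatched to,
   or -1 if it fit nowhere (in which case its age is incremented). */
int schedule_process(pending_proc *p, phi_dev devs[NDEV], int start_dev) {
    for (int i = 0; i < NDEV; i++) {
        int d = (start_dev + i) % NDEV;        /* circular scan of devices */
        if (p->mem_mb <= devs[d].free_mem_mb) {
            devs[d].free_mem_mb -= p->mem_mb;  /* reserved for process lifetime */
            return d;
        }
    }
    p->age++;                                  /* aged; retried next cycle */
    return -1;
}

/* When any pending process is too old, the scheduler stalls new work
   until that process is scheduled. */
bool scheduling_blocked(const pending_proc *q, int nq) {
    for (int i = 0; i < nq; i++)
        if (q[i].age >= AGE_THRESHOLD) return true;
    return false;
}
```

Offload scheduling follows the same shape with a thread check against the coprocessor that owns the offload's process, per the description above.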
- An administrator can specify the following parameters to tailor the scheduler's behavior: (i) aging threshold, (ii) thread over-scheduling factor and (iii) memory over-scheduling factor. The latter two indicate to what extent threads and memory may be oversubscribed.
- the COSMIC monitor collects data pertaining to the state of the coprocessors. It has a host-side component, and a component that is resident on each of the coprocessors.
- the host-side component is primarily responsible for communicating with the scheduler and all the coprocessor-side components.
- the coprocessor-side components monitor the load on the device, the number of threads requested by each offload and the health of each offload process. If a process dies for any reason, the monitor catches this and reports the reason to the COSMIC scheduler.
- COSMIC selects the cores that are used by an offload, and affinitizes threads to these cores using programmer directives.
- the core selection procedure for an offload is discussed next.
- COSMIC's core selection algorithm scans one or more lists of free physical cores to select cores until it finds enough cores for a given offload region.
- the number of cores assigned to an offload region is the number of threads used by the offload region divided by the thread-to-core ratio N.
- the order of the core lists from which COSMIC selects cores reflects the preference of COSMIC's core selection strategy.
- the first list of physical cores for a new offload region consists of cores that are both free and only used by the earlier offloads coming from the same offload process. If more physical cores are needed, COSMIC picks from a second list of physical cores, which are both free and not currently assigned to other offload processes. If still more cores are needed, COSMIC forms a third list of physical cores, which are the remaining free cores not yet selected.
- To ensure efficient execution of offloads, COSMIC adopts several policies to make sure the hardware resources of a Xeon Phi coprocessor are well utilized but not oversubscribed. These policies and their implementations are discussed below.
- COSMIC limits the total number of actively running software threads on a Xeon Phi device to ensure that the device's physical cores are not oversubscribed. When an offload region is running, all of the threads spawned by the offload process are considered active; otherwise the threads are considered dormant. COSMIC keeps track of the number of active and dormant software threads spawned by offload processes on a Xeon Phi device. It maintains the ratio between the total number of active software threads spawned by all offload processes and the number of physical cores at no more than an integer N, which is configured as 4 in our current implementation of COSMIC. Therefore COSMIC only schedules an offload region to run if the thread-to-core ratio will not exceed N after the offload region starts.
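The admission check implied by this policy is small enough to state directly. This is a sketch under the stated N = 4; the function name is an assumption.

```c
#include <stdbool.h>

enum { MAX_THREADS_PER_CORE = 4 };  /* the integer N described above */

/* An offload region may start only if the active-thread-to-core ratio
   stays within N after it launches. */
bool may_start_offload(int active_threads, int requested_threads,
                       int physical_cores) {
    long limit = (long)physical_cores * MAX_THREADS_PER_CORE;
    return active_threads + requested_threads <= limit;
}
```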
- COSMIC uses several mechanisms to detect the number of software threads created by an offload process on a Xeon Phi device.
- First COSMIC inspects the environment variable MIC_OMP_NUM_THREADS of a submitted job.
- COSMIC also intercepts omp_set_num_threads_target( ) function calls on the host.
- Finally, on a Xeon Phi device, an offload process's call to omp_set_num_threads( ) is also intercepted and the number of threads is reported back to the host COSMIC process.
- COSMIC relies on the OMP library to get the number of physical cores on a Xeon Phi device.
- the monitor running on each Xeon Phi device queries the maximum number of threads by calling omp_get_max_threads( ) and communicates the returned value to COSMIC's host process.
- COSMIC then divides this number by 4 to obtain the number of processing cores on a Xeon Phi device. Note that the number of physical cores derived in this way is generally one less than the real number of physical cores; we believe this is because one core is reserved for the OS, and thus we do not adjust the derived number.
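The two detection mechanisms above can be sketched as follows. To keep the sketch self-contained, the monitor's omp_get_max_threads( ) value is passed in as a parameter rather than queried through OpenMP; the environment-variable path mirrors the MIC_OMP_NUM_THREADS inspection. Function names are assumptions.

```c
#include <stdlib.h>

/* Derive the usable core count from the OMP-reported thread maximum by
   dividing by 4 hardware threads per core. As noted above, this is
   typically one less than the true core count (one core serves the OS)
   and is used as-is. */
int derive_core_count(int omp_max_threads) {
    return omp_max_threads / 4;
}

/* Mirror of the MIC_OMP_NUM_THREADS inspection on a submitted job;
   falls back to a caller-supplied default when the variable is unset
   or unparsable. */
int requested_threads_from_env(int fallback) {
    const char *s = getenv("MIC_OMP_NUM_THREADS");
    if (!s) return fallback;
    int n = atoi(s);
    return n > 0 ? n : fallback;
}
```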
- COSMIC creates physical-core containers and sets thread affinity automatically. The goal is to avoid threads migrating from one core to another so data in the registers and the private L1 cache can be reused, and to have multiple concurrent offload processes so the overall system utilization remains high.
- COSMIC assigns different, non-overlapping sets of physical cores to execute concurrent offload regions.
- the selected cores constitute a physical-core container that the offload can use exclusively.
- the physical-core containers minimize the interference between multiple concurrent offloads on the same Xeon Phi device due to resource contention.
- the physical-core container expires after the execution of its offload region completes, and the assigned cores are released for use by other offloads. Notice that a physical core on a Xeon Phi device can be assigned to multiple offload processes but used by at most one active offload region at any given point in time.
- a free core is a core not currently used by any offload region (but it may be assigned to one or more offload processes.)
- the core selection algorithm uses simple linear arrays to track the status of the physical cores so that it can efficiently construct the various lists of physical cores to choose from. For each Xeon Phi device the algorithm maintains two arrays, called F and G. Entry i of an array stores the status of physical core i: array F indicates which cores are free, and array G records the number of offload processes that a core is assigned to. Further, the algorithm maintains for each active offload process an array, called P, listing the cores assigned to the process's latest offload region (P[i] is 1 iff core i is assigned).
- the algorithm constructs the various lists of cores to select for an offload region.
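A sketch of core selection over the F/G/P arrays described above. The three passes encode the stated preference order: free cores already used by this process's earlier offloads, then free cores assigned to no process, then any remaining free cores. The fixed core count and function shape are assumptions for illustration.

```c
enum { NCORES = 8 };  /* illustrative; a real device has many more cores */

/* F[i]: core i is free; G[i]: number of offload processes core i is
   assigned to; P[i]: core i was used by this process's latest offload
   region. Writes chosen core ids into out[]; returns the number of
   cores selected (may be < wanted if not enough free cores exist). */
int select_cores(const int F[NCORES], const int G[NCORES],
                 const int P[NCORES], int wanted, int out[]) {
    int taken[NCORES] = {0}, n = 0;
    for (int pass = 0; pass < 3 && n < wanted; pass++) {
        for (int i = 0; i < NCORES && n < wanted; i++) {
            if (!F[i] || taken[i]) continue;
            int ok = (pass == 0) ? P[i]          /* data-affinity cores  */
                   : (pass == 1) ? (G[i] == 0)   /* unassigned free cores */
                   : 1;                          /* any remaining free    */
            if (ok) { taken[i] = 1; out[n++] = i; }
        }
    }
    return n;
}
```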
- COSMIC ensures that the overall memory requirement of the offload processes running on a single Xeon Phi device does not exceed the amount of the device's physical memory.
- COSMIC keeps track of the amount of available physical memory for each Xeon Phi device.
- COSMIC queries each MIC device for the amount of free physical memory using a COI function (COIEngineGetInfo).
- When a user submits a job to COSMIC, the user needs to inform COSMIC of the total amount of memory the process needs on a Xeon Phi device through an environment variable, COSMIC_MEMORY.
- COSMIC only launches a submitted job if there is one Xeon Phi device with enough free memory to meet the memory requirement of the job.
- COSMIC can be optionally configured to terminate any running process that uses more Xeon Phi memory than the amount specified by the user.
- COSMIC relies on Linux's memory resource controller to set up a memory container for each offload process on a Xeon Phi device. Each container limits the real committed memory usage of the offload process to the user-specified maximum value. If a process's memory footprint goes over the limit, the memory resource controller invokes Linux's out-of-memory killer (oom-killer) to terminate the offending process.
- the memory resource controller is not enabled in the default Xeon Phi OS kernel. Installing a new kernel with the memory resource controller requires adding one line to the kernel configuration file, recompiling the kernel, and rebooting the Xeon Phi cards with the new kernel image.
- the runtime performance overhead due to using the Linux memory controller ranges from negligible to about 5% in real applications.
- in a many integrated core (MIC) co-processor, the cores, PCIe interface logic, and GDDR5 memory controllers are connected via an Interprocessor Network (IPN) ring, which can be thought of as an independent bidirectional ring.
- the L2 caches are shown as slices per core, but can also be thought of as a fully coherent cache, with a total size equal to the sum of the slices.
- Information can be copied to each core that uses it to provide the fastest possible local access, or a single copy can be present for all cores to provide maximum cache capacity.
- the co-processor is the Intel® Xeon Phi™ coprocessor, which can support up to 61 cores (making a 31 MB L2 cache) and 8 memory controllers with 2 GDDR5 channels each.
- Co-resident with each core structure is a portion of a distributed tag directory. These tags are hashed to distribute workloads across the enabled cores. Physical addresses are also hashed to distribute memory accesses across the memory controllers.
- Each Xeon Phi core is dual-issue in-order, and includes 16 32-bit vector lanes. The performance of each core on sequential code is considerably slower than its multi-core counterpart. However, each core supports 4 hardware threads, resulting in good aggregate performance for highly parallelized and vectorized kernels. This makes the offload model, where sequential code runs on the host processor and parallelizable kernels are offloaded to the Xeon Phi, a suitable programming model.
- the Xeon Phi software stack consists of a host portion and coprocessor portion.
- the host portion provides APIs for loading and launching device code, asynchronous execution, and data transfer between the host and the Xeon Phi.
- the coprocessor portion of the software stack consists of a modified Linux kernel, drivers and the standard Linux proc file system that can be used to query device state (for example, the load average).
- the coprocessor portion also has a SCIF driver to communicate over the PCI bus with the host and other nodes.
- the current Xeon Phi software stack is referred to as the Many Integrated Core (MIC) Platform Software Stack or MPSS for short.
- MIC Many Integrated Core
- the invention may be implemented in hardware, firmware or software, or a combination of the three.
- the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
- Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein.
- the inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
Abstract
A method is disclosed to manage a multi-processor system with one or more multiple-core coprocessors by intercepting coprocessor offload infrastructure application program interface (API) calls; scheduling user processes to run on one of the coprocessors; scheduling offloads within user processes to run on one of the coprocessors; and affinitizing offloads to predetermined cores within one of the coprocessors by selecting and allocating cores to an offload, and obtaining a thread-to-core mapping from a user.
Description
- This application is a non-provisional of and claims priority to provisional applications with Ser. No. 61/754,371 filed on Jan. 18, 2013 and Ser. Nos. 61/761,969 and 61/761,985 both filed on Feb. 7, 2013, the contents of which are incorporated by reference.
- The present application relates to multi-core processing.
- The Intel Xeon Phi is a recently introduced x86-based 60-core, 240-thread coprocessor that is increasingly being deployed in servers and clusters. It is easier to program than other manycore processors, and runs the Linux operating system. The operating system provides services such as virtual memory and context switching, and enables multiple processes to run concurrently and share the coprocessor. Multi-processing on the manycore is also necessary to fully utilize the hardware resources of the Xeon Phi.
- It is remarkably easy to offload processing to the Xeon Phi: it supports a popular ISA (x86), a popular OS (Linux), and a popular programming model (OpenMP). In order to use the Xeon Phi, a programmer employs pragmas to identify code regions to be “offloaded” to the coprocessor, for which the compiler automatically generates coprocessor instructions along with glue code to transfer data. This is referred to as the “offload programming model”, where the main trunk of the code runs on the host processor while regions identified by the pragmas are offloaded to run on the Xeon Phi coprocessor. Thus, if portions of a program are already parallelized using OpenMP, porting them to the Xeon Phi is easy. Unfortunately, quick and easy portability rarely results in an implementation that executes faster on the Xeon Phi. Rather, additional programmer effort such as carefully selecting the number of threads, mapping them to cores, ensuring no thread or memory oversubscription, and load balancing the application across multiple coprocessors is necessary.
- Such application tuning can work well for individual applications when they “own” the coprocessor, but not in a multi-processing environment. The Xeon Phi coprocessor runs Linux, which makes it easy for multiple processes to share the coprocessor. Such a use case of a manycore like the Xeon Phi where multiple processes compete for coprocessor resources is likely not only in cluster and cloud deployments, but also in servers since good hardware utilization is essential.
- In a multi-processing environment, multiple application processes compete for coprocessor resources, and one programmer is unaware of another programmer's intentions. Thus any programmer-driven steps taken to improve performance can in fact degrade it. In such an environment, processes must adhere to the following guidelines in order to avoid performance degradation and benefit from the manycore.
- Additional programmer effort, such as carefully selecting the number of threads, mapping them to cores, ensuring memory is not over-subscribed, and using multiple coprocessors well, is necessary to tune application performance. But programmer directives alone are insufficient: multi-processing on the manycore is required to improve hardware utilization. Linux makes it easy for processes to share the Xeon Phi, but in an environment where applications compete for manycore resources, any programmer effort intended to boost individual application performance can in fact end up doing the opposite.
- In a multi-processing environment, directives introduced by the programmer specifically to enhance manycore coprocessor performance can be counter-productive. For instance, programmers must select the number of threads, map them to cores, ensure memory is not over-subscribed, and manage their workload across multiple coprocessors. But these can degrade performance in a multi-processing environment when one programmer is unaware of other programmers' intentions.
- Various solutions are available to help programmers take advantage of the co-processor. Programming models such as SWARM provide an API to represent workloads as tasks, compile them to specific processors including the Xeon Phi, and use a runtime to manage the tasks on distributed heterogeneous nodes. Compilers such as CAPS can produce host and Xeon Phi code starting from OpenACC directives. Libraries with parallelized high performance numerical code for the Xeon Phi have been developed. Cluster management middleware can schedule workloads on the Xeon Phi. Virtualization approaches such as ScaleMP provide a hypervisor that can “virtualize” the Xeon Phi and host into a single entity visible to programmers. An operating system runtime (MPSS) is available on top of the Xeon Phi micro kernel (OS) to perform primitive scheduling of offloads.
- In one aspect, a method is disclosed to manage a multi-processor system with one or more multiple-core coprocessors by intercepting coprocessor offload infrastructure application program interface (API) calls; scheduling user processes to run on one of the coprocessors; scheduling offloads within user processes to run on one of the coprocessors; and affinitizing offloads to predetermined cores within one of the coprocessors by selecting and allocating cores to an offload, and obtaining a thread-to-core mapping from a user.
- In another aspect, a server populated with one or more multiple core Xeon Phi coprocessors includes a manager to control user processes containing offload blocks by intercepting COI API calls and schedules user processes to run on one of the Xeon Phi coprocessors; schedules offloads within user processes to run on one of the Xeon Phi coprocessors; and affinitizes offloads to specific cores within one of the Xeon Phi coprocessors by selecting and allocating cores to an offload, and obtaining the thread-to-core mapping from the user.
- Implementations of the above system can:
- Employ an aging-based first-fit algorithm for process and offload scheduling
- Use thread and memory over-scheduling factors to enhance performance
- Employ a greedy algorithm for core selection such that offloads from the same process get preference to use the same cores
- Use bitmaps to enhance the performance of the core selection algorithm.
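- For illustration only, a free-core bitmap of the kind mentioned above might be sketched as follows. This is a hypothetical Python sketch (the function names and the greedy lowest-bit policy are assumptions, not the patent's implementation); it shows how one integer bit per core lets core selection use cheap bitwise operations instead of scanning per-core records.

```python
# Hypothetical sketch: tracking free coprocessor cores as a bitmap.
# Bit i set means core i is free.

def make_bitmap(num_cores):
    """All cores free: bits 0..num_cores-1 set."""
    return (1 << num_cores) - 1

def claim_cores(free_bitmap, wanted):
    """Greedily claim `wanted` free cores; returns (new_bitmap, claimed_cores)."""
    claimed = []
    bm = free_bitmap
    while bm and len(claimed) < wanted:
        lowest = bm & -bm              # isolate the lowest set bit
        claimed.append(lowest.bit_length() - 1)
        bm &= ~lowest                  # clear it: that core is now busy
    if len(claimed) < wanted:
        return free_bitmap, []         # not enough free cores; leave bitmap unchanged
    return bm, claimed

def release_cores(free_bitmap, cores):
    """Mark the given cores free again when an offload completes."""
    for c in cores:
        free_bitmap |= 1 << c
    return free_bitmap
```

For example, with 8 cores the initial bitmap is 0b11111111; claiming 3 cores yields cores 0, 1 and 2 and leaves 0b11111000.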
- Advantages of the above system may include one or more of the following. The system provides a middleware on top of the Xeon Phi micro kernel and the Intel runtime. The middleware handles multi-processing on Xeon Phi coprocessor-based servers by automatically avoiding thread and memory oversubscription and load balancing processes across the cores of the Xeon Phi and across several Xeon Phi coprocessors. The system is completely transparent to the users and requires no changes to the underlying software such as the MPSS and the Linux kernel running on the coprocessor. It uses a scheduling technique to schedule processes and Xeon Phi offload regions within processes simultaneously. It also uses algorithms to set thread affinity and load balance processes across coprocessors.
- The system achieves faster operation when multiple processes share a many integrated core coprocessor system. Faster operation includes end-to-end turn-around-time per process (latency), as well as the number of processes completed per unit time (throughput).
- The system protects against thread and memory over-subscription resulting in severe performance loss and crashes. Within a coprocessor, it manages cores such that offloads of different processes run on separate sets of cores, and offloads in the same process use the same cores (thus respecting data affinity). The system balances the load of multiple processes across multiple Xeon Phi coprocessors. The manager provides a transparent user-level middleware that includes a suite of run-time techniques explicitly designed to enhance performance portability in the presence of multi-processing.
-
FIG. 1 shows an exemplary process manager for a multiprocessor system.
FIG. 2 shows an exemplary software stack with host and co-processor components of the multi-processor software stack.
FIG. 3 shows an exemplary flow of the process manager of FIG. 1.
FIG. 4 shows an exemplary architecture of the process manager of FIG. 1.
FIG. 5 shows an exemplary scheduling procedure for the system of FIG. 1.
FIG. 6 shows an exemplary method for an aging-based first-fit procedure for process selection.
FIG. 1 shows a high-level view of a process manager 20 called the COSMIC system. In a server with multiple Xeon Phi coprocessors 30, 32 and 34, each with different amounts of memory and cores, COSMIC manages offloads from several user processes. Every process requests the process manager 20 for memory, and every offload requests the process manager 20 for threads. COSMIC arbitrates the requests by taking into consideration the different available coprocessors, the available cores within each device and the available memory. It then schedules and allocates resources for the offloads in such a way that thread and memory oversubscription are avoided, and the devices as well as the cores within them are load balanced. COSMIC manages processes and coprocessor resources in order to:
- Avoid over-subscribing coprocessor hardware threads.
- Avoid over-subscribing and carefully manage the limited coprocessor main memory.
- Map threads to cores, ensuring minimal thread migration, while respecting data affinity and persistence across offload regions.
- Load balance applications across multiple coprocessors while ensuring locality of data.
Given a server populated with multiple integrated core coprocessors such as the Xeon Phi coprocessors, with several users and processes competing for coprocessor resources, the goals of the process manager 20 are to manage processes and coprocessor resources and: - Avoid over-subscribing coprocessor hardware threads.
- Avoid over-subscribing and carefully manage the limited coprocessor main memory.
- Map threads to cores, ensuring minimal thread migration, while respecting data affinity and persistence across offload regions.
- Load balance applications across multiple coprocessors while ensuring locality of data.
- To simplify memory management, one implementation requests that the programmer specify the maximum memory required on the Xeon Phi for each process. This is similar to job submission requirements in cluster schedulers. In typical cases, different offloads of the same process often share data in order to reduce data movement between the host and the Xeon Phi. Thus, as long as the process exists, it will use memory on the card. However, unlike cluster schedulers, this embodiment does not require that the process specify cores, devices or other resources; it infers them automatically from the number of threads requested by the offload. Unlike memory, which is reserved for the life of a process, threads (and cores) are given to an offload when it starts executing and released when the offload completes, for use by other offloads.
- Before execution, every process requests COSMIC for memory, and every offload requests COSMIC for threads. COSMIC arbitrates the requests by taking into consideration the different available coprocessors, the available cores within each device and the available memory. It then schedules and allocates resources for the offloads in such a way that thread and memory oversubscription are avoided, and the devices as well as the cores within them are load balanced.
- COSMIC has several parameters that may be set by the server administrator or user that can affect its policies and behavior. An administrator can configure the following parameters of COSMIC to affect its scheduling decisions:
-
- Aging threshold: how many times the scheduler attempts a process or offload before progress is blocked.
- Thread factor To (1 or larger): the offload “fit” function considers To*hardware threads as the total number of threads in a coprocessor. If To is greater than 1, the number of threads is oversubscribed in a measured way to leverage the fact that slight oversubscriptions may actually be beneficial since otherwise offloads may have to wait longer.
- Memory factor Mo (1 or larger): the process “fit” function considers Mo*physical memory as the total physical memory of the coprocessor. If Mo is greater than 1, the memory is oversubscribed in a measured way to leverage the fact that not all processes will require their maximum requested memory at the same time.
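- The “fit” checks with the thread and memory over-scheduling factors described above can be sketched as follows. This is a hypothetical Python illustration (the function and parameter names are assumptions); it shows how factors above 1 deliberately oversubscribe threads or memory by a measured amount.

```python
# Hypothetical sketch of the offload and process "fit" functions.
# to_factor (To) and mo_factor (Mo) are 1 or larger.

def offload_fits(requested_threads, busy_threads, hw_threads, to_factor=1.0):
    """An offload fits if its threads fit within To * hardware threads."""
    return busy_threads + requested_threads <= to_factor * hw_threads

def process_fits(requested_mem, used_mem, phys_mem, mo_factor=1.0):
    """A process fits if its memory fits within Mo * physical memory."""
    return used_mem + requested_mem <= mo_factor * phys_mem
```

For example, on a 240-hardware-thread device with 200 threads busy, a 60-thread offload does not fit at To = 1 but fits at To = 1.1 (260 ≤ 264), trading slight oversubscription for less waiting.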
COSMIC also expects from the owner of each process the following directives: - Memory limit: the peak memory the process will use over its lifetime. COSMIC kills any process that exceeds this limit, as described later in this section.
- Preferred thread affinity: In order to allocate Xeon Phi cores for an offload, COSMIC needs to know how user threads must be mapped to cores. A SCATTER mapping indicates 1 thread per core, a COMPACT mapping 4 threads per core and BALANCED 3 threads per core.
-
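- The affinity directive above determines how many cores an offload needs, which can be sketched as follows. This is a hypothetical Python illustration (the names are assumptions); the threads-per-core ratios are those stated for the SCATTER, BALANCED and COMPACT mappings.

```python
import math

# Hypothetical sketch: mapping the user's preferred thread affinity to a
# threads-per-core ratio, then deriving how many cores an offload needs.

THREADS_PER_CORE = {
    "SCATTER": 1,   # 1 thread per core
    "BALANCED": 3,  # 3 threads per core
    "COMPACT": 4,   # 4 threads per core (all hardware threads of a core)
}

def cores_needed(num_threads, affinity):
    """Number of physical cores to allocate for an offload."""
    tpc = THREADS_PER_CORE[affinity]
    return math.ceil(num_threads / tpc)
```

For example, a 120-thread offload needs 120 cores under SCATTER but only 30 under COMPACT.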
FIG. 2 shows a block diagram of the Xeon Phi software stack, and where COSMIC fits in. The left half of the figure shows the stack on the host processor, while the right half shows the stack running on the coprocessor. The top half represents user space, and the bottom half represents kernel space. The host processor runs a Linux kernel 122 with a PCI and card driver 124-126 to communicate with the card. Along with the operating system, a Symmetric Communication Interface (SCIF) driver 120 is provided for inter-node communications. A node can be a Xeon Phi device or the host processor. SCIF 120 provides a set of APIs for communication and abstracts the details of communicating over the PCIe bus. On top of SCIF 120, the Coprocessor Offload Infrastructure (COI) 112 is a higher-level framework providing a set of APIs to simplify development of applications using the offload model. COI provides APIs for loading and launching device code, asynchronous execution and data transfer between the host and Xeon Phi. The coprocessor portion of the software stack consists of a modified Linux kernel 156, the PCI driver 154 and the standard Linux proc file system 152 that can be used to query device state (for example, the load average). The coprocessor portion also has a SCIF driver 158 for communicating over the PCI bus with the host and other nodes. The COI 112 communicates with a COSMIC host component 110 that communicates with user processes 100-104. The host component 110 interacts with a COSMIC coprocessor component 160 that handles offloaded portions of user processes 162. - The COSMIC host middleware component has a global view of all processes and offloads emanating from the host, and knowledge of the states of all coprocessor devices. COSMIC is architected to be lightweight and completely transparent to users of the Xeon Phi system. As shown in
FIG. 2 , COSMIC exists in the user space, but interacts with both user processes and other kernel-level components. It controls offload scheduling and dispatch by intercepting COI API calls that are used to communicate with the device. This is a key mechanism of COSMIC that enables it to transparently gain control of how offloads are managed. We first briefly describe by way of example how offload blocks are expressed using the COI API. - The Xeon Phi compiler converts all offload blocks that are marked by pragmas into COI calls. The user's program with offload pragmas is compiled using Intel's icc or a gcc cross-compiler for the Xeon Phi. The compiler produces a host binary, and Xeon Phi binaries for all the offload portions. The offload portions are first translated into a series of COI API calls. The figure shows the important calls for a simple example: first COIEngineGetCount and COIEngineGetHandle get a handle to the coprocessor specified in the pragma. Then COIProcessCreateFromFile creates a process from the binary corresponding to the offload portions. Each offload block is represented as a function, and COIProcessGetFunctionHandles acquires the handles to these functions. COIPipelineCreate creates a “COI pipeline” which consists of 3 stages: one to send data to the coprocessor, one to perform the computation and one to get data back from the coprocessor. Then COIBufferCreate creates buffers necessary for inputs and outputs to the offload. In this example, three COI buffers corresponding to the arrays a, b and c are created. COIBufferCopy transfers data to the coprocessor, and COIPipelineRunFunction executes the function corresponding to the offload block. Finally, another COIBufferCopy gets results (i.e., array c) back from the Xeon Phi.
- COSMIC is architected as two components implemented as separate processes: the COSMIC client and the COSMIC server. The COSMIC client is the front-end, while the COSMIC server is the back-end consisting of the scheduler and the monitor. The monitor comprises a host portion and a card-side portion, as depicted in
FIG. 2 . Inter-process interfaces are clearly defined: each process communicates with the other two using explicit messages. - The COSMIC client is responsible for intercepting COI calls and communicating with the scheduler in the COSMIC server to request access to a coprocessor. It accomplishes this using library interposition. Every user process links with the Intel COI shared library that contains definitions for all API function modules. COSMIC intercepts and redefines every COI API function: the redefined COI functions perform COSMIC-specific tasks such as communicating with the COSMIC scheduler, and then finally calls the actual COI function. With the redefined functions, COSMIC creates its own shared library that is pre-loaded to the application (using either LD_PRELOAD or redefining LD_LIBRARY_PATH). The pre-loading ensures that COSMIC's library is first used to resolve any COI API function.
- Based on the type of COI API intercepted, the client sends the following different messages to the scheduler in the COSMIC server:
- NewProcess: When an offload is first encountered for a process, the client sends a NewProcess message to the scheduler indicating that the scheduler should account for a new process in its book-keeping. Every new process is annotated with its memory requirement provided by the user.
- NewOffload: For every offload, the client sends a NewOffload message to the scheduler indicating the process to which the offload belongs and the number of threads it is requesting. It also indicates the size of the buffers that need to be transferred to the coprocessor for this offload.
- OffloadComplete: When an offload completes, the client sends an OffloadComplete message to the scheduler so that it can account for the newly freed resources such as coprocessor cores and threads.
- ProcessComplete: When a process completes, the client sends a ProcessComplete message to the scheduler to account for the freed memory used by the process.
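- The four client-to-scheduler messages above can be sketched as records, for example as Python dataclasses. The field names are hypothetical; only the message kinds and their payloads (memory requirement, thread count, buffer sizes) come from the description above.

```python
# Hypothetical sketch of the client-to-scheduler message types.
from dataclasses import dataclass

@dataclass
class NewProcess:
    pid: int
    memory_limit: int      # user-provided peak memory requirement

@dataclass
class NewOffload:
    pid: int               # owning process
    threads: int           # threads requested by the offload
    buffer_bytes: int      # size of buffers to transfer to the coprocessor

@dataclass
class OffloadComplete:
    pid: int               # scheduler frees the offload's cores/threads

@dataclass
class ProcessComplete:
    pid: int               # scheduler frees the process's reserved memory
```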
- The COSMIC scheduler is the key actor in the COSMIC system and manages multiple user processes with offloads and several coprocessor devices by arbitrating access to coprocessor resources. It runs completely on the host and has global visibility into every coprocessor in the system. In scheduling offloads and allocating resources, it ensures no thread and memory oversubscription and load balances coprocessor cores and devices to most efficiently use them.
- A key distinction between the COSMIC scheduler and traditional operating system schedulers is that COSMIC concurrently schedules processes and offloads within the processes. Each process has a memory requirement, while each offload has a thread requirement. Various coprocessors in the system may have different memory and thread availabilities.
- Under these constraints, the goal of the scheduler is to schedule processes and offloads by mapping processes to Xeon Phi coprocessors and offloads to specific cores on the coprocessors. The scheduler also ensures fairness, i.e., makes sure all processes and offloads eventually get access to coprocessor resources.
- The scheduler is event-based, i.e., a scheduling cycle is triggered by a new event. A new event can be the arrival of a new process, the arrival of a new offload in an existing process, the dispatching of an offload to a Xeon Phi device, the completion of an offload or the completion of a process. A queue of pending processes is maintained: each arriving new process is added to the tail of the pending process queue. A process is eventually scheduled to one Xeon Phi coprocessor. The scheduler also maintains a queue of pending offloads for each Xeon Phi coprocessor in the system. Each new offload is added to the tail of the offload queue belonging to the Xeon Phi coprocessor on which its process has been scheduled.
-
FIG. 3 shows the workflow through the Xeon Phi software stack when multiple processes are issued. They are all fed to the Xeon Phi MPSS runtime, which often serializes them in order to avoid crashing the coprocessor. The manager 20 avoids this by intercepting COI calls at 202, and the manager 20 takes control of the processes and offloads. Specifically, in 210 the manager 20 performs process scheduling, offload scheduling and affinitizes offloads to specific cores on the co-processor. Once this is done, it issues the processes and offloads to the MPSS at 204 and continues with the Linux operating system 206. -
FIG. 4 shows an exemplary architecture of the process manager of FIG. 1. COSMIC is architected as two components implemented as separate processes: clients 310-312 that communicate with a library 316, scheduler 320 and monitor 326, the latter comprising a host portion and a card-side portion. Inter-process interfaces are clearly defined: each process communicates with the other two using explicit messages. -
FIG. 5 shows an exemplary scheduling procedure. When a new event occurs, a pending process is selected and scheduled to a coprocessor that has enough free memory. Then offload queues corresponding to each Xeon Phi are examined, and the scheduler dispatches an offload to each coprocessor if it has enough free threads. Both processes and offloads are selected based on an aging-based first-fit heuristic.
- At the start of a scheduling cycle, let P be the process at the head of the pending process queue (402). The scheduler maintains a circular list of the Xeon Phi coprocessors in the system. Let D be the next coprocessor in the list (404). The scheduler checks to see if the memory required by P fits in the available memory of D (406). If it does, P is removed from the queue and dispatched to D (408). If not, the next coprocessor in the circular list is examined (410). If P does not fit in any coprocessor, its age is incremented, and the next pending process is examined (412). When a process' age reaches a threshold, all scheduling is blocked until that process is scheduled (414). This ensures fairness since all processes will get a chance at being scheduled.
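- One scheduling cycle of this aging-based first-fit procedure can be sketched as follows. This is a hypothetical Python illustration (record layouts, names, and the default threshold are assumptions); it walks the pending queue, places each process on the first coprocessor with enough free memory, ages processes that fit nowhere, and blocks further scheduling once a process ages out.

```python
from collections import deque

def schedule_pass(pending, devices, aging_threshold=3):
    """One event-triggered scheduling cycle (hypothetical sketch).

    pending: deque of {"name", "mem", "age"} process records.
    devices: list of {"name", "free_mem"} coprocessor records.
    Returns the (process, device) dispatches made this cycle.
    """
    dispatched = []
    tries = len(pending)
    while pending and tries > 0:
        tries -= 1
        proc = pending[0]
        # First fit: take the first coprocessor with enough free memory.
        dev = next((d for d in devices if proc["mem"] <= d["free_mem"]), None)
        if dev is not None:
            dev["free_mem"] -= proc["mem"]   # memory held for the process's lifetime
            dispatched.append((proc["name"], dev["name"]))
            pending.popleft()
        else:
            proc["age"] += 1
            if proc["age"] >= aging_threshold:
                break                        # aged out: block all further scheduling
            pending.rotate(-1)               # examine the next pending process
    return dispatched
```

For example, with devices offering 8 and 4 units of free memory and pending processes needing 6, 5 and 3 units, the first and third processes are dispatched while the second ages by one and stays queued.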
- Scheduling an offload is similar to scheduling a process, with one difference. Instead of memory, an offload has a thread requirement; COSMIC checks if the threads requested by an offload are available on the coprocessor on which the offload's owner process has been scheduled. If so, the offload is dispatched. If not, it increments the offload's age, and examines the next offload in the queue.
- An administrator can specify the following parameters to tailor the scheduler's behavior: (i) aging threshold, (ii) thread over-scheduling factor and (iii) memory over-scheduling factor. The latter two indicate to what extent threads and memory may be oversubscribed.
- The COSMIC monitor collects data pertaining to the state of the coprocessors. It has a host-side component, and a component that is resident on each of the coprocessors. The host-side component is primarily responsible for communicating with the scheduler and all the coprocessor-side components. The coprocessor-side components monitor the load on the device, the number of threads requested by each offload and the health of each offload process. If a process dies for any reason, it catches it and reports the reason to the COSMIC scheduler.
- Next, the affinity setting is discussed. COSMIC selects the cores that are used by an offload, and affinitizes threads to these cores using programmer directives. The core selection procedure for an offload is discussed next.
- To ensure efficient execution of offloads, COSMIC adopts several policies to make sure the hardware resources of a Xeon Phi processor are well utilized but not oversubscribed. These policies and their implementations are discussed below.
- COSMIC limits the total number of actively running software threads on a Xeon Phi device to ensure that the device's physical cores are not oversubscribed. When an offload region is running, all of the threads spawned by the offload process are considered active. Otherwise the threads are considered dormant. COSMIC keeps track of the number of active and inactive software threads spawned by offload processes on a Xeon Phi device. It maintains the ratio between the total number of active software threads spawned by all offload processes and the physical cores to be no more than an integer N, which is configured as 4 in our current implementation of COSMIC. Therefore COSMIC only schedules an offload region to run if the thread-to-core ratio will not exceed N after the offload region starts.
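- The admission check implied above is simple arithmetic and can be sketched as follows (a hypothetical Python illustration; the function name is an assumption, while N = 4 is the value stated for the described implementation).

```python
# Hypothetical sketch: admit an offload only if the active-thread-to-core
# ratio stays at or below N after the offload starts.

N = 4  # threads per physical core, as described for the current implementation

def can_run(active_threads, offload_threads, num_cores, ratio=N):
    """True if starting the offload keeps active threads <= ratio * cores."""
    return active_threads + offload_threads <= ratio * num_cores
```

For example, on a 60-core device (capacity 240 active threads) with 200 threads active, a 40-thread offload is admitted but a 41-thread offload must wait.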
- COSMIC uses several mechanisms to detect the number of software threads created by an offload process on a Xeon Phi device. First, COSMIC inspects the environment variable MIC_OMP_NUM_THREADS of a submitted job. COSMIC also intercepts omp_set_num_threads_target( ) function calls on the host. Finally, on a Xeon Phi device, an offload process's call to omp_set_num_threads( ) is also intercepted and the number of threads is reported back to the host COSMIC process.
- COSMIC relies on the OMP library to get the number of physical cores on a Xeon Phi device. During COSMIC's initialization, the monitor (running on each Xeon Phi device) queries the maximum number of threads by calling omp_get_max_threads( ) and communicates the returned value to COSMIC's host process. COSMIC then divides this number by 4 to obtain the number of processing cores on a Xeon Phi device. Notice that the number of physical cores derived in this approach is generally one less than the real number of physical cores. We believe this is because one core is reserved for the OS, and thus we do not adjust the derived number.
- To further improve the efficiency of offload executions, COSMIC creates physical-core containers and sets thread affinity automatically. The goal is to avoid threads migrating from one core to another so data in the registers and the private L1 cache can be reused, and to have multiple concurrent offload processes so the overall system utilization remains high.
- COSMIC assigns different, non-overlapping sets of physical cores to execute concurrent offload regions. For an offload region the selected cores constitute a physical-core container that the offload can use exclusively. The physical-core containers minimize the interference between multiple concurrent offloads on the same Xeon Phi device due to resource contention. The physical-core container expires after the execution of its offload region completes, and the assigned cores are released for use by other offloads. Notice that a physical core on a Xeon Phi device can be assigned to multiple offload processes but used by at most one active offload region at any given point in time. A free core is a core not currently used by any offload region (but it may be assigned to one or more offload processes).
- COSMIC's core selection algorithm scans one or more lists of free physical cores to select cores until it finds enough cores for a given offload region. The number of cores assigned to an offload region is the number of threads used by the offload region divided by the thread-to-core ratio N. The order of the core lists from which COSMIC selects cores reflects the preference of COSMIC's core selection strategy. The first list of physical cores for a new offload region consists of cores that are both free and only used by the earlier offloads coming from the same offload process. If more physical cores are needed, COSMIC picks from a second list of physical cores, which are both free and not currently assigned to other offload processes. If still more cores are needed, COSMIC forms a third list of physical cores, which are the remaining free cores not yet selected.
- The core selection algorithm uses simple linear arrays to track the status of the physical cores so that it can efficiently construct the various lists of physical cores to choose from. For each Xeon Phi device the algorithm maintains two arrays, called F and G. Entry i of an array stores the status of physical core i: array F indicates which cores are free, and array G records the number of offload processes that a core is assigned to. The algorithm also maintains, for each active offload process, an array called P listing the cores assigned to the process's latest offload region (P[i] is 1 if and only if core i is assigned).
- From the status arrays the algorithm constructs the various lists of cores to select from for an offload region. The first list of cores to be considered (the most preferred) consists of every core i such that F[i]=1, G[i]=1, and the offload process's P[i]=1. If not enough cores can be found from the first list, the algorithm creates a second list of cores, where a core i on the list satisfies F[i]=1 and G[i]=0 (P[i]=0 is implied by G[i]=0). Finally, the third list of cores (the least preferred) consists of any remaining free core i with F[i]=1 (that is, a core also assigned to one or more other offload processes).
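As a minimal sketch (assuming the status arrays F, G, and P are plain integer arrays, per the description above; COSMIC's actual data structures are not published in this form), the three-level preference scan might look like:

```python
def select_cores(needed, F, G, P):
    """Pick `needed` cores for an offload region, preferring:
    1. free cores already assigned (only) to this process (F=1, G=1, P=1),
    2. free cores assigned to no process (F=1, G=0),
    3. any remaining free cores, shared with other processes."""
    first = [i for i in range(len(F)) if F[i] and G[i] == 1 and P[i]]
    second = [i for i in range(len(F)) if F[i] and G[i] == 0]
    third = [i for i in range(len(F))
             if F[i] and i not in first and i not in second]
    chosen = []
    for preference_list in (first, second, third):
        for i in preference_list:
            if len(chosen) == needed:
                break
            chosen.append(i)
    return chosen if len(chosen) == needed else None  # not enough free cores
```

Here `needed` would be the offload's thread count divided by the thread-to-core ratio N, as described above.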
- These status arrays are updated when an offload region starts or completes, and when an offload process finishes. The initial values of the entries in F and G are 1 and 0, respectively. The update rules are as follows:
- 1. If core i is selected for an offload region, F[i] is set to 0 after the offload region starts, and to 1 after the offload region ends.
- 2. After an offload region starts, G[i] is incremented by 1 if core i is selected and P[i] is 0, or decremented by 1 if core i is not selected and P[i] is 1. In any other case G[i] is unchanged. Following the adjustment of values in G, the values in P are adjusted: P[i]=1 if and only if core i is selected.
- 3. When an offload process finishes, G[i] is decremented by 1 for each core i with P[i] equal to 1; otherwise G[i] is unchanged.
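The three rules above can be sketched as straightforward array updates (hypothetical helpers, using the array names F, G, and P from the description):

```python
def offload_region_starts(selected, F, G, P):
    """Rules 1 and 2: mark selected cores busy, then shift this process's
    assignment set (P) onto the newly selected cores, keeping the
    per-core assignment counts (G) consistent."""
    for i in range(len(F)):
        if i in selected:
            F[i] = 0                 # rule 1: core is now in use
            if P[i] == 0:
                G[i] += 1            # newly assigned to this process
        elif P[i] == 1:
            G[i] -= 1                # no longer assigned to this process
        P[i] = 1 if i in selected else 0

def offload_region_ends(selected, F):
    """Rule 1: free the cores when the offload region completes."""
    for i in selected:
        F[i] = 1

def offload_process_finishes(G, P):
    """Rule 3: release the finished process's core assignments."""
    for i in range(len(G)):
        if P[i] == 1:
            G[i] -= 1
```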
- COSMIC uses the thread affinity API of the OpenMP library to keep the software threads of an offload region running on the selected cores. This thread-to-core binding is performed before or at the beginning of every offload region, after the number of software threads is detected and the cores are selected. Since the binding must be performed inside the offload process being targeted, COSMIC preloads a special function into every offload process and calls that function whenever binding is required. The binding is similar to the “compact” option provided by OpenMP: each core is assigned N=4 threads.
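The compact-style mapping with N=4 threads per core can be sketched as follows (a hypothetical helper; COSMIC itself performs the actual binding through the OpenMP thread affinity API inside the preloaded function):

```python
N = 4  # thread-to-core ratio

def compact_binding(num_threads, cores):
    """Map software threads to the selected cores in 'compact' order:
    threads 0..N-1 go to the first core, N..2N-1 to the second, etc."""
    return {t: cores[t // N] for t in range(num_threads)}

# Inside the preloaded binding function, each thread would then pin
# itself to its assigned core (e.g. via os.sched_setaffinity on Linux).
```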
- Oversubscribing Xeon Phi memory can lead to two undesirable results: application crashes or excessive performance degradation due to memory swapping. Therefore COSMIC ensures that the overall memory requirement of the offload processes running on a single Xeon Phi device does not exceed the amount of the device's physical memory.
- To avoid memory oversubscription COSMIC keeps track of the amount of available physical memory on each Xeon Phi device. When COSMIC starts running, it queries each MIC device for the amount of free physical memory using a COI function (COIEngineGetInfo). When a user submits a job to COSMIC, the user informs COSMIC of the total amount of memory the process needs on a Xeon Phi device through the environment variable COSMIC_MEMORY. COSMIC launches a submitted job only if there is a Xeon Phi device with enough free memory to meet the job's memory requirement.
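A sketch of this admission check (the free-memory map here stands in for the per-device values COSMIC obtains via COIEngineGetInfo; the requirement itself would come from the job's COSMIC_MEMORY environment variable):

```python
def admit_job(job_mem_bytes, free_mem_by_device):
    """Launch a job only if some Xeon Phi device has enough free physical
    memory; return the chosen device id, or None to keep the job queued."""
    for dev, free in free_mem_by_device.items():
        if free >= job_mem_bytes:
            free_mem_by_device[dev] = free - job_mem_bytes  # reserve it
            return dev
    return None

# The requirement is user-supplied, e.g.:
#   job_mem = int(os.environ["COSMIC_MEMORY"])
```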
- COSMIC can be optionally configured to terminate any running process that uses more Xeon Phi memory than the amount specified by the user. COSMIC relies on Linux's memory resource controller to set up a memory container for each offload process on a Xeon Phi device. Each container limits the real committed memory usage of the offload process to the user-specified maximum value. If a process's memory footprint goes over the limit, the memory resource controller invokes Linux's out-of-memory killer (oom-killer) to terminate the offending process.
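Setting up such a container with the Linux cgroup v1 memory controller can be sketched as below (the mount path and container name are illustrative; COSMIC's actual setup code is not shown in the source):

```python
import os

def make_memory_container(name, limit_bytes, pid,
                          root="/sys/fs/cgroup/memory"):
    """Create a per-offload-process memory cgroup, cap its committed
    memory at limit_bytes, and move the process into it; exceeding the
    cap triggers the kernel's oom-killer for that process."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))  # add the offload process to the container
```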
- Enforcing this maximum memory usage rule requires an extra installation procedure and incurs minor runtime performance overhead. The memory resource controller is not enabled in the default Xeon Phi OS kernel; installing a kernel with the controller requires adding one line to the kernel configuration file, recompiling the kernel, and rebooting the Xeon Phi cards with the new kernel image. The runtime overhead of using the Linux memory controller ranges from negligible to about 5% in real applications.
- In a many integrated cores (MIC) coprocessor, the cores, PCIe interface logic, and GDDR5 memory controllers are connected via an Interprocessor Network (IPN) ring, which can be thought of as an independent bidirectional ring. The L2 caches are shown as slices per core, but can also be thought of as a fully coherent cache with a total size equal to the sum of the slices. Information can be copied to each core that uses it to provide the fastest possible local access, or a single copy can be present for all cores to provide maximum cache capacity. In one embodiment, the coprocessor is the Intel® Xeon Phi™ coprocessor, which can support up to 61 cores (making a 31 MB L2 cache) and 8 memory controllers with 2 GDDR5 channels each. Communication around the ring follows a Shortest Distance Algorithm (SDA). Co-resident with each core structure is a portion of a distributed tag directory. These tags are hashed to distribute workloads across the enabled cores. Physical addresses are also hashed to distribute memory accesses across the memory controllers. Each Xeon Phi core is dual-issue in-order and includes 16 32-bit vector lanes. The performance of each core on sequential code is considerably slower than that of its multi-core counterpart. However, each core supports 4 hardware threads, resulting in good aggregate performance for highly parallelized and vectorized kernels. This makes the offload model, where sequential code runs on the host processor and parallelizable kernels are offloaded to the Xeon Phi, a suitable programming model. The Xeon Phi software stack consists of a host portion and a coprocessor portion. The host portion supports asynchronous execution and data transfer between the host and the Xeon Phi. The coprocessor portion of the software stack consists of a modified Linux kernel, drivers, and the standard Linux proc file system that can be used to query device state (for example, the load average).
The coprocessor portion also has a SCIF driver to communicate over the PCI bus with the host and other nodes. Together the current Xeon Phi software stack is referred to as the Many Integrated Core (MIC) Platform Software Stack or MPSS for short.
- The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
- Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
- The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
Claims (20)
1. A multi-processor system, comprising:
a computer populated with one or more multiple-core coprocessors; and
a management code to control user processes containing offload blocks, including code to:
intercept coprocessor offload infrastructure application program interface (API) calls;
schedule user processes to run on one of the coprocessors;
schedule offloads within user processes to run on one of the coprocessors;
select one or more cores and create a core container consisting of the selected cores on a coprocessor to be used by an offload; and
obtain an optional thread-to-core mapping from a user and bind threads of an offload to the selected cores.
2. The system of claim 1 , comprising one or more procedures, each of which selects a process and offloads the process for scheduling.
3. The system of claim 1 , comprising code with thread over-scheduling factors to enhance performance.
4. The system of claim 1 , comprising code with memory over-scheduling factors to enhance performance.
5. The system of claim 1 , comprising code to select cores to be assigned to an offload.
6. The system of claim 1 , comprising code to apply instances of data structures from which a status of cores and information about selected cores are used to enhance core selection.
7. The system of claim 1 , comprising code to communicate using messages between components of the management code.
8. The system of claim 1 , comprising a management code client module to control each user process and monitor processes from the host side.
9. The system of claim 1 , wherein each coprocessor comprises a plurality of cores running an operating system, with one or more buses connecting one or more host processors and each coprocessor.
10. The system of claim 1 , wherein the management code:
selects and schedules a pending process to a coprocessor with free memory; and
examines offload queues corresponding to each coprocessor, and dispatches an offload to each coprocessor with free threads,
wherein processes and offloads are selected based on an aging-based first-fit heuristic.
11. A method to manage a multi-processor system with one or more multiple-core coprocessors, comprising:
intercepting coprocessor offload infrastructure application program interface (API) calls;
scheduling user processes to run on one of the coprocessors;
scheduling offloads within user processes to run on one of the coprocessors; and
affinitizing offloads to predetermined cores within one of the coprocessors by selecting and allocating cores to an offload, and obtaining a thread-to-core mapping from a user.
12. The method of claim 11 , comprising applying an aging-based first-fit procedure for process and offload scheduling.
13. The method of claim 11 , comprising applying thread or memory over-scheduling factors to enhance performance.
14. The method of claim 11 , comprising applying greedy core selection such that offloads from the same process get preference to use the same cores.
15. The method of claim 11 , comprising applying bitmaps to enhance core selection.
16. The method of claim 11 , comprising messaging between components of the management code.
17. The method of claim 11 , comprising providing a management code client module to control each user process.
18. The method of claim 11 , comprising monitoring processes from the host-side.
19. The method of claim 11 , wherein each coprocessor comprises a plurality of x86 cores running Linux, a Peripheral Component Interconnect Express (PCIe) interface, and memory controllers connected by a bidirectional Interprocessor Network (IPN) ring.
20. The method of claim 11 , further comprising:
selecting and scheduling a pending process to a coprocessor with free memory; and
examining offload queues corresponding to each coprocessor, and dispatching an offload to each coprocessor with free threads,
wherein processes and offloads are selected based on an aging-based first-fit heuristic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/858,034 US20140208072A1 (en) | 2013-01-18 | 2013-04-06 | User-level manager to handle multi-processing on many-core coprocessor-based systems |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361754371P | 2013-01-18 | 2013-01-18 | |
US201361761969P | 2013-02-07 | 2013-02-07 | |
US201361761985P | 2013-02-07 | 2013-02-07 | |
US13/858,034 US20140208072A1 (en) | 2013-01-18 | 2013-04-06 | User-level manager to handle multi-processing on many-core coprocessor-based systems |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140208072A1 true US20140208072A1 (en) | 2014-07-24 |
Family
ID=51208692
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/858,036 Active 2033-12-31 US9086925B2 (en) | 2013-01-18 | 2013-04-06 | Methods of processing core selection for applications on manycore processors |
US13/858,039 Active 2034-01-10 US9152467B2 (en) | 2013-01-18 | 2013-04-06 | Method for simultaneous scheduling of processes and offloading computation on many-core coprocessors |
US13/858,034 Abandoned US20140208072A1 (en) | 2013-01-18 | 2013-04-06 | User-level manager to handle multi-processing on many-core coprocessor-based systems |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/858,036 Active 2033-12-31 US9086925B2 (en) | 2013-01-18 | 2013-04-06 | Methods of processing core selection for applications on manycore processors |
US13/858,039 Active 2034-01-10 US9152467B2 (en) | 2013-01-18 | 2013-04-06 | Method for simultaneous scheduling of processes and offloading computation on many-core coprocessors |
Country Status (1)
Country | Link |
---|---|
US (3) | US9086925B2 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5191652A (en) * | 1989-11-10 | 1993-03-02 | International Business Machines Corporation | Method and apparatus for exploiting communications bandwidth as for providing shared memory |
US5826081A (en) * | 1996-05-06 | 1998-10-20 | Sun Microsystems, Inc. | Real time thread dispatcher for multiprocessor applications |
US20080126486A1 (en) * | 2006-09-15 | 2008-05-29 | Bea Systems, Inc. | Personal messaging application programming interface for integrating an application with groupware systems |
US20120158967A1 (en) * | 2010-12-21 | 2012-06-21 | Sedayao Jeffrey C | Virtual core abstraction for cloud computing |
Family Cites Families (67)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5872972A (en) * | 1996-07-05 | 1999-02-16 | Ncr Corporation | Method for load balancing a per processor affinity scheduler wherein processes are strictly affinitized to processors and the migration of a process from an affinitized processor to another available processor is limited |
US6385638B1 (en) * | 1997-09-04 | 2002-05-07 | Equator Technologies, Inc. | Processor resource distributor and method |
US7149795B2 (en) * | 2000-09-18 | 2006-12-12 | Converged Access, Inc. | Distributed quality-of-service system |
US7007156B2 (en) * | 2000-12-28 | 2006-02-28 | Intel Corporation | Multiple coprocessor architecture to process a plurality of subtasks in parallel |
US6839808B2 (en) * | 2001-07-06 | 2005-01-04 | Juniper Networks, Inc. | Processing cluster having multiple compute engines and shared tier one caches |
US8024395B1 (en) * | 2001-09-04 | 2011-09-20 | Gary Odom | Distributed processing multiple tier task allocation |
US8005978B1 (en) * | 2002-03-01 | 2011-08-23 | Cisco Technology, Inc. | Method to optimize the load balancing of parallel coprocessors |
US7143412B2 (en) * | 2002-07-25 | 2006-11-28 | Hewlett-Packard Development Company, L.P. | Method and apparatus for optimizing performance in a multi-processing system |
US7287254B2 (en) * | 2002-07-30 | 2007-10-23 | Unisys Corporation | Affinitizing threads in a multiprocessor system |
US7389506B1 (en) * | 2002-07-30 | 2008-06-17 | Unisys Corporation | Selecting processor configuration based on thread usage in a multiprocessor system |
AU2003300948A1 (en) * | 2002-12-16 | 2004-07-22 | Globespanvirata Incorporated | System and method for scheduling thread execution |
JP4597488B2 (en) * | 2003-03-31 | 2010-12-15 | 株式会社日立製作所 | Program placement method, execution system thereof, and processing program thereof |
FR2854263A1 (en) * | 2003-04-24 | 2004-10-29 | St Microelectronics Sa | METHOD FOR PERFORMING COMPETITIVE TASKS BY A SUBSYSTEM MANAGED BY A CENTRAL PROCESSOR |
JP4028444B2 (en) * | 2003-06-27 | 2007-12-26 | 株式会社東芝 | Scheduling method and real-time processing system |
US7369500B1 (en) * | 2003-06-30 | 2008-05-06 | Juniper Networks, Inc. | Dynamic queue threshold extensions to random early detection |
US7239581B2 (en) * | 2004-08-24 | 2007-07-03 | Symantec Operating Corporation | Systems and methods for synchronizing the internal clocks of a plurality of processor modules |
US20050108713A1 (en) * | 2003-11-18 | 2005-05-19 | Geye Scott A. | Affinity mask assignment system and method for multiprocessor systems |
US7137033B2 (en) * | 2003-11-20 | 2006-11-14 | International Business Machines Corporation | Method, system, and program for synchronizing subtasks using sequence numbers |
US7810099B2 (en) * | 2004-06-17 | 2010-10-05 | International Business Machines Corporation | Optimizing workflow execution against a heterogeneous grid computing topology |
US20060167966A1 (en) * | 2004-12-09 | 2006-07-27 | Rajendra Kumar | Grid computing system having node scheduler |
US8756605B2 (en) * | 2004-12-17 | 2014-06-17 | Oracle America, Inc. | Method and apparatus for scheduling multiple threads for execution in a shared microprocessor pipeline |
US8051418B1 (en) * | 2005-03-21 | 2011-11-01 | Oracle America, Inc. | Techniques for providing improved affinity scheduling in a multiprocessor computer system |
US7406689B2 (en) * | 2005-03-22 | 2008-07-29 | International Business Machines Corporation | Jobstream planner considering network contention & resource availability |
EP1715405A1 (en) * | 2005-04-19 | 2006-10-25 | STMicroelectronics S.r.l. | Processing method, system and computer program product for dynamic allocation of processing tasks in a multiprocessor cluster platforms with power adjustment |
US8156500B2 (en) * | 2005-07-01 | 2012-04-10 | Microsoft Corporation | Real-time self tuning of planned actions in a distributed environment |
US9015501B2 (en) * | 2006-07-13 | 2015-04-21 | International Business Machines Corporation | Structure for asymmetrical performance multi-processors |
US7730119B2 (en) * | 2006-07-21 | 2010-06-01 | Sony Computer Entertainment Inc. | Sub-task processor distribution scheduling |
US20080263324A1 (en) * | 2006-08-10 | 2008-10-23 | Sehat Sutardja | Dynamic core switching |
US7941805B2 (en) * | 2006-08-15 | 2011-05-10 | International Business Machines Corporation | Affinity dispatching load balancer with precise CPU consumption data |
US8429656B1 (en) * | 2006-11-02 | 2013-04-23 | Nvidia Corporation | Thread count throttling for efficient resource utilization |
US20080195843A1 (en) * | 2007-02-08 | 2008-08-14 | Jaya 3D Llc | Method and system for processing a volume visualization dataset |
US8327363B2 (en) * | 2007-07-24 | 2012-12-04 | Microsoft Corporation | Application compatibility in multi-core systems |
US8544014B2 (en) * | 2007-07-24 | 2013-09-24 | Microsoft Corporation | Scheduling threads in multi-core systems |
US8250254B2 (en) * | 2007-07-31 | 2012-08-21 | Intel Corporation | Offloading input/output (I/O) virtualization operations to a processor |
US8397236B2 (en) * | 2007-08-24 | 2013-03-12 | Virtualmetrix, Inc. | Credit based performance management of computer systems |
JP5182792B2 (en) * | 2007-10-07 | 2013-04-17 | Alpine Electronics, Inc. | Multi-core processor control method and apparatus |
US7996346B2 (en) * | 2007-12-19 | 2011-08-09 | International Business Machines Corporation | Method for autonomic workload distribution on a multicore processor |
US8739165B2 (en) * | 2008-01-22 | 2014-05-27 | Freescale Semiconductor, Inc. | Shared resource based thread scheduling with affinity and/or selectable criteria |
US8539499B1 (en) * | 2008-02-18 | 2013-09-17 | Parallels IP Holdings GmbH | Symmetric multiprocessing with virtual CPU and VSMP technology |
US8561073B2 (en) * | 2008-09-19 | 2013-10-15 | Microsoft Corporation | Managing thread affinity on multi-core processors |
US9703595B2 (en) * | 2008-10-02 | 2017-07-11 | Mindspeed Technologies, Llc | Multi-core system with central transaction control |
US8069446B2 (en) * | 2009-04-03 | 2011-11-29 | Microsoft Corporation | Parallel programming and execution systems and techniques |
US8255554B2 (en) * | 2009-05-14 | 2012-08-28 | International Business Machines Corporation | Application resource model composition from constituent components |
US20110022870A1 (en) * | 2009-07-21 | 2011-01-27 | Microsoft Corporation | Component power monitoring and workload optimization |
US8484647B2 (en) * | 2009-07-24 | 2013-07-09 | Apple Inc. | Selectively adjusting CPU wait mode based on estimation of remaining work before task completion on GPU |
US8645963B2 (en) * | 2009-11-05 | 2014-02-04 | International Business Machines Corporation | Clustering threads based on contention patterns |
US8832403B2 (en) * | 2009-11-13 | 2014-09-09 | International Business Machines Corporation | Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses |
US8336056B1 (en) * | 2009-12-22 | 2012-12-18 | Gadir Omar M A | Multi-threaded system for data management |
US20120227045A1 (en) * | 2009-12-26 | 2012-09-06 | Knauth Laura A | Method, apparatus, and system for speculative execution event counter checkpointing and restoring |
US8484279B1 (en) * | 2010-01-29 | 2013-07-09 | Sprint Communications Company L.P. | System and method of distributed computing using embedded processors |
US8782653B2 (en) * | 2010-03-26 | 2014-07-15 | Virtualmetrix, Inc. | Fine grain performance resource management of computer systems |
US8285950B2 (en) * | 2010-06-03 | 2012-10-09 | International Business Machines Corporation | SMT/ECO mode based on cache miss rate |
JP5655403B2 (en) * | 2010-07-13 | 2015-01-21 | Fujitsu Limited | Multi-core processor system, schedule management program, and computer-readable recording medium recording the program |
US9239996B2 (en) * | 2010-08-24 | 2016-01-19 | Solano Labs, Inc. | Method and apparatus for clearing cloud compute demand |
US8914805B2 (en) * | 2010-08-31 | 2014-12-16 | International Business Machines Corporation | Rescheduling workload in a hybrid computing environment |
US8893133B2 (en) * | 2010-09-01 | 2014-11-18 | International Business Machines Corporation | Dynamic test scheduling by ordering tasks for performance based on similarities between the tasks |
US8621477B2 (en) * | 2010-10-29 | 2013-12-31 | International Business Machines Corporation | Real-time monitoring of job resource consumption and prediction of resource deficiency based on future availability |
US8601201B2 (en) * | 2010-11-09 | 2013-12-03 | Gridcentric Inc. | Managing memory across a network of cloned virtual machines |
US20120204008A1 (en) * | 2011-02-04 | 2012-08-09 | Qualcomm Incorporated | Processor with a Hybrid Instruction Queue with Instruction Elaboration Between Sections |
WO2012127641A1 (en) * | 2011-03-23 | 2012-09-27 | Hitachi, Ltd. | Information processing system |
US9158592B2 (en) * | 2011-05-02 | 2015-10-13 | Green Hills Software, Inc. | System and method for time variant scheduling of affinity groups comprising processor core and address spaces on a synchronized multicore processor |
US20140317389A1 (en) * | 2011-11-18 | 2014-10-23 | The Trustees Of The University Of Pennsylvania | Computational sprinting using multiple cores |
US9158587B2 (en) * | 2012-01-19 | 2015-10-13 | International Business Machines Corporation | Flexible task and thread binding with preferred processors based on thread layout |
US8924754B2 (en) * | 2012-02-02 | 2014-12-30 | Empire Technology Development Llc | Quality of service targets in multicore processors |
US8775762B2 (en) * | 2012-05-07 | 2014-07-08 | Advanced Micro Devices, Inc. | Method and apparatus for batching memory requests |
US8869157B2 (en) * | 2012-06-21 | 2014-10-21 | Breakingpoint Systems, Inc. | Systems and methods for distributing tasks and/or processing resources in a system |
US8874754B2 (en) * | 2012-10-16 | 2014-10-28 | Softwin Srl Romania | Load balancing in handwritten signature authentication systems |
2013
- 2013-04-06 US US13/858,036 patent/US9086925B2/en active Active
- 2013-04-06 US US13/858,039 patent/US9152467B2/en active Active
- 2013-04-06 US US13/858,034 patent/US20140208072A1/en not_active Abandoned
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5191652A (en) * | 1989-11-10 | 1993-03-02 | International Business Machines Corporation | Method and apparatus for exploiting communications bandwidth as for providing shared memory |
US5826081A (en) * | 1996-05-06 | 1998-10-20 | Sun Microsystems, Inc. | Real time thread dispatcher for multiprocessor applications |
US20080126486A1 (en) * | 2006-09-15 | 2008-05-29 | Bea Systems, Inc. | Personal messaging application programming interface for integrating an application with groupware systems |
US20120158967A1 (en) * | 2010-12-21 | 2012-06-21 | Sedayao Jeffrey C | Virtual core abstraction for cloud computing |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150089468A1 (en) * | 2013-09-20 | 2015-03-26 | Cray Inc. | Assisting parallelization of a computer program |
US9250877B2 (en) * | 2013-09-20 | 2016-02-02 | Cray Inc. | Assisting parallelization of a computer program |
US10761820B2 (en) | 2013-09-20 | 2020-09-01 | Cray, Inc. | Assisting parallelization of a computer program |
CN107431953A (en) * | 2015-03-10 | 2017-12-01 | 华为技术有限公司 | The method and apparatus of Business Stream shunting |
US10484910B2 (en) | 2015-03-10 | 2019-11-19 | Huawei Technologies Co., Ltd. | Traffic flow splitting method and apparatus |
US10203747B2 (en) | 2016-03-22 | 2019-02-12 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Workload placement based on heterogeneous compute performance per watt |
US10860499B2 (en) | 2016-03-22 | 2020-12-08 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd | Dynamic memory management in workload acceleration |
US10884761B2 (en) | 2016-03-22 | 2021-01-05 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd | Best performance delivery in heterogeneous computing unit environment |
CN111274015A (en) * | 2016-08-31 | 2020-06-12 | 华为技术有限公司 | Configuration method and device and data processing server |
CN107273542A (en) * | 2017-07-06 | 2017-10-20 | 华泰证券股份有限公司 | High concurrent method of data synchronization and system |
US10698737B2 (en) * | 2018-04-26 | 2020-06-30 | Hewlett Packard Enterprise Development Lp | Interoperable neural network operation scheduler |
WO2022111466A1 (en) * | 2020-11-24 | 2022-06-02 | 北京灵汐科技有限公司 | Task scheduling method, control method, electronic device and computer-readable medium |
Also Published As
Publication number | Publication date |
---|---|
US20140208327A1 (en) | 2014-07-24 |
US20140208331A1 (en) | 2014-07-24 |
US9086925B2 (en) | 2015-07-21 |
US9152467B2 (en) | 2015-10-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20140208072A1 (en) | User-level manager to handle multi-processing on many-core coprocessor-based systems |
US11797327B2 (en) | Dynamic virtual machine sizing | |
US10467725B2 (en) | Managing access to a resource pool of graphics processing units under fine grain control | |
US9367357B2 (en) | Simultaneous scheduling of processes and offloading computation on many-core coprocessors | |
US11010053B2 (en) | Memory-access-resource management | |
US8739171B2 (en) | High-throughput-computing in a hybrid computing environment | |
US8914805B2 (en) | Rescheduling workload in a hybrid computing environment | |
Becchi et al. | A virtual memory based runtime to support multi-tenancy in clusters with GPUs | |
US9063783B2 (en) | Coordinating parallel execution of processes using agents | |
US8539499B1 (en) | Symmetric multiprocessing with virtual CPU and VSMP technology | |
US20100325637A1 (en) | Allocation of resources to a scheduler in a process | |
US20080229319A1 (en) | Global Resource Allocation Control | |
Cadambi et al. | COSMIC: middleware for high performance and reliable multiprocessing on xeon phi coprocessors | |
US20090133029A1 (en) | Methods and systems for transparent stateful preemption of software system | |
Wu et al. | Transparent GPU sharing in container clouds for deep learning workloads |
US20080134187A1 (en) | Hardware scheduled smp architectures | |
Cheng et al. | Dynamic resource provisioning for iterative workloads on Apache Spark | |
Sajjapongse et al. | A flexible scheduling framework for heterogeneous CPU-GPU clusters | |
US8402191B2 (en) | Computing element virtualization | |
Bashizade et al. | Adaptive simultaneous multi-tenancy for gpus | |
US9378062B2 (en) | Interface between a resource manager and a scheduler in a process | |
KR101334842B1 (en) | Virtual machine manager for platform of terminal having function of virtualization and method thereof | |
Yu et al. | A multicore periodical preemption virtual machine scheduling scheme to improve the performance of computational tasks | |
Weinhold et al. | FFMK: an HPC OS based on the L4Re Microkernel | |
Su et al. | A zero-penalty container-based execution infrastructure for hadoop framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NEC LABORATORIES AMERICA, INC., NEW JERSEY Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CADAMBI, SRIHARI;COVIELLO, GIUSEPPE;LI, CHENG-HONG;AND OTHERS;SIGNING DATES FROM 20130406 TO 20130408;REEL/FRAME:030493/0364 |
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |