US20160034313A1 - Empirical determination of adapter affinity in high performance computing (HPC) environment - Google Patents
Empirical determination of adapter affinity in high performance computing (HPC) environment
- Publication number: US20160034313A1
- Application number: US 14/530,095
- Authority: US (United States)
- Prior art keywords: adapter, location, task, MCM, distributed computing
- Legal status: Granted
Classifications
- G06F9/5033 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering data affinity
- G06F9/5011 — Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
- G06F11/3433 — Recording or statistical evaluation of computer activity, e.g. of down time or of input/output operation, for performance assessment, for load management
- G06F9/461 — Saving or restoring of program or task context
- G06F9/528 — Mutual exclusion algorithms by using speculative mechanisms
- G06F2209/501 — Indexing scheme relating to G06F9/50: performance criteria
- G06F2209/502 — Indexing scheme relating to G06F9/50: proximity
Definitions
- HPC: high performance computing
- OSI: operating system instance
- IO (I/O): input/output
- MCM: multi-chip module
- QCM: quad chip module
- PE: processing element
- POE: parallel operating environment
- HCA: Host Channel Adapter
- PAMI: Parallel Active Message Interface
- RAM: random access memory
- ROM: read-only memory
- EPROM: erasable programmable read-only memory
- SRAM: static random access memory
- CD-ROM: compact disc read-only memory
- DVD: digital versatile disk
- Computer readable program instructions, of which one or more may collectively be referred to herein as "program code," may be identified herein based upon the application within which such instructions are implemented in a specific embodiment of the invention. However, any particular program nomenclature that follows is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
- FIG. 1 is not intended to limit the present invention; indeed, those skilled in the art will recognize that other alternative hardware and/or software environments may be used without departing from the scope of the invention.
- Embodiments consistent with the invention are directed in part to an empirical approach for determining adapter affinity in an HPC environment, e.g., in a parallel processing computing system incorporating a plurality of input/output (IO) adapters such as network adapters, storage adapters, Host Channel Adapters (HCA's), etc.
- Embodiments consistent with the invention determine adapter affinity information using an empirical approach that attempts to make a best guess as to which distributed computing component, among a plurality of distributed computing components, an IO adapter is coupled, based upon the performance observed in one or more performance tests or experiments. The information collected may then be used later as an input to future jobs or tasks to improve performance when executing those jobs or tasks.
- In one embodiment, the HPC environment is a Power7-based HPC environment in which computing resources are organized in a multi-level hierarchy, where one or more processing threads are implemented within one or more processing cores on one or more processors, and where one or more processors are disposed on one or more multi-chip modules (MCM's).
- MCM's in turn are organized into physical computing nodes (also referred to as octants), which in turn may be organized into modules, e.g., rack modules or drawers, supernodes, cabinets, data centers, etc.
- Individual MCM's incorporate one or more PCIe slots for interfacing with one or more IO adapters. The processors are packaged extremely close together with integrated caches and buses to enable fast data transfer and reliability, such that each MCM creates a substantially complete physical package for communications purposes.
- The parallel communication stack in such an environment also generally allows users to specify a parallel task to be scheduled to run on a set of processing elements, which within the context of the invention may be considered to be one or more hardware threads of execution on a processor or processing core, one or more processing cores on a processor, one or more processors on an MCM or spread across multiple or all MCM's in the HPC environment, or any combination of the same. In the discussion hereinafter, a processing element may be considered to be a hardware thread of execution.
- A task is also generally allocated a fixed set of adapter resource units, identified by an IO adapter's ID and its logical port number, to send and receive data.
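- As a concrete illustration of the entities discussed above, the short C sketch below models a processing element binding and an adapter resource unit (adapter ID plus logical port); the type and field names are illustrative assumptions only and do not correspond to any actual runtime data structure.

```c
#include <stdio.h>

/* Illustrative models only: a processing element bound to a location (an MCM
 * index) and an adapter resource unit identified by adapter ID and logical
 * port, whose location may be unknown when affinity is unsupported. */
typedef struct {
    int pe_id;   /* hardware thread of execution */
    int mcm;     /* MCM on which the PE resides */
} pe_binding_t;

typedef struct {
    int adapter_id;   /* IO adapter ID */
    int port;         /* logical port number allocated to the task */
    int mcm;          /* MCM whose PCIe slot holds the adapter, or -1 if unknown */
} adapter_resource_t;

int main(void) {
    pe_binding_t pe = { 0, 0 };
    adapter_resource_t unit = { 0, 1, -1 };   /* adapter location initially unknown */
    printf("task bound to PE %d on MCM %d; allocated adapter %d, port %d\n",
           pe.pe_id, pe.mcm, unit.adapter_id, unit.port);
    return 0;
}
```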
- The concept of adapter affinity performance within the context of this environment is generally that, when a task is scheduled on a particular processing element, e.g., a hardware thread of execution, processing core, processor, MCM, or other level of the hierarchy in a multi-level processing architecture, performance is optimized when the task selects, or is otherwise assigned, an adapter resource allocated and available to the task that is "closest" to the processing element(s) upon which it is scheduled to run, so as to achieve optimal bandwidth and latency performance given the distance and capacity over which the data has to be moved.
- Generally, an adapter resource is closest when it is plugged into the PCIe slot attached to the same MCM that contains the scheduled processing element(s). Thus, selection of an adapter resource to optimize affinity performance for a task incorporates an awareness of the "locations" of both the processing element(s) upon which a task is scheduled for execution and the adapter resource(s) that may be utilized by the task. These locations may be considered within an overall hierarchy of a plurality of distributed computing components in an HPC environment.
- In the illustrated embodiment, the locations of interest from the perspective of adapter affinity performance are generally defined at the MCM level in the hierarchy, as it is the particular MCM to which an IO adapter is coupled that determines adapter affinity performance in this environment. In other embodiments, the locations of interest may be defined at different levels in the hierarchy, e.g., at the processor level, the core level, the node level, the supernode level, the cabinet level, or any other level in a multi-level hierarchy where communication latency differs between distributed computing components and adapter resources in the same location and those in different locations. Therefore, while the embodiments discussed hereinafter refer to locations in terms of MCM's, the invention is not so limited.
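- To make the idea of comparing locations at a chosen level of the hierarchy concrete, the following C sketch compares two hypothetical coordinate tuples down to a selected level; the cabinet/supernode/node/MCM layout and all names here are illustrative assumptions rather than an actual system interface.

```c
#include <stdio.h>

/* Hypothetical coordinates for a component within the hardware hierarchy.
 * Two components share a "location" at a given level if their coordinates
 * match down to that level; the MCM level is the one used in this example. */
typedef enum { LEVEL_CABINET, LEVEL_SUPERNODE, LEVEL_NODE, LEVEL_MCM } level_t;

typedef struct { int cabinet, supernode, node, mcm; } coord_t;

static int same_location(const coord_t *a, const coord_t *b, level_t level) {
    if (a->cabinet != b->cabinet) return 0;
    if (level == LEVEL_CABINET) return 1;
    if (a->supernode != b->supernode) return 0;
    if (level == LEVEL_SUPERNODE) return 1;
    if (a->node != b->node) return 0;
    if (level == LEVEL_NODE) return 1;
    return a->mcm == b->mcm;
}

int main(void) {
    coord_t pe      = { 0, 0, 0, 0 };   /* processing element on MCM 0 */
    coord_t adapter = { 0, 0, 0, 1 };   /* adapter attached to MCM 1 */
    printf("same node: %d, same MCM: %d\n",
           same_location(&pe, &adapter, LEVEL_NODE),
           same_location(&pe, &adapter, LEVEL_MCM));   /* prints 1, 0 */
    return 0;
}
```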
- In some HPC environments, e.g., Power6-based HPC environments, the information regarding which IO adapter is plugged into which MCM's slot, referred to herein as adapter affinity information, is known a priori by the HPC environment, is generally stored at startup or is known based upon the system architecture, and generally may be obtained by querying the component firmware. For the purposes of this disclosure, this type of adapter affinity information will be referred to hereinafter as "preconfigured" adapter affinity information. This preconfigured adapter affinity information may be used for optimizing performance for parallel jobs that are allocated multiple adapters for communications, as both the location of each IO adapter and the location of each processing element are generally known.
- In other HPC environments, however, this location information is not made available to jobs or tasks, or is otherwise not supported by the component firmware, and accordingly, adapter affinity performance may not be achieved programmatically through the use of queries to component firmware or another layer in a software stack. Environments of this type are therefore referred to herein as environments where preconfigured adapter affinity information is unsupported, and it will be appreciated that such lack of support may be due to limitations of hardware, limitations of software, or some combination of the same. In a Power7-based HPC environment, for example, preconfigured adapter affinity information is generally not supported due to the fact that the IO adapters are plugged into PCIe slots.
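- The distinction between preconfigured and unsupported affinity information suggests a simple fallback flow, sketched below; query_adapter_mcm_from_firmware() is a purely hypothetical stand-in (no such call is defined by the patent or by any particular firmware interface), returning -1 to model an environment where the query is unsupported.

```c
#include <stdio.h>

/* Hypothetical stand-in: on platforms with preconfigured adapter affinity
 * information this would return the MCM an adapter is attached to; here it
 * returns -1 to model an environment where that query is unsupported. */
static int query_adapter_mcm_from_firmware(int adapter_id) {
    (void)adapter_id;
    return -1;   /* unknown: e.g., the adapter is plugged into a PCIe slot */
}

int main(void) {
    int mcm = query_adapter_mcm_from_firmware(0);
    if (mcm >= 0)
        printf("adapter 0 is attached to MCM %d (preconfigured)\n", mcm);
    else
        printf("adapter 0 location unknown; fall back to empirical determination\n");
    return 0;
}
```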
- Embodiments consistent with the invention therefore provide an empirical method in which a user sets environment variables to specify a variety of mappings of IO adapters to MCM's in a best-guess approach, and empirically runs a benchmark test to obtain the performance for each mapping. The performance from the benchmark test may then be used in a best-effort approach to predict which adapter is plugged into which MCM.
- This information can then be saved, e.g., as a mapping that assigns IO adapters to specific locations (here, MCM's) so that future jobs or tasks can access the information to make choices in selecting which IO adapter to use for best performance.
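- A minimal sketch of such a user-specified mapping follows, assuming a hypothetical environment variable named ADAPTER_MCM_MAP with one adapter:MCM pair per entry; the variable name and format are inventions for illustration and are not part of the IBM communication stack.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a hypothetical environment variable of the form
 *   ADAPTER_MCM_MAP="0:0,1:1"
 * meaning adapter 0 is speculated to be attached to MCM 0 and adapter 1 to
 * MCM 1. The variable name and encoding are illustrative only. */
#define MAX_ADAPTERS 16

static int parse_mapping(int map[MAX_ADAPTERS]) {
    const char *s = getenv("ADAPTER_MCM_MAP");
    if (s == NULL) return -1;
    char buf[256];
    strncpy(buf, s, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        int adapter, mcm;
        if (sscanf(tok, "%d:%d", &adapter, &mcm) == 2 &&
            adapter >= 0 && adapter < MAX_ADAPTERS)
            map[adapter] = mcm;
    }
    return 0;
}

int main(void) {
    int map[MAX_ADAPTERS];
    for (int i = 0; i < MAX_ADAPTERS; i++) map[i] = -1;   /* -1 = unknown */
    if (parse_mapping(map) == 0)
        for (int i = 0; i < MAX_ADAPTERS; i++)
            if (map[i] >= 0)
                printf("adapter %d speculated to be on MCM %d\n", i, map[i]);
    return 0;
}
```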
- FIG. 2 next illustrates an example HPC system or environment 80 including a plurality of MCM's 82, each with a plurality of processing elements (PE's) 84, and interconnected by a high speed MCM bus 86, which will be used hereinafter to further illustrate the herein-described empirical approach. Each MCM 82 is also coupled to an associated IO adapter 88 (here a network adapter), which is in turn coupled to a network switch 90.
- In this example, two MCM's (MCM 0 and MCM 1) are shown, along with two network adapters (adapter 0 and adapter 1) and eight PE's (PE 0 to PE 7).
- Assume, for example, that task 0 is bound to PE 0 on MCM 0, and is allocated ports on both adapters 0 and 1. Also assume that task 0's allocated memory 94 is resident in MCM 0. If task 0 is to send some data out to switch 90, it has been found that optimal performance is generally achieved when task 0 uses adapter 0 to send out data, instead of adapter 1, as the latter scenario would take a longer path to transfer task 0's data from MCM 0 to MCM 1 over MCM bus 86, and then out to adapter 1. In some instances, the performance gain from selecting the closest adapter, i.e., the adapter that is on the same MCM as a task's bound PE's, is about 5-10 percent, which may be highly beneficial for certain applications.
- It is also possible, however, that adapter 0 could be plugged into the PCIe slot attached to MCM 1, and adapter 1 could be plugged into the PCIe slot attached to MCM 0. If the information as to which PCIe slot each adapter is plugged into is not available from the component firmware, as is generally the case in a Power7-based HPC environment, task 0 is not aware of which adapter is the closest one over which to send out data. If task 0 picks the adapter plugged into the PCIe slot attached to the other MCM, performance would generally degrade due to the longer data transfer path.
- Once empirically determined, this mapping information may be saved so that future jobs/tasks can retrieve the information and make appropriate choices in selecting which adapter to use for best performance. Moreover, this discovery process may be run once at the completion of system configuration, such that runtime empirical testing is not required for a particular job or task.
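- One way such a mapping could be persisted and later retrieved is sketched below; the file name and one-line-per-adapter text format are assumptions made for illustration.

```c
#include <stdio.h>

/* Persist speculative adapter affinity information so later jobs can reuse
 * it. The file name and "adapter N mcm M" line format are illustrative. */
static int save_mapping(const char *path, const int *adapter_mcm, int n) {
    FILE *f = fopen(path, "w");
    if (f == NULL) return -1;
    for (int i = 0; i < n; i++)
        fprintf(f, "adapter %d mcm %d\n", i, adapter_mcm[i]);
    return fclose(f);
}

static int load_mapping(const char *path, int *adapter_mcm, int n) {
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;
    int a, m;
    while (fscanf(f, "adapter %d mcm %d", &a, &m) == 2)
        if (a >= 0 && a < n) adapter_mcm[a] = m;
    return fclose(f);
}

int main(void) {
    int mapping[2] = { 0, 1 };        /* adapter 0 -> MCM 0, adapter 1 -> MCM 1 */
    save_mapping("adapter_affinity.map", mapping, 2);
    int loaded[2] = { -1, -1 };
    load_mapping("adapter_affinity.map", loaded, 2);
    printf("adapter 0 -> MCM %d, adapter 1 -> MCM %d\n", loaded[0], loaded[1]);
    return 0;
}
```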
- FIG. 3 next illustrates a routine 100 for initializing a task in the HPC environment of FIG. 1. A task may be initialized, for example, in connection with initiating a parallel job in the HPC environment, whereby one or more tasks are instantiated throughout the HPC environment on behalf of the job. A communication stack, e.g., the IBM parallel communication stack available from International Business Machines Corporation, may be configured to select one or more adapters for a task in connection with initialization of the task.
- As shown in FIG. 3, for example, startup of a task may be performed (block 102), performing various task startup operations that will be apparent to one of ordinary skill in the art having the benefit of the instant disclosure. During startup, the task is generally allocated one or more adapters, and the task is scheduled (bound) to a certain set of PE's (i.e., one or more PE's), which may, in the illustrated embodiment, include one or more hardware threads of execution, one or more processing cores, and/or one or more processors on a certain MCM or MCM's.
- Next, the location(s) of the bound PE's are determined, e.g., using a query to a communications library supported by the parallel communication stack. In the illustrated embodiment, the location refers to the MCM upon which a particular PE resides, although in other embodiments the location may be defined at a different level in the hierarchy of distributed computing components. The location(s) of the allocated adapters are also determined; however, as noted above, these locations generally cannot be ascertained via retrieval of preconfigured location data from a communications library, so the location(s) of the allocated adapters may instead be determined using an empirical approach.
- Next, the communications library searches, e.g., by comparing the locations for the bound processing elements and the allocated adapters, for a location that is common to both an allocated adapter and a bound processing element. If such a location is found, block 114 passes control to block 116 to select an allocated adapter at the common location as the primary adapter for the task to send/receive data. If such a location is not found, however, block 114 passes control to block 118 to select another allocated adapter, e.g., at a next closest location to a bound processing element, as the primary adapter. Upon completion of block 116 or block 118, routine 100 is complete. Routine 100 therefore provides adapter affinity information to a parallel task and generally improves performance in the HPC environment of FIG. 1 by reducing the distance data will be moved from its buffer to its communication port and by using a faster integrated bus on the MCM.
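- A simplified C sketch of the selection logic described for blocks 114-118 is shown below, reducing locations to MCM indices; the fallback simply returns the first allocated adapter, which is a simplification of the "next closest location" selection performed in block 118.

```c
#include <stdio.h>

/* Select a primary adapter for a task: prefer an allocated adapter whose
 * location (MCM) matches the location of a bound processing element; if no
 * common location exists, fall back to another allocated adapter. */
#define NUM_BOUND_PES  2
#define NUM_ADAPTERS   2

static int select_primary_adapter(const int pe_mcm[], int num_pes,
                                  const int adapter_mcm[], int num_adapters) {
    for (int a = 0; a < num_adapters; a++)
        for (int p = 0; p < num_pes; p++)
            if (adapter_mcm[a] == pe_mcm[p])
                return a;        /* common location found (block 116) */
    return 0;                    /* simplified fallback (block 118) */
}

int main(void) {
    int pe_mcm[NUM_BOUND_PES]     = { 0, 0 };   /* task bound to PE's on MCM 0 */
    int adapter_mcm[NUM_ADAPTERS] = { 1, 0 };   /* adapter 0 on MCM 1, adapter 1 on MCM 0 */
    int primary = select_primary_adapter(pe_mcm, NUM_BOUND_PES,
                                         adapter_mcm, NUM_ADAPTERS);
    printf("primary adapter: %d\n", primary);   /* prints 1 */
    return 0;
}
```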
- Routine 100 may be automated programmatically, e.g., with scripting, or, in the alternative, may incorporate some administrator involvement. A script may also generate a pattern for an environment variable that assists in identifying mappings more easily for HPC environments with more complex configurations, such as 4 MCM's per node, 2 adapters per MCM, etc.
- In the illustrated embodiment, an environment variable may be configured to speculate as to which IO adapter is connected to which MCM, such that a communications library may access the environment variable and select an IO adapter based on the setting. If a speculated mapping does provide communication performance better than the others, then it may be assumed that the IO adapter is in fact plugged into that MCM. This speculative adapter affinity information may then be stored or saved and used as input for later jobs so that optimal performance using affinity may be achieved. Put another way, an assumption may be made, using this environment variable, as to which adapter is plugged into which MCM.
- Benchmark tests may then be run using different mappings to obtain comparative performance results, from which it may be determined which among the different mappings yields optimum performance. From these comparative performance results, therefore, speculative locations may be determined for each adapter in an HPC environment. Desirably, processing element bindings are kept constant throughout benchmark testing, so that tasks are executed on the same processing elements during each benchmark test and comparable test results, e.g., bandwidth/latency performance, may be collected for each mapping. After the data is collected, it may be analyzed (e.g., using a script), and a pattern generally emerges showing that a certain mapping produces the best bandwidth/latency performance, thereby empirically determining the likely locations of the adapters.
- Turning now to FIG. 4, routine 120 may initially generate a plurality of candidate mappings (e.g., in the form of candidate environment variables as described above) representing potential mappings of various adapters to various locations.
- For the purposes of this example, an HPC system similar to that illustrated in FIG. 2, with two adapters (adapter 0 and adapter 1) and two locations represented by MCM 0 and MCM 1, will be assumed to be the environment in which routine 120 is executed, although it will be appreciated that routine 120 is not limited to use in such an environment. In this environment, the candidate mappings may include, for example: adapter 0 mapped to MCM 0 and adapter 1 mapped to MCM 1; adapter 0 mapped to MCM 1 and adapter 1 mapped to MCM 0; both adapters mapped to MCM 0; and both adapters mapped to MCM 1.
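- A small sketch of candidate-mapping generation under these assumptions (two adapters, two MCM locations) follows; it simply enumerates every assignment of adapters to MCM's, which reproduces the four mappings listed above.

```c
#include <stdio.h>

#define NUM_ADAPTERS 2
#define NUM_MCMS     2

/* Enumerate every candidate mapping of adapters onto MCM locations using an
 * odometer over the mapping array; for 2 adapters and 2 MCM's this yields
 * the four candidate mappings listed above. */
int main(void) {
    int map[NUM_ADAPTERS] = { 0 };
    int count = 0;
    for (;;) {
        printf("candidate %d:", count++);
        for (int a = 0; a < NUM_ADAPTERS; a++)
            printf(" adapter %d -> MCM %d%s", a, map[a],
                   a + 1 < NUM_ADAPTERS ? "," : "\n");
        /* advance the odometer to the next assignment */
        int i = 0;
        while (i < NUM_ADAPTERS && ++map[i] == NUM_MCMS) map[i++] = 0;
        if (i == NUM_ADAPTERS) break;
    }
    printf("%d candidate mappings\n", count);   /* prints 4 for this example */
    return 0;
}
```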
- Next, processing element bindings are fixed for first and second tasks (tasks 0 and 1) such that the tasks will run on the same locations (e.g., MCM's) during each benchmark test. Block 126 then initiates a loop to compare the performance of different mappings in each of two directions: first, from task 0 to task 1, and second, from task 1 to task 0. For each iteration of this loop, block 128 selects the appropriate test direction (e.g., from task 0 to task 1 or from task 1 to task 0), and block 130 initiates a FOR loop to test each candidate mapping.
- For each such mapping, block 132 binds the first and second tasks (task 0 and task 1) to the fixed PE bindings (e.g., task 0 to MCM 0 and task 1 to MCM 1), and block 134 exports the candidate mapping to essentially activate the candidate mapping in the HPC environment. Next, a benchmark test is run to send data between the tasks in the selected test direction (e.g., from task 0 to task 1).
- In the illustrated embodiment, a benchmark test such as a lightweight bandwidth/latency test optimized for the Parallel Active Message Interface (PAMI), or any other combination of suitable benchmark tests, may be used, and as a result of testing, bandwidth and latency performance from task 0 may be obtained.
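- The patent does not specify the benchmark's implementation, so the sketch below uses MPI purely as a generic stand-in for a lightweight one-direction bandwidth probe between task 0 and task 1; the message size, iteration count, and the use of MPI rather than PAMI are all assumptions made for illustration.

```c
/* Minimal one-direction bandwidth probe in the spirit of the lightweight
 * bandwidth/latency test described above. Compile with mpicc and run with
 * exactly two ranks: rank 0 sends, rank 1 receives and acknowledges. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MiB per message */
#define ITERS     100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    /* short acknowledgement so rank 0 times delivery, not just local sends */
    if (rank == 0)
        MPI_Recv(buf, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else if (rank == 1)
        MPI_Send(buf, 0, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("mapping under test: %.1f MB/s\n",
               (double)MSG_BYTES * ITERS / elapsed / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```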
- The result of the test is then saved in block 138. For each candidate mapping, the PAMI library will pick out a different adapter as the affinity adapter for each task; the performance result will therefore generally vary with each candidate mapping, as the data transfer paths will be different for each mapping.
- Once each candidate mapping has been tested, block 130 passes control to block 140 to analyze the saved results and speculate, based upon the comparative performance, into which MCM adapter 0 is plugged. Control then passes to block 126 to determine whether both directions have been tested. After testing the first direction, therefore, block 126 passes control to block 128 to essentially repeat blocks 128-140 for the second test direction (where task 1 sends data to task 0). Once both directions have been tested, block 126 passes control to block 142 to generate and save an optimal mapping based upon the speculated adapter locations, so that tasks in future jobs can access and use the optimal mapping in the affinity flow corresponding to FIG. 3, as described above.
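- A toy sketch of the analysis step suggested by blocks 140 and 142 is shown below: given per-candidate bandwidth results (placeholder numbers, not measured data), it selects the best-performing candidate and treats its adapter-to-MCM assignments as the speculative affinity information to save.

```c
#include <stdio.h>

/* Pick the candidate mapping with the best measured bandwidth and treat its
 * speculated adapter locations as the mapping to save for future jobs. The
 * bandwidth figures below are placeholders, not measured data. */
#define NUM_CANDIDATES 4
#define NUM_ADAPTERS   2

int main(void) {
    /* candidate mapping: MCM assigned to each adapter */
    int candidate[NUM_CANDIDATES][NUM_ADAPTERS] = {
        { 0, 1 }, { 1, 0 }, { 0, 0 }, { 1, 1 }
    };
    /* illustrative per-candidate bandwidth results (MB/s) from the benchmark */
    double bandwidth[NUM_CANDIDATES] = { 3150.0, 2890.0, 3010.0, 2950.0 };

    int best = 0;
    for (int c = 1; c < NUM_CANDIDATES; c++)
        if (bandwidth[c] > bandwidth[best])
            best = c;

    printf("best candidate: %d\n", best);
    for (int a = 0; a < NUM_ADAPTERS; a++)
        printf("speculate adapter %d is attached to MCM %d\n",
               a, candidate[best][a]);
    return 0;
}
```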
- It will be appreciated that routine 120 may be adapted to determine speculative adapter affinity information for various types of HPC environments, e.g., where locations are defined at levels other than the MCM level, where multiple IO adapters may be coupled to a particular location, where other factors impact the relative performance of an IO adapter based upon its location, where more than two tasks at more than two locations are tested, for other types of IO adapters, etc. In addition, routine 120 may perform alternate and/or additional performance tests for each candidate mapping in other embodiments.
- Routine 120 may also be implemented in a number of manners consistent with the invention, e.g., as one or more scripts, within a library or other software component, as purely automated operation, as a semi-automated operation incorporating user input, or in other manners that will be apparent to one of ordinary skill in the art having the benefit of the instant disclosure.
Description
- This application is a continuation of U.S. patent application Ser. No. 14/445,546, filed on Jul. 29, 2014 by Wen C. Chen, et al. (ROC920130064US1) entitled “EMPIRICAL DETERMINATION OF ADAPTER AFFINITY IN HIGH PERFORMANCE COMPUTING (HPC) ENVIRONMENT,” the entire disclosure of which is incorporated by reference herein.
- The invention is generally related to computers and computer software, and in particular, to high performance computing (HPC) environments.
- Computing technology has advanced at a remarkable pace, with each subsequent generation of computing system increasing in performance, functionality, and storage capacity, often at reduced cost. However, despite these advances, many scientific and business applications still demand massive computing power, which can only be met by extremely high performance computing (HPC) systems. One particular type of computing system architecture that is often used in high performance applications is a parallel processing computing system.
- Generally, a parallel processing computing system comprises a plurality of physical computing nodes and is configured with an HPC application environment, e.g., including a runtime environment that supports the execution of a parallel application across multiple physical computing nodes. Some parallel processing computing systems, which may also be referred to as massively parallel processing computing systems, may have hundreds or thousands of individual physical computing nodes, and provide supercomputer class performance. Each physical computing node is typically of relatively modest computing power and generally includes one or more processors and a set of dedicated memory devices, and is configured with an operating system instance (OSI), as well as components defining a software stack for the runtime environment. To execute a parallel application, a cluster is generally created consisting of physical computing nodes, and one or more parallel tasks are executed within an OSI in each physical computing node and using the runtime environment such that tasks may be executed in parallel across all physical computing nodes in the cluster.
- Performance in parallel processing computing systems can be dependent upon the communication costs associated with communicating data between the components in such systems. Accessing a memory directly coupled to a processor in one physical computing node, for example, may be one or more orders of magnitude faster than accessing a memory on a different physical computing node. In addition, retaining the data within a processor and/or directly coupled memory when a processor switches between different tasks can avoid having to reload the data. Accordingly, organizing the tasks executed in a parallel processing computing system to localize operations and data and minimize the latency associated with communicating data between components can have an appreciable impact on performance. For example, tasks can be assigned or bound to particular processors or physical nodes using a concept commonly referred to as affinity such that the tasks will be scheduled for execution if at all possible on the processors or physical nodes to which such tasks have an affinity.
- Likewise, performance can be impacted by the relationship between tasks and other types of components in a parallel processing computing system. As one example, parallel processing computing systems may support multiple input/output (IO) adapters, e.g., network adapters for communication of data over a network. Furthermore, as with processors and memories distributed throughout the multiple physical computing nodes of a parallel processing computing system, distributing network adapters in this manner may result in variations in latency and bandwidth for tasks accessing such network adapters, based upon where the tasks are executed relative to where the network adapters are located. Accordingly, tasks may also be assigned or bound to particular network adapters in a system based upon adapter affinity.
- In some parallel processing computing systems, however, the physical locations of network and other IO adapters resident in such systems may not be available for task scheduling purposes. As such, in such systems it may not be possible to schedule tasks in a manner that optimizes or at least considers adapter performance.
- The invention addresses these and other problems associated with the prior art by providing a method, apparatus and program product that utilize an empirical approach to determine the locations of one or more IO adapters in an HPC environment. Performance tests may be run using a plurality of candidate mappings that map IO adapters to various locations in the HPC environment, and based upon the results of such testing, speculative adapter affinity information may be generated that assigns one or more IO adapters to one or more locations to optimize adapter affinity performance for subsequently-executed tasks.
- Therefore, consistent with one aspect of the invention, adapter affinity may be determined in a high performance computing (HPC) environment of the type including a plurality of distributed computing components defining a plurality of locations and a plurality of input/output (IO) adapters, with each distributed computing component including at least one processing element, and with each IO adapter coupled to a distributed computing component among the plurality of distributed computing components. For each of a plurality of candidate mappings that speculatively map at least one IO adapter to at least one location among the plurality of locations, a performance test may be run for a task executed by a processing element in a distributed computing component at a first location among the plurality of locations, where the plurality of candidate mappings includes first and second candidate mappings, where the first candidate mapping maps a first IO adapter among the plurality of IO adapters to the first location, and the second candidate mapping maps a second IO adapter among the plurality of IO adapters to the first location, and where running the performance test respectively generates first and second test results for the first and second candidate mappings. Speculative adapter affinity information may be generated that assigns at least one IO adapter to at least one location among the plurality of locations based upon the performance test run for each of the plurality of candidate mappings, including assigning the first IO adapter to the first location based upon a comparison of the first and second test results for the first and second candidate mappings.
- These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there is described exemplary embodiments of the invention.
- FIG. 1 is a block diagram of an example hardware and software environment suitable for empirically determining adapter affinity in a manner consistent with the invention.
- FIG. 2 is a block diagram of an example adapter affinity determination operation consistent with the invention.
- FIG. 3 is a flowchart illustrating an example sequence of operations for initializing a task in the HPC environment of FIG. 1.
- FIG. 4 is a flowchart illustrating an example sequence of operations for empirically generating a mapping in the HPC environment of FIG. 1.
- Embodiments consistent with the invention utilize an empirical approach to determine the locations of one or more IO adapters in an HPC environment. Performance tests may be run using a plurality of candidate mappings that map IO adapters to various locations in the HPC environment, and based upon the results of such testing, speculative adapter affinity information may be generated that assigns one or more IO adapters to one or more locations to optimize adapter affinity performance for subsequently-executed tasks.
- In this regard, an HPC environment consistent with the invention may be considered to include a hardware and/or software environment suitable for hosting an HPC application, generally implemented using a plurality of parallel tasks. From a hardware perspective, an HPC environment includes a plurality of distributed computing components, organized in one or more hierarchical levels, and supporting the concurrent execution of a plurality of hardware threads of execution. In many production environments, an HPC application may be implemented using hundreds, thousands, or more parallel tasks running on hundreds, thousands, or more hardware threads of execution.
- Numerous variations and modifications will be apparent to one of ordinary skill in the art, as will become apparent from the description below. Therefore, the invention is not limited to the specific implementations discussed herein.
- Turning to the Drawings, wherein like parts denote like numbers throughout the several views, FIG. 1 illustrates the principal hardware and software components in an apparatus 50 capable of implementing an HPC environment consistent with the invention. Apparatus 50 is illustrated as an HPC system incorporating a plurality of physical computing nodes 52 coupled to one another over a cluster network 54, and including a plurality of processors 56 coupled to a plurality of memory devices 58 representing the computational and memory resources of the HPC system.
- Apparatus 50 may be implemented using any of a number of different architectures suitable for executing HPC applications, e.g., a supercomputer architecture. For example, in one embodiment, apparatus 50 may be implemented as a Power7 IH-based system available from International Business Machines Corporation. In this implementation, processors 56 and memory devices 58 may be disposed on multi-chip modules 60, e.g., quad chip modules (QCM's), which in turn may be disposed within a physical computing node 52 along with a hub chip 64 that provides access to one or more input/output (I/O) adapters 66, which may be used to access network, storage and other external resources. Multiple (e.g., eight) physical computing nodes 52 (also referred to as octants) may be organized together into modules 62, e.g., rack modules or drawers, and physical computing nodes may be further organized into supernodes, cabinets, data centers, etc. It will be appreciated that other architectures suitable for executing HPC applications may be used, e.g., any of the Blue Gene/L, Blue Gene/P, and Blue Gene/Q architectures available from International Business Machines Corporation, among others. Therefore, the invention is not limited to use with the Power7 IH architecture disclosed herein.
- Each processor 56 may be implemented as a single or multi-threaded processor and/or as a single or multi-core processor, while each memory 58 may be considered to include one or more levels of memory devices, e.g., a DRAM-based main storage, as well as one or more levels of data, instruction and/or combination caches, with certain caches either serving individual processors or multiple processors as is well known in the art. In addition, the memory of apparatus 50 may be considered to include memory storage physically located elsewhere in apparatus 50, e.g., any cache memory in a processor, as well as any storage capacity used as a virtual memory, e.g., as stored on a mass storage device or on another computer coupled to apparatus 50.
- Apparatus 50 operates under the control of one or more kernels, hypervisors, operating systems, etc., and executes or otherwise relies upon various computer software applications, components, programs, objects, modules, data structures, etc., as will be described in greater detail below. Moreover, various applications, components, programs, objects, modules, etc. may also execute on one or more processors in another computer coupled to apparatus 50 via a network, e.g., in a distributed or client-server computing environment, whereby the processing required to implement the functions of a computer program may be allocated to multiple computers over a network.
- For example, FIG. 1 illustrates various software components 70-76 forming a software stack that may be resident within the memories 58 in an MCM 60. A hypervisor 70 may host one or more operating system instances 72, within which may reside one or more tasks 74. Additional components, e.g., a job management component 76, parallel operating environment (POE) or other load balancing functionality, etc., may further support the execution of parallel tasks and jobs in apparatus 50. It will be appreciated that additional and/or alternate components may be supported in a software stack for an HPC environment, and that components may be replicated and/or distributed among the various memories, MCM's, nodes, etc. in apparatus 50. In the illustrated embodiment, for example, an IBM Power HPC software stack available from International Business Machines Corporation may be used, although the invention is not so limited.
- It will be appreciated that the present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing one or more processors to carry out aspects of the present invention.
- The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
- In addition, computer readable program instructions, of which one or more may collectively be referred to herein as “program code,” may be identified herein based upon the application within which such instructions are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature that follows is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature. Furthermore, given the typically endless number of manners in which computer programs may be organized into routines, procedures, methods, modules, objects, and the like, as well as the various manners in which program functionality may be allocated among various software layers that are resident within a typical computer (e.g., operating systems, libraries, API's, applications, applets, etc.), it should be appreciated that the invention is not limited to the specific organization and allocation of program functionality described herein.
- Those skilled in the art will recognize that the example environment illustrated in
FIG. 1 is not intended to limit the present invention. Indeed, those skilled in the art will recognize that other alternative hardware and/or software environments may be used without departing from the scope of the invention. - Embodiments consistent with the invention are directed in part to an empirical approach for determining adapter affinity in an HPC environment, e.g., in a parallel processing computing system incorporating a plurality of input/output (IO) adapters such as network adapters, storage adapters, Host Channel Adapters (HCA's), etc. In particular, embodiments consistent with the invention determine adapter affinity information empirically, making a best guess as to which distributed computing component, among a plurality of such components, an IO adapter is coupled to, based upon the results of one or more performance tests or experiments. The collected information may then be used later as an input to future jobs or tasks to improve performance when executing those jobs or tasks.
- In the illustrated embodiment, for example, the HPC environment is a Power7-based HPC environment in which computing resources are organized in a multi-level hierarchy where one or more processing threads are implemented within one or more processing cores on one or more processors, and where one or more processors are disposed on one or more multi-chip modules (MCM's). MCM's in turn are organized into physical computing nodes (also referred to as octants), which in turn may be organized into modules, e.g., rack modules or drawers, supernodes, cabinets, data centers, etc. Through this organization, thousands or millions of individual parallel threads of execution may be supported for concurrent execution within an HPC environment.
- Also, within a Power7-based HPC environment, individual MCM's incorporate one or more PCIe slots for interfacing with one or more IO adapters. In addition, the processors are packaged extremely close together with integrated caches and buses to enable fast data transfer and reliability, such that each MCM creates a substantially complete physical package for communications purposes. The parallel communication stack in such an environment also generally allows users to specify a parallel task to be scheduled to run on a set of processing elements, which within the context of the invention may be considered to be one or more hardware threads of execution on a processor or processing core, one or more processing cores on a processor, one or more processors on an MCM or spread across multiple or all MCM's in the HPC environment, or any combination of same. In the embodiment discussed hereinafter, for example, a processing element may be considered to be a hardware thread of execution. A task is also generally allocated with a fixed set of adapter resource units, identified by an IO adapter's ID and its logical port number, to send and receive data.
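By way of illustration only, the fixed set of adapter resource units allocated to a task might be modeled in software along the following lines; this is a minimal Python sketch, and the class and field names are assumptions made for this description rather than types defined by any actual communication stack.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class AdapterResourceUnit:
    """One adapter resource unit: an IO adapter ID plus a logical port number."""
    adapter_id: int
    logical_port: int

@dataclass
class TaskAllocation:
    """A parallel task's fixed allocation of processing elements and adapter resources."""
    task_id: int
    bound_pes: List[int]                 # hardware threads the task is scheduled on
    adapters: List[AdapterResourceUnit]  # adapter resource units usable by the task

# Example: task 0 bound to one hardware thread and given one port on each of two adapters.
task0 = TaskAllocation(task_id=0, bound_pes=[0],
                       adapters=[AdapterResourceUnit(0, 0), AdapterResourceUnit(1, 0)])
```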
- The concept of adapter affinity performance within the context of this environment is generally as follows: when a task is scheduled on a particular processing element, e.g., a hardware thread of execution, processing core, processor, MCM, or other level of the hierarchy in a multi-level processing architecture, performance is optimized when the task selects, or is otherwise assigned, an adapter resource, allocated and available to the task, that is "closest" to the processing element(s) on which it was scheduled to run, thereby achieving optimal bandwidth and latency performance given the distance over which, and the capacity through which, the data has to be moved.
- To achieve this affinity performance at run time, however, a task generally must be aware of both which processing element(s) it is scheduled on and which adapter resource or resources are the “closest” to that processing element. In a Power7-based HPC environment, for example, an adapter resource is closest when it is plugged into the PCIe slot attached to the same MCM that contains the scheduled processing element(s).
- Thus, more generally, selection of an adapter resource to optimize affinity performance for a task incorporates an awareness of the “locations” of both the processing element(s) upon which a task is scheduled for execution and the adapter resource(s) that may be utilized by the task. These locations may be considered within an overall hierarchy of a plurality of distributed computing components in an HPC environment. Thus, within the context of the aforementioned Power7-based HPC environment, the locations of interest from the perspective of adapter affinity performance within the hierarchy of distributed computing components are generally defined at the MCM level in the hierarchy, as it is to a particular MCM that an IO adapter is generally coupled that determines adapter affinity performance in this environment.
- It will be appreciated, however, that in other HPC environments, the locations of interest may be defined at different levels in the hierarchy, e.g., at the processor level, the core level, the node level, the supernode level, the cabinet level, or any other level in a multi-level hierarchy where communication latency differs between distributed computing components and adapter resources in the same location and distributed computing components and adapter resources in different locations. Therefore, while the embodiments discussed hereinafter will refer to locations in terms of MCM's, the invention is not so limited.
- In some HPC environments, e.g., Power6-based HPC environments, the information regarding which IO adapter is plugged into which MCM's slots, referred to herein as adapter affinity information, is known a priori by the HPC environment, is generally stored at startup or is known based upon the system architecture, and generally may be obtained by querying the component firmware. For the purposes of this disclosure, this type of adapter affinity information will be referred to hereinafter as “preconfigured” adapter affinity information.
- This preconfigured adapter affinity information may be used for optimizing performance for parallel jobs that are allocated with multiple adapters for communications, as both the location of each IO adapter and the location of processing element is generally known. However, in other HPC environments, e.g., the aforementioned Power7-based HPC environment, this location information is not made available to jobs or tasks, or is otherwise not supported by the component firmware, and accordingly, adapter affinity performance may not be achieved programmatically through the use of queries to a component firmware or other layer in a software stack. Environments of this type are therefore referred to herein as environments where preconfigured adapter affinity information is unsupported, and it will be appreciated that such lack of support may be due to limitations of hardware, limitations of software, or some combination of same. For example, in the aforementioned Power7-based HPC environment, preconfigured adapter affinity information is generally not supported due to the fact that the IO adapters are plugged into PCIe slots.
- In embodiments consistent with the invention, on the other hand, an empirical method is provided for a user to set environmental variables to specify a variety of mappings of IO adapters to an MCM in a best guess approach, and empirically run a benchmark test to obtain the performance for each mapping. The performance from the benchmark test may then be used in a best-effort approach to predict which adapter is plugged into which MCM. This information can then be saved, e.g., as a mapping that assigns IO adapters to specific locations (here, MCM's) so that future jobs or tasks can access the information to make choices in selecting which IO adapter to use for best performance.
-
FIG. 2 next illustrates an example HPC system or environment 80 including a plurality of MCM's 82, each with a plurality of PE's 84, and interconnected by a high speed MCM bus 86, which will be used hereinafter to further illustrate the herein-described empirical approach. Each MCM 82 is also coupled to an associated IO adapter 88 (here a network adapter), which is in turn coupled to a network switch 90. For the purposes of this example, two MCM's (MCM 0 and MCM 1) and two network adapters (adapter 0 and adapter 1) are shown, with each MCM including eight PE's (PE 0 to PE 7), though it will be appreciated that any number of each of these components may be utilized in other embodiments. - Also illustrated in
FIG. 2 are two example tasks 92, task 0 and task 1. Assume that task 0 is bound to PE 0 on MCM 0, is allocated ports on both adapters, and that task 0's allocated memory 94 is also resident in MCM 0. If task 0 is to send some data out to switch 90, it has been found that optimal performance is generally achieved when task 0 uses adapter 0 to send out data, instead of adapter 1, as the latter scenario would take a longer path to transfer task 0's data from MCM 0 to MCM 1 over MCM bus 86, and then out to adapter 1. It has been found, for example, that in a Power7-based HPC environment, the performance gain from selecting the closest adapter, i.e., the adapter that is on the same MCM as the MCM of a task's bound PE's, is about 5-10 percent, which may be highly beneficial for certain applications. - In other configurations, however,
adapter 0 could be plugged into the PCIe slot attached to MCM 1, and adapter 1 could be plugged into the PCIe slot attached to MCM 0. Thus, if the information as to which PCIe slot each adapter is plugged into is not available from the component firmware, as is generally the case in a Power7-based HPC environment, task 0 is not aware of which adapter is the closest one to use to send out data. Thus, if task 0 picks the adapter plugged into the PCIe slot attached to the other MCM, performance would generally degrade with the longer data transfer path. - However, by utilizing the empirical approach disclosed herein to map adapters to MCM's, this mapping information may be saved so that future jobs/tasks can retrieve the information and make appropriate choices in selecting which adapter to use for best performance. In addition, in many embodiments this discovery process may be run once at the completion of system configuration, such that runtime empirical testing is not required for a particular job or task.
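One way the discovered mapping might be persisted after such a one-time discovery run, and later retrieved by future jobs or tasks, is sketched below; the file location, JSON layout, and function names are illustrative assumptions, not a format prescribed by this disclosure.

```python
import json
from pathlib import Path

# Hypothetical location for the mapping produced by the one-time discovery run.
MAPPING_FILE = Path("/etc/hpc/adapter_mcm_map.json")

def save_adapter_mapping(adapter_to_mcm: dict) -> None:
    """Persist a speculated adapter-to-MCM mapping, e.g. {0: 0, 1: 1}."""
    MAPPING_FILE.write_text(json.dumps({str(a): m for a, m in adapter_to_mcm.items()}))

def load_adapter_mapping() -> dict:
    """Retrieve the saved mapping so a later job can pick the closest adapter."""
    data = json.loads(MAPPING_FILE.read_text())
    return {int(adapter): mcm for adapter, mcm in data.items()}
```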
-
FIG. 3 next illustrates a routine 100 for initializing a task in the HPC environment of FIG. 1. A task may be initialized, for example, in connection with initiating a parallel job in the HPC environment, whereby one or more tasks are instantiated throughout the HPC environment on behalf of the job. - In some embodiments, for example, a communication stack, e.g., the IBM parallel communication stack available from International Business Machines Corporation, may be configured to select one or more adapters for a task in connection with initialization of the task. As shown in
FIG. 3, for example, startup of a task may be performed (block 102), performing various task startup operations that will be apparent to one of ordinary skill in the art having the benefit of the instant disclosure. As illustrated in blocks - Next, in
block 108, the location(s) of the bound PE's are determined, e.g., using a query to a communications library supported by the parallel communication stack. In this embodiment, the location refers to the MCM upon which a particular PE resides, although in other embodiments the location may be defined at a different level in the hierarchy of distributed computing components. - Next, in
block 110, the location(s) of the allocated adapters are determined. As noted above, however, in some HPC environments, e.g., HPC environments where preconfigured adapter affinity information is unsupported, these locations generally cannot be ascertained via retrieval of preconfigured location data from a communications library. Thus, as will be explained in greater detail below, the location(s) of the allocated adapters may be determined using an empirical approach. - Next, in
block 112, the communications library searches, e.g., by comparing the locations for the bound processing elements and the allocated adapters, for a location that is common to both an allocated adapter and a bound processing element. If such a location is found, block 114 passes control to block 116 to select an allocated adapter at the common location as the primary adapter for the task to send/receive data. If such a location is not found, however, block 114 passes control to block 118 to select another allocated adapter, e.g., at a next closest location to a bound processing element, as the primary adapter. Upon completion of block 116 or block 118, routine 100 is complete. Routine 100 therefore provides adapter affinity information to a parallel task and generally improves performance in the HPC environment of FIG. 1 by reducing the distance data will be moved from its buffer to its communication port and by using a faster integrated bus on the MCM. - It should be noted that the steps described in routine 100 may be automated programmatically, e.g., with scripting, or in the alternative, may incorporate some administrator involvement. A script may also generate some pattern for an environment variable that assists in identifying mappings more easily for HPC environments with more complex configurations such as 4 MCM's per node, 2 adapters per MCM, etc.
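A minimal sketch of this selection step is shown below, assuming the MCM locations of the bound processing elements and of the allocated adapters have already been resolved (for example, from the speculative mapping described below); the function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional

def select_primary_adapter(pe_mcms: List[int],
                           adapter_mcms: Dict[int, int]) -> Optional[int]:
    """Pick an allocated adapter whose MCM is common with a bound PE's MCM.

    pe_mcms      -- MCM index of each processing element the task is bound to
    adapter_mcms -- MCM index of each allocated adapter (adapter ID -> MCM)
    """
    bound_locations = set(pe_mcms)
    for adapter, mcm in adapter_mcms.items():
        if mcm in bound_locations:
            return adapter  # common location found: use this adapter as the primary
    # No common location: fall back to any allocated adapter (a stand-in for the
    # "next closest" choice made by the real affinity flow).
    return next(iter(adapter_mcms), None)

# Example from FIG. 2: a task bound to PE 0 on MCM 0, allocated adapters 0 and 1.
print(select_primary_adapter([0], {0: 0, 1: 1}))  # -> 0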
- Now turning to the herein-described empirical approach, rather than using preconfigured adapter affinity information, speculative adapter affinity information, derived via empirical testing, is used in connection with adapter affinity performance. In some embodiments, an environment variable may be configured to speculate which IO adapter is connected to which MCM, such that a communications library may access the environment variable and select an IO adapter based on the setting. If a speculated mapping does provide a communication performance better than the others, then it may be assumed that the IO adapter is in fact plugged into a certain MCM. This speculative adapter affinity information may then be stored or saved and used as input for later jobs so that optimal performance using affinity may be achieved.
- In the illustrated embodiment, for example, an environment variable, ADAPTER_MCM_MAP=
n_0,n_1,n_2 . . . , may be used to define a physical mapping of which adapter is plugged into which MCM. For example, on an HPC system such as illustrated in FIG. 2, with two MCM's (MCM 0 and MCM 1) and two adapters (adapter 0 and adapter 1), each plugged into one of the MCM's, an assumption may be made as to which adapter is plugged into which MCM using this environment variable. For example, ADAPTER_MCM_MAP=0,1 may be used to specify that adapter 0 (adp_0) is on MCM 0 (mcm_0) and adapter 1 (adp_1) is on MCM 1 (mcm_1). - In some embodiments of the invention, for example, benchmark tests may be run using different mappings to obtain comparative performance results from which it may be determined which, among the different mappings, yields optimum performance. From these comparative performance results, therefore, speculative locations may be determined for each adapter in an HPC environment. In some embodiments, processing element bindings are kept constant throughout benchmark testing so that tasks are executed on the same processing elements during each benchmark test, so that comparable test results, e.g., bandwidth/latency performance, may be collected for each mapping. After the data is collected, the data may be analyzed (e.g., using a script), and a pattern generally emerges showing that a certain mapping produces the best bandwidth/latency performance, thereby empirically determining the likely locations of the adapters.
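For illustration, a communications library or helper script might read this environment variable into an adapter-to-MCM dictionary roughly as follows; the helper name and default value are assumptions that simply mirror the ADAPTER_MCM_MAP convention described above.

```python
import os

def parse_adapter_mcm_map(default: str = "0,1") -> dict:
    """Turn ADAPTER_MCM_MAP=n_0,n_1,... into {adapter_index: mcm_index}."""
    raw = os.environ.get("ADAPTER_MCM_MAP", default)
    return {adapter: int(mcm) for adapter, mcm in enumerate(raw.split(","))}

# With ADAPTER_MCM_MAP=0,1: {0: 0, 1: 1}, i.e. adapter 0 speculated on MCM 0,
# adapter 1 speculated on MCM 1.
print(parse_adapter_mcm_map())
```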
- Now turning to
FIG. 4, an example generate mapping routine 120 is illustrated for empirically determining adapter affinity information for one or more adapters in a manner consistent with the invention. As shown in block 122, routine 120 may initially generate a plurality of candidate mappings (e.g., in the form of candidate environment variables as described above) representing potential mappings of various adapters to various locations. For the purposes of simplifying the explanation, an HPC system similar to that illustrated in FIG. 2, with two adapters (adapter 0 and adapter 1) and two locations represented by MCM 0 and MCM 1, will be assumed to be the environment in which routine 120 is executed, although it will be appreciated that routine 120 is not limited to use in such an environment. Thus, the candidate mappings may include, for example, adapter 0 mapped to MCM 0 and adapter 1 mapped to MCM 1, adapter 0 mapped to MCM 1 and adapter 1 mapped to MCM 0, both adapters mapped to MCM 0, and both adapters mapped to MCM 1. - For each candidate mapping, it is desirable to keep the binding of tasks to processing elements constant such that comparable test results may be obtained. Thus, as illustrated by
block 124, processing element bindings are fixed for first and second tasks (tasks 0 and 1) such that the tasks will run on the same locations (e.g., MCM's) during each benchmark test. - Next, block 126 initiates a loop to compare the performance of different mappings in each of two directions: first, from
task 0 to task 1, and second, from task 1 to task 0. For each direction, block 128 selects the appropriate test direction (e.g., from task 0 to task 1 or from task 1 to task 0). Next, block 130 initiates a FOR loop to test each candidate mapping. - For each such mapping, block 132 binds the first and second tasks (
task 0 and task 1) to the fixed PE bindings (e.g., task 0 to MCM 0 and task 1 to MCM 1). Next, block 134 exports the candidate mapping to essentially activate the candidate mapping in the HPC environment. For example, one candidate mapping may be ADAPTER_MCM_MAP=0,1, representing adapter 0 being plugged into MCM 0, and adapter 1 plugged into MCM 1. - Next, in
block 136, a benchmark test is run to send data between the tasks in the selected test direction (e.g., from task 0 to task 1). For example, in a Power7-based HPC environment, a Parallel Active Message Interface (PAMI) library test may run through the affinity flow described above in connection with FIG. 3, read the input from the activated mapping, and pick out an adapter as the affinity adapter (e.g., for the ADAPTER_MCM_MAP=0,1 mapping, the library may select adapter 0 as the affinity adapter for task 0 since MCM 0 is the common MCM for task 0's PE and adapter). In this environment, a benchmark test such as a lightweight bandwidth/latency test optimized for PAMI, or any other combination of suitable benchmark tests, may be used, and as a result of testing, bandwidth and latency performance from task 0 may be obtained. The result of the test is saved in block 138. - Control next returns to block 130 to repeat blocks 132-138 for the other candidate mappings, e.g., ADAPTER_MCM_MAP=1,0, 0,0 or 1,1. Using each candidate mapping, the PAMI library will pick out a different adapter as the affinity adapter for each task; however, the performance result will generally vary with each candidate mapping, as the data transfer paths will be different for each mapping.
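A rough, script-level sketch of this measurement loop (blocks 122 through 140) is shown below. It assumes a hypothetical pami_bw_test launcher that starts the two pinned tasks and prints a bandwidth figure; the command name, its output format, and the result handling are illustrative assumptions rather than the actual PAMI benchmark harness.

```python
import itertools
import os
import subprocess

ADAPTERS = [0, 1]
MCMS = [0, 1]

def run_benchmark(direction: str, env: dict) -> float:
    """Run the pinned two-task bandwidth test and return MB/s (assumed tool/output)."""
    out = subprocess.run(["./pami_bw_test", "--direction", direction],
                         env=env, capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

results = {}  # (direction, candidate mapping) -> measured bandwidth
for direction in ("task0_to_task1", "task1_to_task0"):
    # The launcher is assumed to keep PE bindings fixed: task 0 on MCM 0, task 1 on MCM 1.
    for mapping in itertools.product(MCMS, repeat=len(ADAPTERS)):
        env = dict(os.environ)
        env["ADAPTER_MCM_MAP"] = ",".join(str(m) for m in mapping)  # e.g. "0,1"
        results[(direction, mapping)] = run_benchmark(direction, env)

# A mapping that wins in both directions is taken as the speculated physical layout.
for direction in ("task0_to_task1", "task1_to_task0"):
    best = max((m for d, m in results if d == direction),
               key=lambda m: results[(direction, m)])
    print(direction, "best candidate mapping:", best)
```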
- Once all candidate mappings have been tested, block 130 passes control to block 140 to analyze the saved results to speculate, based upon the comparative performance, into which
MCM adapter 0 is plugged. Control then passes to block 126 to determine if both directions have been tested. After testing the first direction, therefore, block 126 passes control to block 128 to essentially repeat blocks 128-140 for the second test direction (where task 1 sends data to task 0). - Once both directions have been tested, block 126 passes control to block 142 to generate and save an optimal mapping based upon the speculated adapter locations so that tasks in future jobs can access and use the optimal mapping in the affinity flow corresponding to
FIG. 3 , as described above. - Thus, for example, in the HPC environment illustrated in
FIG. 2, execution of the generate mapping routine of FIG. 4, with task 0 executing in MCM 0 and task 1 executing in MCM 1, would generally speculate that adapter 0 is plugged into MCM 0 and adapter 1 is plugged into MCM 1. Consequently, during the task initialization of FIG. 3, for a task bound to a processing element on MCM 0, adapter 0 would generally be selected as the primary adapter for the task in block 116 based upon a speculated location of adapter 0 on MCM 0, as determined in block 110. - It will be appreciated that routine 120 may be adapted to determine speculative adapter affinity information for various types of HPC environments, e.g., where locations are defined at levels other than the MCM level, where multiple IO adapters may be coupled to a particular location, where other factors impact the relative performance of an IO adapter based upon its location, where more than two tasks at more than two locations are tested, for other types of IO adapters, etc. In addition, routine 120 may perform alternate and/or additional performance tests for each candidate mapping in other embodiments.
Routine 120 may also be implemented in a number of manners consistent with the invention, e.g., as one or more scripts, within a library or other software component, as a purely automated operation, as a semi-automated operation incorporating user input, or in other manners that will be apparent to one of ordinary skill in the art having the benefit of the instant disclosure. - Various modifications may be made to the illustrated embodiments consistent with the invention. Therefore, the invention lies in the claims hereinafter appended.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/530,095 US9606837B2 (en) | 2014-07-29 | 2014-10-31 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/445,546 US9495217B2 (en) | 2014-07-29 | 2014-07-29 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
US14/530,095 US9606837B2 (en) | 2014-07-29 | 2014-10-31 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/445,546 Continuation US9495217B2 (en) | 2014-07-29 | 2014-07-29 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
Publications (2)
Publication Number | Publication Date |
---|---|
US20160034313A1 true US20160034313A1 (en) | 2016-02-04 |
US9606837B2 US9606837B2 (en) | 2017-03-28 |
Family
ID=55180123
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/445,546 Expired - Fee Related US9495217B2 (en) | 2014-07-29 | 2014-07-29 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
US14/530,095 Expired - Fee Related US9606837B2 (en) | 2014-07-29 | 2014-10-31 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/445,546 Expired - Fee Related US9495217B2 (en) | 2014-07-29 | 2014-07-29 | Empirical determination of adapter affinity in high performance computing (HPC) environment |
Country Status (1)
Country | Link |
---|---|
US (2) | US9495217B2 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2960787B1 (en) * | 2014-06-27 | 2016-09-21 | Fujitsu Limited | A method of executing an application on a computer system, a resource manager and a high performance computer system |
US9495217B2 (en) | 2014-07-29 | 2016-11-15 | International Business Machines Corporation | Empirical determination of adapter affinity in high performance computing (HPC) environment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6728258B1 (en) * | 1995-11-15 | 2004-04-27 | Hitachi, Ltd. | Multi-processor system and its network |
US20080141251A1 (en) * | 2006-12-08 | 2008-06-12 | Barry Bradley Arndt | Binding processes in a non-uniform memory access system |
US20080189433A1 (en) * | 2007-02-02 | 2008-08-07 | Nelson Randall S | Methods and Apparatus for Assigning a Physical Adapter to a Virtual Adapter |
US20100275213A1 (en) * | 2009-04-28 | 2010-10-28 | Ryuji Sakai | Information processing apparatus, parallel process optimization method |
US20110138396A1 (en) * | 2009-11-30 | 2011-06-09 | International Business Machines Corporation | Method and system for data distribution in high performance computing cluster |
US20110154302A1 (en) * | 2009-12-21 | 2011-06-23 | Soeren Balko | Adding services to application platform via extension |
US8260925B2 (en) * | 2008-11-07 | 2012-09-04 | International Business Machines Corporation | Finding workable virtual I/O mappings for HMC mobile partitions |
US20120311299A1 (en) * | 2001-02-24 | 2012-12-06 | International Business Machines Corporation | Novel massively parallel supercomputer |
US20140026111A1 (en) * | 2011-04-11 | 2014-01-23 | Gregory Michael Stitt | Elastic computing |
US20150248312A1 (en) * | 2014-02-28 | 2015-09-03 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Performance-aware job scheduling under power constraints |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6295575B1 (en) * | 1998-06-29 | 2001-09-25 | Emc Corporation | Configuring vectors of logical storage units for data storage partitioning and sharing |
US6609131B1 (en) | 1999-09-27 | 2003-08-19 | Oracle International Corporation | Parallel partition-wise joins |
WO2003048961A1 (en) * | 2001-12-04 | 2003-06-12 | Powerllel Corporation | Parallel computing system, method and architecture |
US7386739B2 (en) * | 2005-05-03 | 2008-06-10 | International Business Machines Corporation | Scheduling processor voltages and frequencies based on performance prediction and power constraints |
US20070073993A1 (en) | 2005-09-29 | 2007-03-29 | International Business Machines Corporation | Memory allocation in a multi-node computer |
US9430297B2 (en) * | 2008-12-15 | 2016-08-30 | International Business Machines Corporation | Load balancing of adapters on a multi-adapter node |
US8589941B2 (en) * | 2010-04-23 | 2013-11-19 | International Business Machines Corporation | Resource affinity via dynamic reconfiguration for multi-queue network adapters |
US8966457B2 (en) * | 2011-11-15 | 2015-02-24 | Global Supercomputing Corporation | Method and system for converting a single-threaded software program into an application-specific supercomputer |
US9495217B2 (en) * | 2014-07-29 | 2016-11-15 | International Business Machines Corporation | Empirical determination of adapter affinity in high performance computing (HPC) environment |
- 2014-07-29 US US14/445,546 patent/US9495217B2/en not_active Expired - Fee Related
- 2014-10-31 US US14/530,095 patent/US9606837B2/en not_active Expired - Fee Related
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6728258B1 (en) * | 1995-11-15 | 2004-04-27 | Hitachi, Ltd. | Multi-processor system and its network |
US20120311299A1 (en) * | 2001-02-24 | 2012-12-06 | International Business Machines Corporation | Novel massively parallel supercomputer |
US20080141251A1 (en) * | 2006-12-08 | 2008-06-12 | Barry Bradley Arndt | Binding processes in a non-uniform memory access system |
US20080189433A1 (en) * | 2007-02-02 | 2008-08-07 | Nelson Randall S | Methods and Apparatus for Assigning a Physical Adapter to a Virtual Adapter |
US8683022B2 (en) * | 2007-02-02 | 2014-03-25 | International Business Machines Corporation | Methods and apparatus for assigning a physical adapter to a virtual adapter |
US8260925B2 (en) * | 2008-11-07 | 2012-09-04 | International Business Machines Corporation | Finding workable virtual I/O mappings for HMC mobile partitions |
US20100275213A1 (en) * | 2009-04-28 | 2010-10-28 | Ryuji Sakai | Information processing apparatus, parallel process optimization method |
US20110138396A1 (en) * | 2009-11-30 | 2011-06-09 | International Business Machines Corporation | Method and system for data distribution in high performance computing cluster |
US20110154302A1 (en) * | 2009-12-21 | 2011-06-23 | Soeren Balko | Adding services to application platform via extension |
US20140026111A1 (en) * | 2011-04-11 | 2014-01-23 | Gregory Michael Stitt | Elastic computing |
US20150248312A1 (en) * | 2014-02-28 | 2015-09-03 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Performance-aware job scheduling under power constraints |
Non-Patent Citations (12)
Title |
---|
Douglas C. Schmid et al.; A High Performance End System Architecture for Real Time CORBA; 1997 IEEE; pp. 72-77; <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=565659> * |
Gang Cheng et al.; An Interactive Remote Visualization Environment for an Electromagnetic Scattering Simulation on a High Performance Computing System; 1993 ACM; pp. 317-326; <http://dl.acm.org/citation.cfm?id=169743> * |
Hemant Kanakia; The VMP Network Adapter Board (NAB) High-Performance Network Communication for Multiprocessors; 1988 ACM; pp. 175-187; <http://dl.acm.org/citation.cfm?id=52343> * |
John R. Wernsing; Elastic Computing A Framework for Transparent, Portable, and Adaptive Multi-core Heterogeneous Computing; 2010 ACM; pp. 115-124; <http://dl.acm.org/citation.cfm?id=1755906> * |
L. A. DRUMMOND et al.; An Overview of the Advanced CompuTational Software (ACTS) Collection; 2005 ACM; pp. 282-301; <http://dl.acm.org/citation.cfm?id=1089016&CFID=634144449&CFTOKEN=41028965> * |
Li-jie Jin et al.; From Metacomputing to Metabusiness Processing; 2000 IEEE; pp. 99-108; <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=889010> * |
Luiz M. R. Gadelha Jr; Exploring Provenance in High Performance Scientific Computing; 2011 ACM; pp. 17-20; <http://dl.acm.org/citation.cfm?id=2125643&CFID=634144449&CFTOKEN=41028965> * |
Marc Snir; The Future of Supercomputing; 2014 ACM; pp. 261-262; <http://dl.acm.org/citation.cfm?id=2616585&CFID=634144449&CFTOKEN=41028965> * |
R. Rajamony et al.; PERCS TheIBM POWER7-IH high-performance computing system; 2011 IEEE; 12 pages; <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5739087> * |
Richard M. Yoo et al.; Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing; 2013 ACM; 11 pages; <http://dl.acm.org/citation.cfm?id=2503232&CFID=634144449&CFTOKEN=41028965> * |
Songqing Yue; Program Transformation Techniques Applied to Languages Used in High Performance Computing; 2013 ACM; pp. 49-51; <http://dl.acm.org/citation.cfm?id=2508081&CFID=634144449&CFTOKEN=41028965> * |
X. Sharon Hu et al.; Hardware Software Co-Design for High Performance Computing Challenges and Opportunities; 2010 ACM; pp. 63-64; <http://dl.acm.org/citation.cfm?id=1878975&CFID=634144449&CFTOKEN=41028965> * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180276814A1 (en) * | 2017-03-24 | 2018-09-27 | Curadel, LLC | Tissue identification by an imaging system using color information |
CN109344043A (en) * | 2018-09-26 | 2019-02-15 | 郑州云海信息技术有限公司 | A kind of method for analyzing performance and relevant apparatus |
US11658882B1 (en) * | 2020-01-21 | 2023-05-23 | Vmware, Inc. | Algorithm-based automatic presentation of a hierarchical graphical representation of a computer network structure |
Also Published As
Publication number | Publication date |
---|---|
US9606837B2 (en) | 2017-03-28 |
US20160034312A1 (en) | 2016-02-04 |
US9495217B2 (en) | 2016-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10514960B2 (en) | Iterative rebalancing of virtual resources among VMs to allocate a second resource capacity by migrating to servers based on resource allocations and priorities of VMs | |
US10379883B2 (en) | Simulation of high performance computing (HPC) application environment using virtual nodes | |
US10977086B2 (en) | Workload placement and balancing within a containerized infrastructure | |
US9606837B2 (en) | Empirical determination of adapter affinity in high performance computing (HPC) environment | |
US10025503B2 (en) | Autonomous dynamic optimization of platform resources | |
US9413819B1 (en) | Operating system interface implementation using network-accessible services | |
US10740147B2 (en) | Merging connection pools to form a logical pool of connections during a preset period of time thereby more efficiently utilizing connections in connection pools | |
EP2724244A2 (en) | Native cloud computing via network segmentation | |
US9471352B1 (en) | Capability based placement | |
US11886898B2 (en) | GPU-remoting latency aware virtual machine migration | |
US9563451B2 (en) | Allocating hypervisor resources | |
US20150269073A1 (en) | Compiler-generated memory mapping hints | |
US20170161042A1 (en) | Deployment of processing components of computing infrastructure using annotated command objects | |
US9612843B1 (en) | Heterogeneous core microarchitecture | |
Emu et al. | Designing a new scalable load test system for distributed environment | |
US9176910B2 (en) | Sending a next request to a resource before a completion interrupt for a previous request | |
US10387218B2 (en) | Lock profiling tool to identify code bottlenecks in a storage controller | |
US20230086195A1 (en) | Efficient and extensive function groups with multi-instance function support for cloud based processing | |
Farshin et al. | Scheduling-A Secret Sauce For Resource Disaggregation | |
de Lacerda Ruivo et al. | Efficient High-Performance Computing with Infiniband Hardware Virtualization | |
Conley et al. | Sorting 100 TB on Google Compute Engine | |
Gong | Analysis on a Cluster Server Virtualization Technology Architecture and Its Performance | |
Dao et al. | Improving Hadoop MapReduce Performance on the FX10 supercomputer with JVM Reuse | |
Cooperman | DMTCP for Checkpoint-Restart: its Past, Present and Future |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHEN, WEN C.;JEA, TAAI-YANG;LEPERA, WILLIAM P.;AND OTHERS;SIGNING DATES FROM 20140721 TO 20140728;REEL/FRAME:034118/0721 |
|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE SECOND ASSIGNOR'S NAME PREVIOUSLY RECORDED AT REEL: 034118 FRAME: 0721. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNORS:CHEN, WEN C.;JEA, TSAI-YANG;LEPERA, WILLIAM P.;AND OTHERS;SIGNING DATES FROM 20140721 TO 20140728;REEL/FRAME:038538/0011 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20210328 |