WO2012036898A1 - Numa i/o framework - Google Patents

Numa i/o framework Download PDF

Info

Publication number
WO2012036898A1
Authority
WO
WIPO (PCT)
Prior art keywords
numa
node
framework
numa node
nodes
Prior art date
Application number
PCT/US2011/049852
Other languages
English (en)
French (fr)
Inventor
Nicolas G. Droux
Jonathan Chew
Rajagopal Kunhappan
Original Assignee
Oracle International Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corporation filed Critical Oracle International Corporation
Priority to EP11755204.2A priority Critical patent/EP2616934B1/en
Priority to CN201180052399.2A priority patent/CN103201722B/zh
Publication of WO2012036898A1 publication Critical patent/WO2012036898A1/en

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38: Information transfer, e.g. on bus
    • G06F13/40: Bus structure
    • G06F13/4004: Coupling between buses
    • G06F13/4022: Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the load

Definitions

  • Some modern computing system architectures utilize physically and conceptually separated nodes to leverage the speed of computing hardware.
  • input/output devices may be located in various physical locations on the computer. Each input/output device may be used by different applications and processes on the separate nodes. Kernel elements executing on such architectures may be responsible for facilitating communication between an input/output device and an application which is physically remote from that device.
  • the invention in general, in one aspect, relates to a non-transitory computer readable medium that includes software instructions, which when executed by a processor perform a method.
  • the method includes an input/output (I/O) subsystem receiving a request to use an I/O device from a process, determining a first resource to service the request, generating a first I/O object corresponding to the first resource, wherein the first I/O object is unbound, and sending the first I/O object to a Non-Uniform Memory Access (NUMA) I/O Framework.
  • the method further includes the NUMA I/O Framework selecting a first NUMA node of a plurality of NUMA nodes, to which to bind the first I/O object and binding the first I/O object to the first NUMA node.
  • the method further includes servicing the request by processing, on the first NUMA node, the first resource corresponding to the first I/O object.
  • the invention, in general, in one aspect, relates to a system that includes a plurality of Non-Uniform Memory Access (NUMA) nodes.
  • the NUMA nodes include a first NUMA node comprising a first processor and a first memory, and a second NUMA node comprising a second processor and a second memory.
  • the system further includes an input/output (I/O) device group comprising an I/O device.
  • the system also includes an I/O Subsystem executing on at least one of the NUMA nodes and configured to receive a request to use the I/O device from a process executing on the first NUMA node, determine a first resource necessary to service the request, and generate a first I/O object corresponding to the first resource, wherein the first I/O object is unbound.
  • the system further includes a NUMA I/O Framework executing on at least one of the plurality of NUMA nodes and configured to receive the first I/O object from the I/O Subsystem, select the second NUMA node, and bind the first I/O object to the second NUMA node.
  • the request is serviced by processing, on the second NUMA node, the first resource corresponding to the first I/O object.
  • the invention in general, in one aspect, relates to a method for binding input/output (I/O) objects to nodes.
  • the method includes a Network Media Access Connection (MAC) Layer receiving a request to open a network connection from a process, determining a thread to service the request, generating a first I/O object corresponding to the thread, wherein the first I/O object is unbound, and sending the first I/O object to a Non-Uniform Memory Access (NUMA) I/O Framework.
  • the Network MAC Layer is associated with a physical network interface card (NIC).
  • the method further includes the NUMA I/O Framework selecting a first NUMA node to which to bind the first I/O object, and binding the first I/O object to the first NUMA node.
  • the method further includes servicing the request by executing, on the first NUMA node, the thread corresponding to the first I/O object.
  • FIG. 1 shows a system in accordance with one or more embodiments of the invention.
  • FIG. 2 shows a NUMA node in accordance with one or more embodiments of the invention.
  • FIG. 3 shows an I/O Device Group in accordance with one or more embodiments of the invention.
  • FIG. 4 shows a system in accordance with one or more embodiments of the invention.
  • FIG. 5 shows an I/O Topology Module in accordance with one or more embodiments of the invention.
  • FIG. 6 shows a Locality Group Module in accordance with one or more embodiments of the invention.
  • FIG. 7 shows a Load Balancing Module in accordance with one or more embodiments of the invention.
  • FIG. 8 shows an I/O Object Group in accordance with one or more embodiments of the invention.
  • FIG. 9 shows a flowchart in accordance with one or more embodiments of the invention.
  • FIG. 10 shows a flowchart in accordance with one or more embodiments of the invention.
  • FIG. 11 shows a flowchart in accordance with one or more embodiments of the invention.
  • FIG. 12A shows an example system in accordance with one or more embodiments of the invention.
  • FIG. 12B shows an example timeline in accordance with one or more embodiments of the invention.
  • FIG. 13 shows a system in accordance with one or more embodiments of the invention.
  • embodiments of the invention relate to a framework for managing input/output (I/O) resources on a system with a non-uniform memory access (NUMA) architecture. More specifically, embodiments of the invention relate to a method and system for creating an abstraction layer between nodes on a NUMA system, and I/O resources connected to the system.
  • FIG. 1 shows a system in accordance with one embodiment of the invention.
  • the system includes Node A (100A), Node B (100B), Node C (100C), and Node N (100N).
  • Each Node (Node A (100A), Node B (100B), Node C (100C), and Node N (100N)) is operatively connected to one or more other Nodes via an interconnect (IC) (IC A (102A), IC B (102B), IC C (102C), IC N (102N)).
  • Each Node (Node A (100A), Node B (100B), Node C (100C), and Node N (100N)) is also operatively connected to one or more I/O Device Groups (I/O Device Group A (104A), I/O Device Group D (104D), I/O Device Group C (104C), I/O Device Group B (104B), I/O Device Group E (104E), I/O Device Group N (104N)) (see FIG. 3).
  • the system further includes I/O Subsystems (106) and NUMA I/O Framework (108).
  • the system architecture depicted in FIG. 1 may operate as a system with NUMA architecture.
  • the ICs may be implemented as a computer bus or data link capable of transferring data between nodes on a NUMA architecture system.
  • I/O Subsystems (106) provide an abstraction layer between system processes and the various system I/O functions. Specifically, I/O Subsystems (106) may exercise an amount of control over how the software entities utilizing the framework communicate with each other, and may include mechanisms to further other system goals (e.g., power management, consumer priority, etc.). Examples of I/O Subsystems (e.g., I/O Subsystems (106)) include, but are not limited to, a storage stack, InfiniBand ULP (InfiniBand is a registered trademark of the InfiniBand Trade Association), and a Network MAC Layer.
  • each I/O Subsystem receives requests from other software entities to use or access its associated I/O device.
  • each I/O Subsystem includes the functionality to manage the I/O resources necessary to service the requests.
  • the I/O managed resources may include, for example, threads, interrupts, and software receive rings.
  • each of the I/O Subsystems (106) may manage its associated resources by initializing an I/O Object corresponding to the managed resource (see FIG. 8). Further, the I/O resources managed by one or more I/O Subsystems (106) may exist or execute on a single Node (e.g., Node A (100A)), on multiple Nodes (e.g., Node A (100A) and Node B (100B)), or on all Nodes within a single system.
  • Each of the I/O Subsystems (106) may also execute on a single Node (e.g. Node A (100A)), on multiple Nodes (e.g. Node A (100A) and Node B (100B)), or on all Nodes within a single system.
  • the I/O Subsystems (106) and the NUMA I/O Framework (108) are depicted in FIG. 1 as external to the other elements on the system for illustrative purposes.
  • the NUMA I/O Framework (108) is an abstraction layer between the I/O Subsystems (106) and the underlying NUMA architecture (e.g., the system depicted in FIG. 1).
  • the NUMA I/O Framework (108) assumes all responsibility for determining where and how I/O Objects (e.g., references to I/O resources) are processed.
  • the NUMA I/O Framework manages the physical location of the I/O resources managed by the I/O Subsystems (106).
  • the NUMA I/O Framework determines the placement of an I/O Object using information-gathering modules or policies implemented to further system goals (see FIGs. 4-7).
  • binding an I/O resource to a node may include notifying a kernel scheduler that the instructions associated with the I/O resource are to be executed on the node or nodes to which it is bound.
  • the instructions or messages originating from that I/O resource are scheduled for execution on the node to which it is bound until there is further intervention by the NUMA I/O Framework (108).
  • an I/O resource may be bound to a subset of nodes (e.g., via an I/O Object).
  • the NUMA I/O Framework (108) may provide the kernel scheduler information about the subset of nodes as part of binding the I/O resource. The kernel scheduler may then choose the node, from among the subset, on which the instructions or messages are scheduled for execution.
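The binding behavior described above can be sketched as follows. This is an illustrative Python sketch, not the patented implementation; the class and method names (MockKernelScheduler, notify_binding, schedule) and the least-loaded tie-break are assumptions.

```python
# Illustrative sketch: binding an I/O resource to a subset of NUMA nodes,
# then letting a mock "kernel scheduler" pick one node from that subset
# for each unit of work, as the passage above describes.

class MockKernelScheduler:
    """Stands in for the kernel scheduler the framework notifies."""
    def __init__(self):
        self.bindings = {}          # resource id -> tuple of candidate nodes

    def notify_binding(self, resource_id, node_subset):
        # The NUMA I/O Framework passes the subset of nodes at bind time.
        self.bindings[resource_id] = tuple(node_subset)

    def schedule(self, resource_id, node_loads):
        # Pick the least-loaded node among those the resource is bound to.
        candidates = self.bindings[resource_id]
        return min(candidates, key=lambda n: node_loads.get(n, 0))

scheduler = MockKernelScheduler()
scheduler.notify_binding("rx-interrupt-0", [1, 2])   # bound to nodes 1 and 2
chosen = scheduler.schedule("rx-interrupt-0", {0: 5, 1: 9, 2: 3})
print(chosen)  # node 2: least loaded within the bound subset
```

Note that node 0, although the least loaded overall, is never chosen: the scheduler only selects among the nodes the resource was bound to.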
  • FIG. 2 shows a node in accordance with one embodiment of the invention.
  • Node A (200A) is operatively connected to Node B (200B) and Node N (200N) via the ICs (IC A (202A), IC N (202N)).
  • Node A (200A) includes a central processing unit (CPU) (204) and Cache (206) connected to a memory (208), via a Bus (210).
  • Each of the other nodes in the system (Node B (200B), Node N (200N)) may include substantially similar elements as those depicted in Node A (200A).
  • the memory (208) includes local application memory and local kernel memory.
  • a portion of the local kernel memory may be allocated for use by system-wide software elements (e.g., I/O Subsystems, NUMA I/O Framework, etc.).
  • the memory (208) is under the control of a memory manager specific to the CPU (204) on Node A (200A), and the memory of Node B (200B) (not shown) is under the control of a memory manager specific to the CPU of Node B (200B) (not shown).
  • the above-described architecture may operate more efficiently than an architecture where all CPUs are competing for memory from a single memory manager.
  • Other embodiments of the invention may be implemented on system architectures other than those described above.
  • each node (Node A (200A), Node B (200B), Node N (200N)) may be operatively connected to one or more I/O Device Groups. As depicted in FIG. 2, Node A (200A) is operatively connected to one or more I/O Device Groups (I/O Device Group A (212A), I/O Device Group N (212N)). In one embodiment of the invention, one or more of the I/O Device Groups (e.g., I/O Device Group A (212A), I/O Device Group N (212N)) may be connected to one or more nodes via an IC.
  • a NUMA node may include a CPU (e.g., CPU (204)) but not include a memory. Alternatively, a NUMA node may include a memory (e.g., memory (208)) but not include a CPU.
  • FIG. 3 shows an I/O Device Group in accordance with one embodiment of the invention.
  • the I/O Device Group (300) includes one or more I/O devices (I/O Device A (302A), I/O Device N (302N)) operatively connected to I/O Bus (304), which is, in turn, operatively connected to I/O Bridge (306).
  • I/O Bridge (306) is operatively connected to one or more nodes (Node A (308 A), Node N (308N)) (e.g., Node A (100A) in FIG. 1).
  • the I/O devices (I/O Device A (302A), I/O Device N (302N)) refer to resources connected to the computer system which may be used by programs executing on the system for information input and/or information output. Examples of such devices may include, but are not limited to, disk drives, network interface cards, printers, Universal Serial Buses (USBs), etc. One of ordinary skill in the art will appreciate there are other I/O devices not listed here.
  • FIG. 4 shows a system in accordance with one embodiment of the invention.
  • FIG. 4 shows the interaction between software entities executing on one or more nodes (e.g., Node A (200A), Node B (200B), and Node N (200N) in FIG. 1) of a system in accordance with one embodiment of the invention.
  • the system includes the NUMA I/O Framework (400), which communicates with the I/O Subsystem (402) directly, or via the Kernel Affinity API (404).
  • the I/O Subsystem (402) facilitates communication between the Consumer (406) and the I/O Device (408) (via the Device Driver (410)).
  • the I/O Subsystem may also receive I/O Object constraints or restriction information from the Administration Tool (412).
  • the I/O resources may include I/O Devices (e.g., I/O Device (408)), processing resources (e.g., CPU (204) and memory (208) in FIG. 2), as well as other system elements which facilitate communication between a process and an I/O Device (e.g., interrupts, receive rings, listeners, etc.), and may include physical or virtual elements.
  • the I/O Subsystem (402) manages the I/O resources necessary to service requests to access the I/O Device (408) received from a Consumer (406). Such requests may include calls to open a connection to the I/O Device (408), or to otherwise access the I/O Device (408) via the appropriate I/O Subsystem. The I/O Subsystem (402) may also include the functionality to initialize or instantiate an I/O Object, and associate the I/O Object with an I/O resource.
  • the I/O Subsystem (402) may create an I/O Object which includes a reference to an I/O resource, which may then be provided to the NUMA I/O Framework (400) as part of a request to bind an I/O resource (see FIG. 8).
  • the NUMA I/O Framework (400) receives I/O Objects from the I/O Subsystem (402).
  • the I/O Objects may be received via the Kernel Affinity API (404), which provides an interface for the I/O Subsystem (402) to register I/O Objects with the NUMA I/O Framework (400).
  • I/O Objects registered with the NUMA I/O Framework (400) may include information regarding the grouping of the I/O Objects, an affinity between the I/O Objects, and any constraints associated with the I/O Objects.
  • the NUMA I/O Framework (400) uses the affinity to determine the appropriate node or nodes to which an I/O Object should be bound (e.g., nodes that are physically close to one another, or nodes that are physically close to a specified I/O Device, etc.).
  • I/O Objects are sent to the NUMA I/O Framework (400) in one or more I/O Object Groups (see FIG. 8).
  • the NUMA I/O Framework (400) binds the I/O Objects to nodes.
  • binding an I/O Object refers to assigning the tasks issued by the I/O resource referenced by the I/O Object (e.g., handling an interrupt, executing a thread) to one or more nodes on the system.
  • the NUMA I/O Framework (400) uses the information within the I/O Object (e.g., affinity), along with information from, and the functionality of, other modules on the system to accomplish the binding.
  • the Load Balancing Module (416), the Locality Group Module (418), and the I/O Topology Module (420) are discussed below with regard to FIGs. 5, 6, and 7, respectively.
  • the NUMA I/O Framework (400) may bind I/O Objects according to one or more objectives.
  • the NUMA I/O Framework (400) may bind I/O Objects in order to maximize the performance of the entire system.
  • the NUMA I/O Framework (400) may bind I/O Objects in a manner which makes the most efficient use of system resources.
  • the NUMA I/O Framework (400) may also bind I/O Objects to maximize the speed at which one or all processes are executed.
  • the NUMA I/O Framework (400) may bind I/O Objects in a manner which minimizes the distance between the I/O Devices being used, and the Node bound to the associated I/O Objects.
  • in one or more embodiments of the invention, the NUMA I/O Framework (400) is able to allocate kernel memory from any of the attached nodes (e.g., from memory (208) in Node A (200A) in FIG. 2).
  • the Load Balancing Module (416) monitors the amount of work performed by each node, and dynamically balances the work between nodes, taking into account resource management and the I/O Topology (i.e., the location of the nodes relative to one another).
  • the amount of work or rate of processing done by a system node is referred to as the node's I/O load.
  • FIG. 5 shows an I/O Topology Module in accordance with one embodiment of the invention.
  • the I/O Topology Module (500) includes one or more I/O Device Records (I/O Device Record A (502A), I/O Device Record N (502N)).
  • the I/O Topology Module (500) uses information gathered from the I/O Subsystems (e.g., I/O Subsystem (402) in FIG. 4) to create an I/O Device Record for each I/O Device on the system (e.g. , I/O Device (408) in FIG. 4).
  • Each I/O Device Record (e.g., I/O Device Record A (502A), I/O Device Record N (502N)) includes information indicating which system nodes are directly connected to the I/O Device.
  • the I/O Device Record may be created and maintained by other kernel elements on the system that are accessible by the I/O Topology Module (500). Information regarding the location of each I/O Device on the system may be referred to as the I/O Topology.
  • the I/O Topology Module (500) includes the functionality to respond to queries by the NUMA I/O Framework such that for a given I/O Device, the I/O Topology Module (500) returns the node or nodes directly connected to that I/O Device. In one embodiment of the invention, these nodes are referred to as the Preferred Nodes.
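The query interface described above can be sketched as follows. This is a hypothetical Python sketch of the I/O Topology Module's behavior; the class and method names (IOTopologyModule, add_device_record, preferred_nodes) are assumptions, not the patent's identifiers.

```python
# Illustrative sketch: each I/O Device Record remembers which nodes are
# directly connected to the device, and a query for a given device
# returns those nodes as the Preferred Nodes.

class IOTopologyModule:
    def __init__(self):
        self._records = {}     # device name -> set of directly connected nodes

    def add_device_record(self, device, direct_nodes):
        # One I/O Device Record per device on the system.
        self._records[device] = set(direct_nodes)

    def preferred_nodes(self, device):
        # For a given I/O Device, return the node(s) directly
        # connected to that device (the Preferred Nodes).
        return self._records.get(device, set())

topo = IOTopologyModule()
topo.add_device_record("nic0", [0])        # NIC attached to node 0
topo.add_device_record("disk0", [2, 3])    # device group reachable from nodes 2 and 3
print(sorted(topo.preferred_nodes("disk0")))   # [2, 3]
```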
  • FIG. 6 shows a Locality Group Module in accordance with one embodiment of the invention.
  • the Locality Group Module (600) includes one or more Locality Groups (e.g., Node A Locality Group (602A), Node N Locality Group (602N)).
  • Each Locality Group maintains information about a node on the system. This information may include the location of the node relative to the other nodes on the system (i.e., which nodes are directly adjacent to the node). Information regarding the location of each node on the system may be referred to as the NUMA Topology.
  • the distance between Nodes or I/O Devices refers to the physical distance between the two elements.
  • the distance may refer to the number of Nodes between the two elements (also referred to as hops). Further, in one embodiment of the invention, the distance between nodes may be expressed in terms of the time necessary for data to travel from one node to another (also referred to as the latency between nodes).
  • the Locality Group Module (600) includes the functionality to respond to queries by the NUMA I/O Framework such that for a given Node, the Locality Group Module (600) returns the node or nodes directly connected to that Node. In one embodiment of the invention, these nodes are referred to as the Preferred Nodes.
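The hop-count notion of distance described above can be made concrete with a small sketch. This is an illustrative Python example, not the patented implementation: it treats the per-node adjacency information (each node's directly connected neighbors) as a graph and computes the distance between two nodes as a hop count via breadth-first search.

```python
from collections import deque

# Illustrative sketch: NUMA Topology as an adjacency map, with the
# distance between two nodes expressed as a number of hops.

def hop_distance(adjacency, src, dst):
    """Number of interconnect hops between src and dst (0 if identical)."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # unreachable

# Four nodes wired as in FIG. 12A: A-B, A-C, B-D, C-D.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
print(hop_distance(adj, "A", "D"))  # 2
```

A latency-based distance, also mentioned above, would simply replace the unit hop cost with measured transfer times per interconnect.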
  • FIG. 7 shows a Load Balancing Module in accordance with one embodiment of the invention.
  • the Load Balancing Module (700) includes one or more Load Monitors (e.g., Node A Load Monitor (702A), Node N Load Monitor (702N)).
  • each Load Monitor (e.g., Node A Load Monitor (702A), Node N Load Monitor (702N)) obtains periodic measurements of specified metrics (e.g., CPU utilization, memory utilization, etc.), and uses the measurements to calculate an I/O load for the node.
  • the I/O load includes indicators reflective of trending direction of the measured metrics (e.g., increasing CPU utilization over the last 10 cycles).
  • each Load Monitor (e.g., Node A Load Monitor (702A), Node N Load Monitor (702N)) includes functionality to track metrics over time and detect patterns in the I/O load (e.g., CPU utilization is greatest on Monday afternoons between 2pm and 5pm).
  • the I/O load is also used to calculate a node I/O load capacity.
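A per-node Load Monitor of the kind described above can be sketched as follows. The sketch is illustrative: the window size, the equal weighting of CPU and memory utilization, and the trend test are all assumptions, not details from the patent.

```python
# Illustrative sketch: periodic metric samples reduced to a scalar I/O
# load plus a simple trend indicator.

class LoadMonitor:
    WINDOW = 10                     # number of samples kept per node (assumed)

    def __init__(self):
        self.samples = []           # list of (cpu_util, mem_util) tuples

    def record(self, cpu_util, mem_util):
        # Periodic measurement of the specified metrics.
        self.samples.append((cpu_util, mem_util))
        self.samples = self.samples[-self.WINDOW:]

    def io_load(self):
        # Average the two utilizations over the window (assumed weighting).
        if not self.samples:
            return 0.0
        cpu = sum(s[0] for s in self.samples) / len(self.samples)
        mem = sum(s[1] for s in self.samples) / len(self.samples)
        return (cpu + mem) / 2

    def trending_up(self):
        # True when the newest CPU sample exceeds the oldest in the window.
        return len(self.samples) >= 2 and self.samples[-1][0] > self.samples[0][0]

mon = LoadMonitor()
for cpu in (10, 20, 30):
    mon.record(cpu, 50)
print(mon.io_load(), mon.trending_up())   # 35.0 True
```

A node's I/O load capacity could then be derived by subtracting the current I/O load from a per-node maximum.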
  • FIG. 8 shows an I/O Object Group in accordance with one embodiment of the invention.
  • the I/O Object Group (800) includes one or more I/O Objects (e.g., I/O Object A (802A), I/O Object N (802N)).
  • an I/O Object is a software construct which encapsulates a reference or handle to a corresponding I/O resource.
  • Each I/O Object may also include one or more affinities with other I/O Objects, a constraint on the binding of the I/O object, and a Dedicate CPU Flag.
  • an affinity is a scalar indication of the strength of the relationship between I/O Objects (e.g. , no relationship, weak relationship, strong relationship, negative relationship, etc.).
  • the affinity between two I/O Objects defines the maximum or minimum permitted distance between the nodes to which the I/O Objects may or should be bound.
  • the affinity is specified by the I/O Subsystem managing the I/O Object.
  • the I/O Subsystem creates an affinity between I/O Objects (e.g. , I/O Object A (802A), I/O Object N (802N)) corresponding to I/O resources, which work together to perform part of an I/O operation.
  • an I/O Object corresponding to an interrupt for traffic received by a virtual network interface card may have a strong affinity to other I/O Objects corresponding to other interrupts and threads processing data on the same virtual network interface card.
  • a constraint may specify a node or group of nodes upon which an I/O Object or I/O Object Group must be bound.
  • a constraint may be used to confine an I/O Object or I/O Object Group to an approved or appropriate set of nodes.
  • Constraints may be used to isolate one I/O Object or I/O Object Group from another.
  • constraints may be used by the I/O Subsystem to enforce the separation of zones or containers on a system.
  • a Dedicate CPU Flag may indicate that the I/O Object should be bound to a node with a CPU available to dedicate to the I/O Object.
  • the Dedicate CPU Flag may be interpreted by the NUMA I/O Framework as an absolute restriction, or alternatively as a preference.
  • the Dedicate CPU Flag may include other information indicating the strength of the preference.
  • I/O Objects may be submitted to the NUMA I/O Framework as an I/O Object Group (800).
  • An I/O Object Group (800) may include affinities or constraints that apply to all I/O Objects within the I/O Object Group (800).
  • the NUMA I/O Framework may apply affinities or constraints inherent to all I/O Objects within an I/O Object Group (800).
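The I/O Object and I/O Object Group constructs described above can be sketched as plain data structures. This is an illustrative Python sketch; the field names are assumptions chosen to mirror the text (affinities, constraint, Dedicate CPU Flag), not identifiers from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of the I/O Object and I/O Object Group constructs.

@dataclass
class IOObject:
    resource: str                        # handle to the underlying I/O resource
    affinities: dict = field(default_factory=dict)   # other resource -> strength
    constraint: Optional[set] = None     # nodes the object must be bound to, if any
    dedicate_cpu: bool = False           # the Dedicate CPU Flag

@dataclass
class IOObjectGroup:
    objects: List[IOObject]
    constraint: Optional[set] = None     # applies to every object in the group

# Example from the text: an interrupt for a virtual NIC with a strong
# affinity to the thread processing data on the same virtual NIC.
rx_interrupt = IOObject("vnic0-rx-intr", dedicate_cpu=True)
rx_thread = IOObject("vnic0-rx-thread",
                     affinities={"vnic0-rx-intr": "strong"})
group = IOObjectGroup([rx_interrupt, rx_thread], constraint={0, 1})
print(len(group.objects), rx_thread.affinities["vnic0-rx-intr"])
```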
  • FIG. 9 shows a flow chart for registering a new I/O Device with a NUMA I/O Framework in accordance with one or more embodiments of the invention.
  • one or more of the steps shown in FIG. 9 may be omitted, repeated, and/or performed in a different order than that shown in FIG. 9. Accordingly, the specific arrangement of steps shown in FIG. 9 should not be construed as limiting the scope of the invention.
  • In Step 910, the I/O Topology Module detects (or is otherwise notified of) the attachment of a new I/O Device to the system.
  • In Step 912, the I/O Topology Module creates a new I/O Device Record.
  • In Step 914, the I/O Topology Module adds the new I/O Device information to the I/O Device Record.
  • In Step 916, the I/O Topology Module obtains location information for the new I/O Device from the Locality Group Module, or from other system resources (e.g., BIOS, machine description, etc.). This information may include the closest nodes to the I/O Device which are not directly connected to the I/O Device.
  • In Step 918, the I/O Topology Module updates the I/O Device Record using the location information obtained from the Locality Group Module.
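The registration steps above can be sketched as a single function. This is an illustrative Python sketch under stated assumptions: the record layout and the locality-lookup callback are hypothetical stand-ins for the Locality Group Module interface.

```python
# Illustrative sketch of the FIG. 9 flow: create a record, add device
# info, fetch location info, and update the record.

def register_io_device(device, direct_nodes, locality_lookup):
    record = {"device": device}                       # Step 912: new record
    record["direct_nodes"] = set(direct_nodes)        # Step 914: device info
    # Step 916: closest nodes that are NOT directly connected.
    record["nearby_nodes"] = locality_lookup(direct_nodes)
    return record                                     # Step 918: record updated

# Locality lookup stub: the neighbors of each directly connected node,
# excluding the directly connected nodes themselves.
adjacency = {0: {1}, 1: {0, 3}, 2: {3}, 3: {1, 2}}
lookup = lambda nodes: set().union(*(adjacency[n] for n in nodes)) - set(nodes)

rec = register_io_device("nic0", [1], lookup)
print(sorted(rec["nearby_nodes"]))   # [0, 3]
```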
  • FIG. 10 shows a flow chart for servicing a request by an I/O Subsystem in accordance with one or more embodiments of the invention.
  • one or more of the steps shown in FIG. 10 may be omitted, repeated, and/or performed in a different order than that shown in FIG. 10. Accordingly, the specific arrangement of steps shown in FIG. 10 should not be construed as limiting the scope of the invention.
  • In Step 1010, a process sends a request to the I/O Subsystem to use an I/O Device.
  • the request may be, for example, a request to create a data link associated with a network interface card. Alternatively, the request may be to gain access to a storage device in order to alter data located on that device. Other examples of incoming requests include requests from a network stack (e.g., to create a VNIC), and requests from a file system.
  • In Step 1012, the I/O Subsystem determines the resources necessary to service the request. This may include, for example, a specific number of threads and a specific number of interrupts. In one embodiment of the invention, this determination is based on the requirements of similar requests previously serviced. In one embodiment of the invention, the determined resources may change over time as usage information is analyzed.
  • an I/O Subsystem which creates a connection between a process and a physical network may be configured to create a specified number of I/O Objects for threads, and a specified number of I/O Objects for interrupts for connections of the type created.
  • the I/O Subsystem may further be configured to specify that the threads should not execute on separate nodes, because doing so may cause an unacceptable amount of slowness or data loss for the connection. The I/O Subsystem may express this requirement by specifying a strong affinity between the I/O Objects.
  • In Step 1014, the I/O Subsystem creates I/O Objects for the necessary resources.
  • In Step 1016, the I/O Subsystem sends the I/O Objects to the NUMA I/O Framework.
  • In one embodiment of the invention, the I/O Objects are created by invoking a method call of the Kernel Affinity API.
  • In Step 1018, the I/O Subsystem specifies an affinity between the I/O Objects for use by the NUMA I/O Framework.
  • In Step 1020, the NUMA I/O Framework binds the I/O Objects to a node based on a policy and the affinity. Step 1020 is explained in detail with regard to FIG. 11.
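The FIG. 10 flow above can be sketched end to end. This is an illustrative Python sketch; the per-request resource policy, the object and group shapes, and the submission callback are all assumptions made for the example.

```python
# Illustrative sketch: an I/O Subsystem receives a request, decides which
# resources (threads, interrupts) it needs, wraps each in an I/O Object,
# declares a strong affinity among them, and hands the group to a
# framework stub.

def service_request(request_kind, framework_submit):
    # Resource requirements per request kind (illustrative policy).
    needs = {"open-connection": {"threads": 2, "interrupts": 1}}
    resources = needs[request_kind]
    objects = []
    for kind, count in resources.items():
        for i in range(count):
            objects.append({"resource": f"{request_kind}-{kind}-{i}"})
    # Strong affinity between all objects created for one request, so the
    # framework keeps them together when binding (see the passage above).
    group = {"objects": objects, "affinity": "strong"}
    framework_submit(group)     # Step 1016: send to the NUMA I/O Framework
    return group

submitted = []
group = service_request("open-connection", submitted.append)
print(len(group["objects"]), submitted[0]["affinity"])   # 3 strong
```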
  • FIG. 11 shows a flow chart for binding an I/O Object by a NUMA I/O Framework in accordance with one or more embodiments of the invention.
  • one or more of the steps shown in FIG. 11 may be omitted, repeated, and/or performed in a different order than that shown in FIG. 11. Accordingly, the specific arrangement of steps shown in FIG. 11 should not be construed as limiting the scope of the invention.
  • In Step 1110, the NUMA I/O Framework receives a request to bind an I/O Object Group to a node or set of nodes.
  • In Step 1112, the NUMA I/O Framework obtains the I/O Object affinities for each I/O Object in the I/O Object Group. In one embodiment of the invention, an affinity is presumed between all I/O Objects in an I/O Object Group.
  • In Step 1114, the NUMA I/O Framework determines the I/O Object Group constraints. In one embodiment of the invention, the affinities and constraints are embedded in the received I/O Objects.
  • In Step 1116, the NUMA I/O Framework determines the Node Selection Requirements using the information about the I/O Object affinities and constraints, along with any other restrictions or indications obtained regarding the I/O Objects (including the existence of a Dedicate CPU Flag).
  • the Node Selection Requirements specify a set of conditions that a node or set of nodes must satisfy to be considered for binding the I/O Object Group. Such conditions may include a specific arrangement of nodes within a set distance from an I/O Device. In one embodiment of the invention, the conditions may include the I/O load capacity of each node.
  • In Step 1118, the NUMA I/O Framework uses the Node Selection Requirements to determine a Primary Preferred NUMA Node Set.
  • a Primary Preferred NUMA Node Set is a node or group of nodes that satisfy all of the Node Selection requirements.
  • the Node Selection Requirements may only be satisfied by more than one node. For example, if one I/O Object in an I/O Object Group has a Dedicate CPU Flag, and as such no other object in the I/O Object Group may be placed on the same node, the Node Selection Requirements would necessarily require the use of more than one node. Therefore, the node or combination of nodes which satisfy the Node Selection Requirements may be referred to as a Node Set. Similarly, a Node Set may include only a single node or a combination of nodes.
  • In Step 1120, the NUMA I/O Framework determines whether there is more than one Primary Preferred NUMA Node Set upon which the I/O Objects in the I/O Object Group may be bound. In one embodiment of the invention, there may be more than one Primary Preferred NUMA Node Set when more than one NUMA Node Set satisfies the Node Selection Requirements and each is an equivalent physical distance from the associated I/O Device.
  • if so, then in Step 1122, one of the Primary Preferred NUMA Node Sets is selected based on a selection policy.
  • a selection policy specifies that one Primary Preferred NUMA Node Set is selected at random.
  • the selection policy may further system goals independent of those used to determine the Primary Preferred NUMA Node Sets.
  • Step 1124 the NUMA I/O Framework determines whether there is one Primary Preferred NUMA Node Set. If there is one Primary Preferred NUMA Node Set, then in Step 1126, that Primary Preferred NUMA Node Set is selected. If there is no Primary Preferred NUMA Node Set, then in Step 1128, the NUMA I/O Framework determines a Secondary Preferred NUMA Node Set based on the Node Selection Requirements using the Locality Group Module. Specifically, in one embodiment of the invention, the NUMA I/O Framework queries the Locality Group Module to determine the node or set of nodes closest to the Primary Preferred NUMA Node Set. The Secondary Preferred NUMA Node Set is the node or nodes which satisfy the Node Selection Requirements, and are in the second-best position to process the I/O Objects in the I/O Object Group.
  • the system waits until one of the initially determined Primary Preferred NUMA Node Sets becomes available.
  • the NUMA I/O Framework may bind the I/O Object Group to a node set which does not satisfy all of the Node Selection Requirements. For example, if one I/O Object in the I/O Object Group includes a Dedicate CPU Flag, the NUMA I/O Framework may determine that all I/O Objects in the I/O Object Group may be bound to the same node, despite the existence of the Dedicate CPU Flag.
  • Step 1130 the Secondary Preferred NUMA Node Set is promoted to the Primary Preferred NUMA Node Set.
  • Step 1132 the I/O Object or I/O Objects in the I/O Object Group is bound to the selected Primary Preferred NUMA Node Set.
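For illustration only, Steps 1118 through 1132 may be summarized as a selection loop. The sketch below is a hypothetical rendering of that flow; the `requirements` predicate and the `locality` callback are illustrative stand-ins for the Node Selection Requirements and the Locality Group Module, respectively:

```python
import random

def select_node_set(closest_sets, requirements, locality, policy=None):
    """Sketch of Steps 1118-1132: choose a Node Set for an I/O Object Group.

    closest_sets: candidate Node Sets nearest the I/O Device;
    requirements: predicate standing in for the Node Selection Requirements;
    locality:     returns the next-closest tier of Node Sets (a stand-in
                  for querying the Locality Group Module);
    policy:       tie-breaking selection policy (random by default).
    """
    tier = closest_sets
    while tier:
        preferred = [s for s in tier if requirements(s)]
        if len(preferred) == 1:          # Steps 1124/1126: one set, take it
            return preferred[0]
        if len(preferred) > 1:           # several sets: apply the policy
            return (policy or random.choice)(preferred)
        # Steps 1128/1130: no satisfying set in this tier; the next-closest
        # tier (the Secondary Preferred Node Sets) is promoted and examined.
        tier = locality(tier)
    return None  # no Node Set satisfies the requirements; caller may wait
```

A caller that receives `None` corresponds to the case where the system either waits for a Primary Preferred NUMA Node Set to become available or relaxes the requirements, as described above.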
  • FIGs. 12A and 12B show an exemplary system and timeline in accordance with one embodiment of the invention.
  • the system includes Node A (1200A), Node B (1200B), Node C (1200C), and Node D (1200D).
  • Node A (1200A) is connected to Node B (1200B) via IC A (1202A), and to Node C (1200C) via IC B (1202B).
  • Node B (1200B) is connected to Node A (1200A) via IC A (1202A), and to Node D (1200D) via IC C (1202C).
  • Node C (1200C) is connected to Node A (1200A) via IC B (1202B), and to Node D (1200D) via IC D (1202D).
  • Node D (1200D) is connected to Node B (1200B) via IC C (1202C), and to Node C (1200C) via IC D (1202D).
  • Node A (1200A) is operatively connected to I/O Device Group A (1204A), and Node B (1200B) is operatively connected to I/O Device Group B (1204B). Additionally, Node C (1200C) and Node D (1200D) are both operatively connected to I/O Device Group C (1204C). I/O Device Group C (1204C) includes a physical network interface card (NIC) (1206).
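For illustration only, the square topology just described may be encoded as a small table. The encoding below is hypothetical (node and device names are shortened) and serves only to make the node distances concrete:

```python
# Hypothetical encoding of the FIG. 12A topology. Directly interconnected
# node pairs (IC A through IC D) are one hop apart; the two diagonal pairs
# (A-D and B-C) must cross two interconnects.
INTERCONNECTS = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
DEVICE_NODES = {"IO-A": ["A"], "IO-B": ["B"], "IO-C": ["C", "D"]}

def distance(n1, n2):
    """Hop count between two nodes in the square topology."""
    if n1 == n2:
        return 0
    if (n1, n2) in INTERCONNECTS or (n2, n1) in INTERCONNECTS:
        return 1
    return 2
```

Under this encoding, the device group containing the NIC is local to both Node C and Node D, which is why those two nodes are directly connected to it.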
  • FIG. 12B shows a timeline of an exemplary interaction between the elements shown in FIG. 12A.
  • the timeline in FIG. 12B depicts a process executing on Node A requesting that a network connection be created between the process and a receiving system.
  • the receiving system is external to the exemplary system of FIG. 12A, such that a network connection requires a physical NIC.
  • the I/O Subsystem (1208) in FIG. 12B is an I/O Subsystem responsible for establishing network connections between user-level and kernel-level processes and network destinations (e.g., a network MAC layer).
  • Step 1220 a process on Node A (1200A) sends a request to the I/O Subsystem (1208) to establish a network connection with the receiving system.
  • Step 1222 the I/O Subsystem (1208) selects NIC (1206) for use in establishing the connection, and determines the I/O resources necessary to open a connection between the process on Node A (1200A) and NIC (1206). For the purposes of this example, assume that the I/O Subsystem determines that one thread and one interrupt are the necessary I/O resources.
  • Step 1224 the I/O Subsystem (1208) creates an I/O Object for the thread, and an I/O Object for the interrupt.
  • the I/O Subsystem (1208) sends the I/O Objects, as an I/O Group, to the NUMA I/O Framework (1210), and specifies a constraint on the I/O Object Group such that no I/O Object within the I/O Object Group may be placed on Node D (1200D).
  • the I/O Subsystem (1208) also specifies the affinity between the I/O Objects such that the I/O Objects should be placed on the same node, and notifies the NUMA I/O Framework (1210) of the affinity.
  • Step 1228 the NUMA I/O Framework (1210) determines the Node Selection Requirements for the I/O Object Group.
  • the Node Selection Requirements determined by the NUMA I/O Framework (1210) detail that the selected node must be capable of executing two I/O Objects, and must not be Node D (1200D).
  • the NUMA I/O Framework (1210) queries the I/O Topology Module (1212) to determine the node or nodes closest to NIC (1206).
  • the I/O Topology Module (1212) responds (not shown) that Node C (1200C) and Node D (1200D) are directly connected to I/O Device Group C (1204C) and NIC (1206). Therefore, Node C (1200C) and Node D (1200D) are the Primary Preferred Nodes.
  • Step 1232 the NUMA I/O Framework (1210) applies the Node Selection Requirements to the Primary Preferred Nodes.
  • the NUMA I/O Framework (1210) determines that Node D (1200D) may not be selected because of the constraints on the I/O Object Group. Assume for the purposes of the example, that Node C (1200C) is incapable of executing both a thread and an interrupt. Therefore, the NUMA I/O Framework (1210) determines that no Primary Preferred NUMA Node Sets are available.
  • Step 1234 the NUMA I/O Framework (1210) queries the Locality Group Module (1214) to determine the node or nodes closest to the Primary Preferred Nodes.
  • the Locality Group Module (1214) responds (not shown) notifying the NUMA I/O Framework (1210) that Node A (1200A) and Node B (1200B) are directly connected to the Primary Preferred Nodes (Node C (1200C) and Node D (1200D)). Therefore, Node A (1200A) and Node B (1200B) are the Secondary Preferred Nodes.
  • Step 1236 the NUMA I/O Framework (1210) applies the Node Selection Requirements to the Secondary Preferred Nodes.
  • Step 1238 the NUMA I/O Framework (1210) determines that two Primary Preferred NUMA Node Sets are available. Assume that the selection policy dictates that the node closest to the calling process is selected, and that if both are equidistant from the calling process, one node is selected at random.
  • the NUMA I/O Framework (1210) applies the selection policy to the Primary Preferred Node Sets, and selects Node A (1200A) as closest to the calling process.
  • Step 1240 the NUMA I/O Framework (1210) binds the I/O Objects to Node A (1200A).
  • Step 1242 the Kernel Scheduler (1216) directs instructions associated with the I/O Objects to be processed on Node A (1200A).
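For illustration only, the outcome of this timeline may be replayed with a short, self-contained sketch. The capacities, neighbor lists, and tie-break below are hypothetical values chosen to match the narrative above:

```python
# Hypothetical replay of Steps 1228-1240. Node C can run only one I/O
# Object, Node D is excluded by constraint, so selection falls through to
# the Secondary Preferred Nodes, and Node A wins as closest to the caller.
near_nic = ["C", "D"]                       # Primary Preferred Nodes
neighbors = {"C": ["A", "D"], "D": ["B", "C"]}
capacity = {"A": 2, "B": 2, "C": 1, "D": 2}

def satisfies(node):
    # Node Selection Requirements: room for two I/O Objects, not Node D.
    return node != "D" and capacity[node] >= 2

primary = [n for n in near_nic if satisfies(n)]                 # -> []
secondary = sorted({m for n in near_nic for m in neighbors[n]} - set(near_nic))
candidates = primary or [n for n in secondary if satisfies(n)]  # -> A and B
chosen = min(candidates, key=lambda n: 0 if n == "A" else 1)    # caller on A
```

The result, `chosen == "A"`, reproduces the binding made in Step 1240.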
  • Embodiments of the invention may be implemented on virtually any type of computer implementing a NUMA architecture (1300) (or equivalent).
  • a networked computer system including two or more processors (1302), associated memory (1304), a storage device (1306), two or more I/O devices (not shown) and numerous other elements and functionalities typical of today's computers.
  • the networked computer may also include input means, such as a keyboard (1308) and a mouse (1310), and output means, such as a monitor (1312).
  • the networked computer system is connected to a local area network (LAN) or a wide area network via a network interface connection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Computer And Data Communications (AREA)
  • Stored Programmes (AREA)
  • Multi Processors (AREA)
  • Mobile Radio Communication Systems (AREA)
PCT/US2011/049852 2010-09-17 2011-08-31 Numa i/o framework WO2012036898A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP11755204.2A EP2616934B1 (en) 2010-09-17 2011-08-31 Numa i/o framework
CN201180052399.2A CN103201722B (zh) 2010-09-17 2011-08-31 将输入/输出对象绑定到节点的系统和方法

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US38412010P 2010-09-17 2010-09-17
US61/384,120 2010-09-17
US13/076,715 2011-03-31
US13/076,715 US8725913B2 (en) 2010-09-17 2011-03-31 Numa I/O framework

Publications (1)

Publication Number Publication Date
WO2012036898A1 true WO2012036898A1 (en) 2012-03-22

Family

ID=44645228

Family Applications (4)

Application Number Title Priority Date Filing Date
PCT/US2011/049852 WO2012036898A1 (en) 2010-09-17 2011-08-31 Numa i/o framework
PCT/US2011/050748 WO2012036961A1 (en) 2010-09-17 2011-09-08 Using process location to bind io resources on numa architectures
PCT/US2011/050746 WO2012036959A1 (en) 2010-09-17 2011-09-08 Dynamic balancing of io resources on numa platforms
PCT/US2011/050747 WO2012036960A1 (en) 2010-09-17 2011-09-08 Dynamic creation and destruction of io resources based on actual load and resource availability

Family Applications After (3)

Application Number Title Priority Date Filing Date
PCT/US2011/050748 WO2012036961A1 (en) 2010-09-17 2011-09-08 Using process location to bind io resources on numa architectures
PCT/US2011/050746 WO2012036959A1 (en) 2010-09-17 2011-09-08 Dynamic balancing of io resources on numa platforms
PCT/US2011/050747 WO2012036960A1 (en) 2010-09-17 2011-09-08 Dynamic creation and destruction of io resources based on actual load and resource availability

Country Status (4)

Country Link
US (4) US8725912B2 (zh)
EP (4) EP2616934B1 (zh)
CN (4) CN103201722B (zh)
WO (4) WO2012036898A1 (zh)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8522251B2 (en) * 2011-01-10 2013-08-27 International Business Machines Corporation Organizing task placement based on workload characterizations
WO2012096963A1 (en) * 2011-01-10 2012-07-19 Fiberlink Communications Corporation System and method for extending cloud services into the customer premise
US8793459B2 (en) * 2011-10-31 2014-07-29 International Business Machines Corporation Implementing feedback directed NUMA mitigation tuning
US9047417B2 (en) * 2012-10-29 2015-06-02 Intel Corporation NUMA aware network interface
US10019167B2 (en) * 2013-02-20 2018-07-10 Red Hat, Inc. Non-Uniform Memory Access (NUMA) resource assignment and re-evaluation
US10684973B2 (en) 2013-08-30 2020-06-16 Intel Corporation NUMA node peripheral switch
US9274835B2 (en) 2014-01-06 2016-03-01 International Business Machines Corporation Data shuffling in a non-uniform memory access device
US9256534B2 (en) 2014-01-06 2016-02-09 International Business Machines Corporation Data shuffling in a non-uniform memory access device
US10255091B2 (en) * 2014-09-21 2019-04-09 Vmware, Inc. Adaptive CPU NUMA scheduling
WO2016118162A1 (en) * 2015-01-23 2016-07-28 Hewlett Packard Enterprise Development Lp Non-uniform memory access aware monitoring
US11275721B2 (en) * 2015-07-17 2022-03-15 Sap Se Adaptive table placement in NUMA architectures
KR20170094911A (ko) * 2016-02-12 2017-08-22 삼성전자주식회사 반도체 장치의 동작 방법 및 반도체 시스템
US10142231B2 (en) * 2016-03-31 2018-11-27 Intel Corporation Technologies for network I/O access
CN109254933A (zh) * 2018-09-25 2019-01-22 郑州云海信息技术有限公司 一种io请求的处理方法、系统及相关组件
WO2020091836A1 (en) * 2018-10-31 2020-05-07 Middle Chart, LLC System for conducting a service call with orienteering
US11003585B2 (en) * 2019-03-07 2021-05-11 International Business Machines Corporation Determining affinity domain information based on virtual memory address
US11144226B2 (en) * 2019-04-11 2021-10-12 Samsung Electronics Co., Ltd. Intelligent path selection and load balancing
CN109918027B (zh) * 2019-05-16 2019-08-09 上海燧原科技有限公司 存储访问控制方法、装置、设备及存储介质
US11216190B2 (en) 2019-06-10 2022-01-04 Samsung Electronics Co., Ltd. Systems and methods for I/O transmissions in queue pair-based NVMeoF initiator-target system
US11240294B2 (en) 2019-08-23 2022-02-01 Samsung Electronics Co., Ltd. Systems and methods for spike detection and load balancing resource management
KR20210127565A (ko) * 2020-04-14 2021-10-22 삼성전자주식회사 가상 머신에 자원을 할당하는 방법 및 장치
WO2022056798A1 (en) * 2020-09-18 2022-03-24 Intel Corporation Improving remote traffic performance on cluster-aware processors
CN114780463A (zh) * 2022-03-01 2022-07-22 阿里巴巴(中国)有限公司 中断控制方法、设备、分布式系统及存储介质
CN116820687B (zh) * 2023-08-29 2023-12-05 银河麒麟软件(长沙)有限公司 基于kubelet的NUMA架构资源分配方法及系统

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080092138A1 (en) 2003-03-31 2008-04-17 International Business Machines Corporation Resource allocation in a numa architecture based on separate application specified resource and strength preferences for processor and memory resources

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6813769B1 (en) 1997-10-28 2004-11-02 Microsoft Corporation Server application components with control over state duration
US6026425A (en) 1996-07-30 2000-02-15 Nippon Telegraph And Telephone Corporation Non-uniform system load balance method and apparatus for updating threshold of tasks according to estimated load fluctuation
US6434656B1 (en) 1998-05-08 2002-08-13 International Business Machines Corporation Method for routing I/O data in a multiprocessor system having a non-uniform memory access architecture
WO2000019326A1 (fr) * 1998-09-29 2000-04-06 Fujitsu Limited Procede et dispositif de traitement de demandes d'acces
US6769017B1 (en) * 2000-03-13 2004-07-27 Hewlett-Packard Development Company, L.P. Apparatus for and method of memory-affinity process scheduling in CC-NUMA systems
US7159036B2 (en) 2001-12-10 2007-01-02 Mcafee, Inc. Updating data from a source computer to groups of destination computers
US7761873B2 (en) * 2002-12-03 2010-07-20 Oracle America, Inc. User-space resource management
US7296269B2 (en) 2003-04-22 2007-11-13 Lucent Technologies Inc. Balancing loads among computing nodes where no task distributor servers all nodes and at least one node is served by two or more task distributors
US20050091383A1 (en) 2003-10-14 2005-04-28 International Business Machines Corporation Efficient zero copy transfer of messages between nodes in a data processing system
EP1564638B1 (en) 2004-02-10 2008-02-20 Sap Ag A method of reassigning objects to processing units
US7343379B2 (en) * 2004-05-21 2008-03-11 Bea Systems, Inc. System and method for controls
US8135731B2 (en) 2004-12-02 2012-03-13 International Business Machines Corporation Administration of search results
US7302533B2 (en) * 2005-03-11 2007-11-27 International Business Machines Corporation System and method for optimally configuring software systems for a NUMA platform
CN100356325C (zh) * 2005-03-30 2007-12-19 中国人民解放军国防科学技术大学 大规模并行计算机系统分组并行启动方法
US8037465B2 (en) * 2005-09-30 2011-10-11 Intel Corporation Thread-data affinity optimization using compiler
US7793301B2 (en) 2005-11-18 2010-09-07 Samsung Electronics Co., Ltd. Method and system for providing efficient object-based network management
US7755778B2 (en) * 2006-03-30 2010-07-13 Xerox Corporation Print job management system
US7941805B2 (en) 2006-08-15 2011-05-10 International Business Machines Corporation Affinity dispatching load balancer with precise CPU consumption data
US8255577B2 (en) 2007-04-26 2012-08-28 Hewlett-Packard Development Company, L.P. I/O forwarding technique for multi-interrupt capable devices
US8156495B2 (en) 2008-01-17 2012-04-10 Oracle America, Inc. Scheduling threads on processors
US8286198B2 (en) 2008-06-06 2012-10-09 Apple Inc. Application programming interfaces for data parallel computing on multiple processors
US8225325B2 (en) 2008-06-06 2012-07-17 Apple Inc. Multi-dimensional thread grouping for multiple processors
JP4772854B2 (ja) 2008-12-02 2011-09-14 株式会社日立製作所 計算機システムの構成管理方法、計算機システム及び構成管理プログラム
US8438284B2 (en) 2009-11-30 2013-05-07 Red Hat, Inc. Network buffer allocations based on consumption patterns
US8346821B2 (en) 2010-05-07 2013-01-01 International Business Machines Corporation Orphan object tracking for objects having acquire-release semantics

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080092138A1 (en) 2003-03-31 2008-04-17 International Business Machines Corporation Resource allocation in a numa architecture based on separate application specified resource and strength preferences for processor and memory resources

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUNAI TRIPATHI, NICOLAS DROUX, THIRUMALAI SRINIVASAN: "Crossbow: From Hardware Virtualized NICs to Virtualized Networks", 17 August 2009 (2009-08-17), Barcelona, Spain, XP002662358, Retrieved from the Internet <URL:http://conferences.sigcomm.org/sigcomm/2009/workshops/visa/papers/p53.pdf> [retrieved on 20111027] *

Also Published As

Publication number Publication date
WO2012036959A1 (en) 2012-03-22
WO2012036961A1 (en) 2012-03-22
US8996756B2 (en) 2015-03-31
US8782657B2 (en) 2014-07-15
CN103201722B (zh) 2017-03-01
US20120072627A1 (en) 2012-03-22
US8725912B2 (en) 2014-05-13
EP2616936B1 (en) 2016-05-04
US8725913B2 (en) 2014-05-13
CN103210374A (zh) 2013-07-17
WO2012036960A1 (en) 2012-03-22
CN103201722A (zh) 2013-07-10
US20120072621A1 (en) 2012-03-22
CN103189845A (zh) 2013-07-03
EP2616936A1 (en) 2013-07-24
CN103189844B (zh) 2016-11-09
EP2616934A1 (en) 2013-07-24
US20120072622A1 (en) 2012-03-22
CN103210374B (zh) 2016-08-10
CN103189845B (zh) 2016-07-06
EP2616935B1 (en) 2016-07-20
CN103189844A (zh) 2013-07-03
EP2616937A1 (en) 2013-07-24
EP2616935A1 (en) 2013-07-24
EP2616934B1 (en) 2017-01-04
EP2616937B1 (en) 2016-04-27
US20120072624A1 (en) 2012-03-22

Similar Documents

Publication Publication Date Title
EP2616934B1 (en) Numa i/o framework
KR100612059B1 (ko) 분할 처리 환경에서의 자원 조절을 위한 방법, 컴퓨팅 시스템 및 그에 관한 기록 매체
EP3553655B1 (en) Distributed policy-based provisioning and enforcement for quality of service
JP6126312B2 (ja) 待ち時間の影響を受けやすい仮想マシンをサポートするように構成された仮想マシンモニタ
US9317453B2 (en) Client partition scheduling and prioritization of service partition work
EP1763749B1 (en) Facilitating access to input/output resources via an i/o partition shared by multiple consumer partitions
JP5159884B2 (ja) 論理区分の間におけるネットワーク・アダプタ・リソース割振り
CN102473106B (zh) 虚拟环境中的资源分配
US20060193327A1 (en) System and method for providing quality of service in a virtual adapter
US20110010709A1 (en) Optimizing System Performance Using Spare Cores in a Virtualized Environment
KR20040004554A (ko) 분할 처리 환경에서의 공유 i/o
CN104937584A (zh) 基于共享资源的质量向经优先级排序的虚拟机和应用程序提供优化的服务质量
KR20160031494A (ko) 캡슐화 가능한 pcie 가상화
KR20200080458A (ko) 클라우드 멀티-클러스터 장치
JP4854710B2 (ja) 仮想計算機システム及びネットワークデバイス共有方法
JP6653786B2 (ja) I/o制御方法およびi/o制御システム
Wyman et al. Multiple-logical-channel subsystems: Increasing zSeries I/O scalability and connectivity

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11755204

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2011755204

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2011755204

Country of ref document: EP