US20220308940A1 - Allocating and using file descriptors for an application executing on a plurality of nodes - Google Patents

Allocating and using file descriptors for an application executing on a plurality of nodes

Info

Publication number
US20220308940A1
Authority
US
United States
Prior art keywords
node, file, application, system call, nodes
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/493,794
Inventor
Aidan Cully
Vance MILLER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
VMware LLC
Original Assignee
VMware LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by VMware LLC
Priority to US17/493,794
Assigned to VMWARE, INC. (assignment of assignors' interest; assignors: CULLY, AIDAN; MILLER, VANCE)
Publication of US20220308940A1
Assigned to VMware LLC (change of name from VMWARE, INC.)
Legal status: Pending


Classifications

    • All classifications fall under G06F9/00 (arrangements for program control) within G06F (electric digital data processing), G06 (computing; calculating or counting), Section G (physics):
    • G06F9/44521 — Dynamic linking or loading; link editing at or after load time, e.g. Java class loading
    • G06F9/44505 — Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/45558 — Hypervisor-specific management and integration aspects
    • G06F9/5016 — Allocation of resources to service a request, the resource being the memory
    • G06F9/5038 — Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F9/5061 — Partitioning or combining of resources
    • G06F9/5088 — Techniques for rebalancing the load in a distributed system involving task migration
    • G06F9/52 — Program synchronisation; mutual exclusion, e.g. by means of semaphores
    • G06F9/546 — Message passing systems or structures, e.g. queues
    • G06F9/547 — Remote procedure calls [RPC]; Web services
    • G06F2009/45591 — Monitoring or debugging support

Definitions

  • AI — artificial intelligence
  • API — application programming interface
  • CPU — central processing unit
  • DL — dynamic linker
  • DSM — distributed shared memory
  • ELF — Executable and Linkable Format
  • GPU — graphics processing unit
  • ML — machine learning
  • NAS — network-attached storage
  • NIC — network interface controller
  • RPC — remote procedure call
  • Turning to the setup flows of FIGS. 4A-4D: in step 402, initiator node 206 establishes a connection to acceptor node 208.
  • In step 404, initiator node 206 establishes an application monitor and a runtime on itself and sends a message requesting that acceptor node 208 establish an application monitor and runtime as well.
  • Initiator node 206 then performs a coherent load of the application binary (step 406, described further with reference to FIG. 4C).
  • During the load, initiator node 206 may also load a library if needed.
  • In step 408, a thread is started using the newly created stack, with the application's 'main' function as its entry point.
  • On the acceptor side, acceptor node 208 receives the message to establish application monitor 318 and runtime 316 in step 420.
  • Acceptor node 208 receives the library or other deployable module from initiator node 206 and, in response, loads the received code.
  • Acceptor node 208 receives the request to create memory space from initiator node 206 and, in response, creates the memory space at the specified location.
  • Acceptor node 208 receives the request to create the stack address space from initiator node 206 and, in response, creates and locates the requested stack address space.
  • Acceptor node 208 then receives, in step 428, a command from initiator node 206 to form a dual (shadow) thread based on the execution thread on initiator node 206 and, in response, establishes the requested dual thread.
  • Initiator node 206 next synchronizes address spaces.
  • Initiator node 206 establishes a virtualization boundary. Establishing the boundary includes creating a sub-process (called VProcess below) that shares an address space with its parent process and can have its system calls traced by the parent. The parent process detects the sub-process's interactions with the operating system and ensures that these interactions are made coherently with the other node or nodes.
  • Initiator node 206 loads the application binary and an ELF (Executable and Linkable Format) interpreter binary into the address space inside the virtualization boundary.
  • The parent process detects this address-space manipulation through tracing and keeps the acceptor node coherent with changes made by the sub-process.
  • Initiator node 206 populates an initial stack for the ELF interpreter binary inside the virtualization boundary.
  • Initiator node 206 then starts executing the ELF interpreter binary on its own stack inside the virtualization boundary. Execution inside the virtualization boundary assures that address spaces and execution policies are coherent between the initiator and acceptor nodes and that any changes made by the runtime are intercepted so that consistency of the loaded application is maintained.
  • Executing the ELF interpreter binary inside the virtualization boundary may entail loading a library on the initiator or acceptor node and possibly establishing a migration policy regarding the library (e.g., pinning the library to a particular node, such as the acceptor node). Additionally, the ELF interpreter binary may establish additional coherent memory spaces, including stack spaces needed by the application.
  • Alternatively, instead of loading the application binary on initiator node 206 in step 434, initiator node 206 sends acceptor node 208 a command containing instructions for loading the application binary, and acceptor node 208 processes these instructions to load the binary on itself.
  • Turning to FIG. 4D, coherent execution threads are established by starting an execution thread using the just-created stack in step 408.
  • A command to form a dual execution thread corresponding to an execution thread on the local node is sent to acceptor node 208.
  • The thread information is then returned.
  • The dual thread is paused, or parked, awaiting a control transfer request from the local node.
  • When execution moves from one node to another, the register state of the local thread is recorded and sent to the other node as the local thread is parked.
  • The other node receives the register state and uses it to resume the parked dual thread. In this way, the previously active thread becomes the inactive thread, and the inactive thread becomes the currently active thread.
  • The movement of the active thread is further described with respect to FIGS. 6A and 6B; a minimal sketch of the park/resume transfer follows.
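  • The following is a minimal, single-process sketch of this park/resume control transfer. The names (RegisterState, DualThread, park, resume) are illustrative assumptions, the network transport is omitted, and a real monitor would capture and ship actual CPU registers.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class RegisterState:
    """Stand-in for the CPU register snapshot shipped between nodes."""
    pc: int = 0
    sp: int = 0
    gprs: dict = field(default_factory=dict)

class DualThread:
    """One half of an initiator/acceptor thread pair."""
    def __init__(self):
        self._wake = threading.Event()   # a worker would block on this while parked
        self.state = 'parked'
        self.regs = None

    def park(self) -> RegisterState:
        """Record register state and pause; the caller sends the snapshot
        to the peer node, whose dual thread resumes from it."""
        self.state = 'parked'
        self.regs = RegisterState()      # a real monitor captures registers here
        self._wake.clear()
        return self.regs

    def resume(self, regs: RegisterState):
        """Install the register state received from the peer and continue."""
        self.regs = regs
        self.state = 'running'
        self._wake.set()

# The previously active thread parks and ships its snapshot; the peer's
# dual thread resumes from it, swapping which half of the pair is active.
local, remote = DualThread(), DualThread()
remote.resume(local.park())
```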
  • An MSI coherence protocol applied to pages maintains coherence between memory spaces on the nodes, so that the threads of the runtime are operable on any of the nodes.
  • Under this protocol, a modified (state 'M') memory page in one node is considered invalid (state 'I') in the other.
  • A shared (state 'S') memory page is considered read-only in both nodes.
  • A code or data access to a memory page that is pinned to acceptor node 208 causes execution of the thread to migrate to acceptor node 208, followed by migration of the page; a data access to a memory page that is migratory triggers a migration of that memory page in a similar manner. A sketch of the per-page bookkeeping follows.
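  • Below is a hedged sketch of the per-page MSI bookkeeping described above. The patent does not specify data structures, so PageState, PageTable, and the peer interface are assumptions; only the M/S/I transitions follow the text.

```python
from enum import Enum

class PageState(Enum):
    MODIFIED = 'M'   # writable on this node; the peer's copy is invalid
    SHARED = 'S'     # read-only on both nodes
    INVALID = 'I'    # must be fetched from the peer before use

class PageTable:
    def __init__(self, peer):
        self.peer = peer       # object offering invalidate(page) and fetch(page)
        self.states = {}       # page number -> PageState

    def on_write(self, page):
        """Gaining write access invalidates the peer's copy."""
        if self.states.get(page) is not PageState.MODIFIED:
            self.peer.invalidate(page)
            self.states[page] = PageState.MODIFIED

    def on_read(self, page):
        """A read of an invalid page demand-fetches it as shared."""
        if self.states.get(page, PageState.INVALID) is PageState.INVALID:
            self.peer.fetch(page)
            self.states[page] = PageState.SHARED
```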
  • FIGS. 5A-5B describe the interactions of running the application on the initiator and acceptor nodes after the setup of FIGS. 4A-4D is complete. These interactions include, in the course of executing the application on the initiator node, executing a library or other deployable module on the acceptor node. Executing the library or other deployable module involves 'faulting in' the code pages for the library or other deployable module and the data pages of the stack or other memory space, and then moving execution back to the initiator node.
  • FIG. 5A depicts a flow of operations for running the initiator node, according to an embodiment.
  • In FIG. 5A, acceptor node 208 is optionally pre-provisioned with the stack or memory pages anticipated for executing threads on acceptor node 208, as described below.
  • Acceptor node 208 is also optionally pre-provisioned with the functions of the library or other deployable module anticipated for the code.
  • The state of the thread is then set to running.
  • The initiator node executes application 314 using the now-running thread on initiator node 206.
  • The thread determines whether execution of a function of a library or other deployable module is needed. If not, the thread continues executing its workload.
  • Otherwise, in step 512, a message is sent to acceptor node 208 to migrate the workload of the thread to acceptor node 208.
  • In step 514, the state of the local thread is set to parked, which means that the thread is paused but runnable on behalf of a dual thread on acceptor node 208.
  • In step 516, initiator node 206 awaits and receives a message to migrate the workload of the thread back to initiator node 206 after acceptor node 208 has finished executing the function of the library or other deployable module.
  • Pre-provisioning of the memory pages or stack pages is driven by DWARF-type (debugging with attributed record formats) debugger data.
  • DWARF-type debugger data contains the addresses and sizes of all functions that can be reached from this point in the call graph, allowing the code pages to be sent to acceptor node 208 before being brought in by demand paging. In this way, acceptor node 208 can pre-provision the memory it needs to perform its function prior to resuming execution. A sketch of reading function extents from DWARF data follows.
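  • One plausible way to obtain those function addresses and sizes is sketched below using the third-party pyelftools package. The call-graph walk that selects which functions are reachable is omitted, and the simplified handling of DW_AT_high_pc is an assumption.

```python
from elftools.elf.elffile import ELFFile

def function_extents(binary_path):
    """Yield (name, low_pc, size) for every subprogram with address data,
    i.e., the extents a node could use to pre-send code pages."""
    with open(binary_path, 'rb') as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        for cu in elf.get_dwarf_info().iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != 'DW_TAG_subprogram':
                    continue
                attrs = die.attributes
                if 'DW_AT_low_pc' not in attrs or 'DW_AT_high_pc' not in attrs:
                    continue
                name = (attrs['DW_AT_name'].value.decode()
                        if 'DW_AT_name' in attrs else '?')
                low = attrs['DW_AT_low_pc'].value
                high = attrs['DW_AT_high_pc'].value
                # DW_AT_high_pc may encode an end address or a size,
                # depending on its form; this heuristic treats values
                # below low_pc as sizes.
                size = high - low if high > low else high
                yield name, low, size
```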
  • FIG. 5B depicts a flow of operations for running an acceptor node, according to an embodiment.
  • In FIG. 5B, the state of the local thread is initially set to parked.
  • In step 554, one of five events occurs on acceptor node 208.
  • The events are 'migrate to acceptor', 'module fault', 'stack fault', 'application code execution', and 'default'.
  • The module fault and stack fault are examples of memory faults, which may also include other types not described here, such as heap faults and code faults. The different types of memory faults are handled in a similar manner.
  • If the event is 'migrate to acceptor', then the state of the local thread is set to running in step 556. Flow continues to step 574, which maintains the thread's current state, and to step 576, where acceptor node 208 determines whether the thread has terminated. If not, control continues to step 554 to await the next event, such as a 'module fault', a 'stack fault', or 'application code execution'.
  • If the event is a 'module fault', e.g., a library fault, then the state of the thread is set to parked in step 558, and in step 560, acceptor node 208 requests and receives from initiator node 206 a code page of the library or other deployable module that has not yet been paged in. In step 562, acceptor node 208 sets the state of the local thread to running, and flow continues through steps 574, 576, and 554 to await the next event if the thread has not terminated.
  • If the event is a 'stack fault', the thread's state is set to parked in step 564, and acceptor node 208 sends a request to initiator node 206 to receive a stack page not yet paged in.
  • In step 568, the thread's state is set to running, and flow continues through steps 574, 576, and 554 to await the next event, assuming no thread termination.
  • If the event is 'application code execution', then the state of the local thread is set to parked in step 570, and acceptor node 208 sends a 'migrate control' message to initiator node 206 in step 572. Flow continues through steps 574, 576, and 554 to await the next event.
  • If the event is 'default' (i.e., any other event), then the thread's state is maintained in step 574, and flow continues through steps 576 and 554 to await the next event.
  • If the thread terminates, as determined in step 576, the stack is sent back to initiator node 206 in step 578, and flow continues at step 554, awaiting the next event. If no event occurs, the 'default' event applies, which loops via steps 574 and 554 to maintain the thread's current state. A condensed sketch of this event loop follows.
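  • The event handling above can be condensed into a small, runnable state machine. The event names mirror the description; the transport and paging helpers are reduced to stub callables (an assumption, since the patent leaves them abstract), and termination handling is omitted.

```python
PARKED, RUNNING = 'parked', 'running'

def acceptor_loop(events, request_page, send_to_initiator):
    """events: iterable of event names arriving at the acceptor node."""
    state = PARKED                                    # initial state
    for event in events:
        if event == 'migrate to acceptor':
            state = RUNNING                           # step 556
        elif event in ('module fault', 'stack fault'):
            state = PARKED                            # steps 558/564
            request_page(event)                       # fault the page in from the peer
            state = RUNNING                           # steps 562/568
        elif event == 'application code execution':
            state = PARKED                            # step 570
            send_to_initiator('migrate control')      # step 572
        # 'default' (any other event): keep the current state (step 574)
    return state

# Example run: the loop ends parked after handing control back (step 572).
final = acceptor_loop(
    ['migrate to acceptor', 'module fault', 'application code execution'],
    request_page=lambda event: None,
    send_to_initiator=lambda message: None)
assert final == PARKED
```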
  • FIGS. 6A-6C depict the flow of operations for executing a system call and possibly moving its execution to another node. Specifically, FIG. 6A depicts a flow of operations for implementing a system call on the initiator node, according to an embodiment.
  • FIG. 6B depicts a flow of operations for implementing a system call on the acceptor node, according to an embodiment.
  • FIG. 6C depicts a flow of operations for implementing a Detect Local function, according to an embodiment.
  • Beginning with FIG. 6A, a thread running on the local node makes a system call.
  • The application monitor on the local node receives the system call via a program that is responsible for manipulating interactions with the virtualization boundary (called VpExit below).
  • The application monitor then determines whether the arguments involve local or remote resources.
  • If the system call involves remote resources ('No' branch), the running thread is parked, and in step 610, the application monitor sends the system call and its arguments to the application monitor on the remote node that is to handle the system call.
  • In step 612, the application monitor on the local node awaits the completion and results of the system call, and in step 614, the running thread receives the results of the system call (via VpExit) and is made active again.
  • Referring back to step 608, if the system call involves only local resources ('Yes' branch), then the local node handles the system call in step 616.
  • Turning to FIG. 6B, in step 632, the application monitor on the remote node receives the system call and its arguments.
  • In step 634, the state of the parked thread is set to active (i.e., running), and the remote node handles the system call in step 636.
  • In step 638, the results of the system call are returned to the thread that made the call, which provides the results to the application monitor in step 640, after which, in step 642, the state of the thread is set back to parked.
  • In step 644, the application monitor sends the completion and results back to the local node.
  • Turning to FIG. 6C, in step 652, the Detect Local function gets all of the system call arguments and, in step 654, determines, for system calls other than file accesses, whether the arguments interact with a resource pinned on another node, which is either a different acceptor node or the initiator node. If so, the function returns 'True' in step 656; otherwise, it returns 'False' in step 658. If the system call is a file access, the flow executes step 655, which is further described with reference to FIG. 9. A condensed sketch of this dispatch follows.
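  • Steps 602-616 and 652-658 can be condensed into the following sketch. The RPC layer, local execution, and pinned-resource policy are reduced to callables; the list of file-access calls and all names are illustrative assumptions.

```python
def detect_remote(syscall, args, pinned_remote, file_is_remote):
    """Steps 652-658: does this call touch a resource pinned elsewhere?"""
    if syscall in ('open', 'read', 'write', 'close'):  # file accesses: step 655
        return file_is_remote(args)                    # see FIG. 9
    return any(arg in pinned_remote for arg in args)   # step 654

def handle_syscall(thread, syscall, args, run_local, run_remote, **policy):
    if detect_remote(syscall, args, **policy):
        thread['state'] = 'parked'                     # park the caller
        result = run_remote(syscall, args)             # steps 610-612
        thread['state'] = 'running'                    # step 614, via VpExit
        return result
    return run_local(syscall, args)                    # step 616

# Example: a call touching a remotely pinned resource is forwarded.
thread = {'state': 'running'}
result = handle_syscall(
    thread, 'ioctl', ['gpu0'],
    run_local=lambda s, a: 'local-result',
    run_remote=lambda s, a: 'remote-result',
    pinned_remote={'gpu0'}, file_is_remote=lambda a: False)
assert result == 'remote-result'
```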
  • FIG. 7 depicts a flow of operations for loading a program file and a dynamic linker, according to an embodiment.
  • The flow of operations of FIG. 7 describes in more detail the loading of the application in step 432 of FIG. 4C, where the loading is performed by the operating system, the application monitor, and the dynamic linker.
  • In step 702, application monitor 340 loads the ELF program file and gets a file system path for the ELF interpreter binary.
  • Application monitor 340 then prepares an initial stack frame for a binary of application program 314 (hereinafter referred to as the 'primary binary').
  • In step 708, application monitor 340 acquires the primary binary using the ELF interpreter and informs the binary of the initial stack frame.
  • Application monitor 340 then starts DL 344, which was loaded by operating system 310.
  • In step 710, DL 344 runs, and in step 712, DL 344 relocates the primary binary and itself to executable locations, which are locations in system memory from which the OS allows code execution.
  • In step 714, DL 344 loads the program dependencies (of the library or other deployable module) and alters the system call table to intercept all system calls made by the primary binary. Some system calls are allowed through unchanged, while others are altered when DL 344 interacts with operating system 310.
  • In step 716, DL 344 causes the relocated primary binary of application program 314 to run at the executable location. As a result, both application program 314 and DL 344 run in userspace. Running in userspace allows loading of the library or other deployable module to take place within the virtualization boundary.
  • DL 344 can replace certain function calls that go through the library or other deployable modules with customized versions that add functional augmentation based on known semantics.
  • DL 344 assures, via the application monitor, that threads see a consistent view of the address space, so the execution of threads may migrate across the nodes.
  • A 'ptrace' system call is used to track the execution of DL 344 to find how it interacts with operating system 310. Interactions are then rewritten so that they run coherently between initiator node 206 and acceptor node 208.
  • All interactions with operating system 310 go through symbols defined by, or resolved through, DL 344.
  • FIGS. 8A-8D describe in more detail the components and operations during setup of the initiator and acceptor nodes, corresponding to steps 404, 442, 446, 450, 464, and 466 of FIGS. 4A-4D.
  • FIG. 8A depicts components in an initiator node and an acceptor node involved in setting up the initiator and acceptor nodes, according to an embodiment.
  • FIG. 8B depicts a flow of operations between initiator and acceptor nodes during address space synchronization, according to an embodiment.
  • FIG. 8C depicts a flow of operations between initiator and acceptor nodes during the creation of a coherent application, according to an embodiment.
  • FIG. 8D depicts a flow of operations between initiator and acceptor nodes during the establishment of runtimes, according to an embodiment.
  • As shown in FIG. 8A, initiator node 206 includes a VProcess 802, a Runtime module 804, a Bootstrap module 806, and a VpExit module 808.
  • Acceptor node 208 includes similar components 822 , 824 , 826 , 828 as on initiator node 206 , along with an Init module 830 .
  • VpExit modules 808 and 828 are responsible for manipulating VProcess 802 and 822 interactions across their respective virtualization boundaries.
  • In FIG. 8B, the acceptor Init module 830 receives a 'hello' function designating the address space from initiator Runtime 804.
  • Acceptor Init module 830 sends a 'create VpExit' message to acceptor Bootstrap module 826.
  • Acceptor Init module 830 then sends an acknowledgment regarding the address space message back to initiator node 206.
  • As a result, a synchronized address space is established between initiator node 206 and acceptor node 208.
  • In FIG. 8C, initiator node 206 sends a 'create VpExit' message to initiator Bootstrap module 806.
  • Initiator node 206 sends a 'create' message to VProcess 802 of initiator node 206, which receives a 'load VpExit' message in step 842 from initiator node 206.
  • VProcess 802 is created outside of the Remote Procedure Call (RPC) layer, and the resources that VProcess 802 uses are virtualized.
  • VProcess 802 sends an 'Mmap' message to VpExit module 808 of initiator node 206, which sends an 'mmap' message in step 845 to initiator node 206 and an 'update the address map' message in step 846 to Bootstrap module 826 of acceptor node 208.
  • Bootstrap module 826 of acceptor node 208 sends an acknowledgment ('ok') back to initiator node 206, which relays the message in step 850 to VpExit module 808, which relays the message to VProcess 802 in step 852.
  • In this way, the address map of the application on the initiator node is made coherent with that on the acceptor node.
  • In FIG. 8D, initiator VProcess 802 sends a 'VpExit(Enter, hook_page)' message to VpExit module 808.
  • VpExit module 808 sends an 'Enter(hook_page)' message to initiator Bootstrap module 806.
  • Initiator Bootstrap module 806 sends a 'create(VpExit)' message to initiator Runtime 804.
  • Initiator Bootstrap module 806 sends a 'bootstrap(Runtime, hook_page)' message to acceptor Bootstrap module 826, which sends, in step 862, an 'install(VpExit, hook_page)' message to acceptor Runtime module 824.
  • Acceptor Runtime module 824 sends an 'install(VpExit)' message to acceptor VProcess 822.
  • Acceptor Bootstrap module 826 sends a 'Runtime' message to initiator Bootstrap module 806, which returns in step 868 to VpExit module 808, which returns in step 870 to VProcess 802.
  • At this point, initiator node 206 and acceptor node 208 have both created runtimes for VProcess 802 and VProcess 822, and the memory and address spaces for VProcess 802 and 822 are coherent. A sketch of the mirrored 'mmap' exchange follows.
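  • The following is a hedged sketch of the 'Mmap' exchange (steps 845, 846, 850, and 852): the mapping is applied locally and mirrored to the acceptor's address map before the virtual process proceeds. The wire format is an assumption, and a real implementation would map at the same fixed address on both nodes rather than letting the kernel choose.

```python
import mmap

def coherent_mmap(length, send_update):
    """Map anonymous memory locally, then block until the peer has
    acknowledged recording the same region in its address map."""
    region = mmap.mmap(-1, length)                       # local mmap (step 845)
    ack = send_update({'op': 'mmap', 'length': length})  # step 846
    if ack != 'ok':                                      # ack relay, steps 850-852
        region.close()
        raise RuntimeError('peer failed to update its address map')
    return region

# Example with a stub transport standing in for the acceptor's Bootstrap module.
buf = coherent_mmap(4096, send_update=lambda msg: 'ok')
buf[:5] = b'hello'
```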
  • Initiator node 206 uses the system 'ptrace' facility to intercept system calls generated by the virtual process.
  • The application monitor runs in the same address space as the virtual process, which means that the application monitor is in the same physical process as the virtual process.
  • Linux's clone(2) system call allows the virtual process to be traced.
  • The virtual process issues SIGSTOP to itself, which pauses execution of the virtual process before any virtual-process resources are allocated.
  • The application monitor attaches to the virtual process via 'ptrace', which allows it to continue execution (using SIGCONT) from the point at which the virtual process entered SIGSTOP. A runnable illustration of this stop-and-resume handshake follows.
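  • The signal choreography alone can be illustrated with the standard library (Linux/POSIX): the child stops itself before doing any work, and the parent, which would attach with ptrace at that point, resumes it with SIGCONT. The ptrace attach itself is omitted, since Python has no ptrace binding in the standard library.

```python
import os, signal, sys

pid = os.fork()
if pid == 0:                                   # child: the virtual process
    os.kill(os.getpid(), signal.SIGSTOP)       # pause before allocating resources
    sys.exit(0)                                # reached only after SIGCONT
else:                                          # parent: the application monitor
    _, status = os.waitpid(pid, os.WUNTRACED)  # wait for the child to stop
    assert os.WIFSTOPPED(status)
    # PTRACE_ATTACH / PTRACE_SEIZE would be issued here.
    os.kill(pid, signal.SIGCONT)               # resume the stopped child
    os.waitpid(pid, 0)                         # reap its exit status
```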
  • FIG. 9 depicts a flow of operations for accessing a file, according to an embodiment.
  • In the embodiments, a file system resides on each of the nodes. During execution, the application may request access to one or more files in these file systems by making a system call. If the requested file resides on the node making the system call, the file is available locally. However, if the file resides on a different node (another acceptor node or the initiator node), the system call is remotely executed according to FIGS. 6A-6C. In step 655 of FIG. 6C, the flow determines whether the arguments of the system call interact with a remote pinned resource, i.e., a file that is not local to the node receiving the system call.
  • The steps of FIG. 9 depict the use of the file descriptor, which was returned during a previous system call in which the file was opened, to determine the node on which the system call is to be executed.
  • The flow tests the file descriptor against a criterion.
  • In one embodiment, the criterion is whether the file descriptor obtained in step 654 of FIG. 6C (during an open(filename) or other system call that returns the file descriptor fd) is even or odd. If the file descriptor is an even integer, as determined in step 902, initiator node 206 is determined to have the file in step 904, because only files with even fds can be stored on the initiator. If the current node is initiator node 206, as determined in step 910, then a 'False' value is returned in step 916.
  • The 'False' value indicates that the system call arguments do not interact with a remote pinned resource, and the system call is handled locally. If the current node is acceptor node 208, as determined in step 912, then a 'True' value is returned in step 914. The 'True' value indicates that the system call arguments do interact with a remote pinned resource, and the system call is to be handled remotely.
  • In another embodiment, the criterion is whether the file descriptor is less than a specified integer, say 512. If so, as determined in step 902, initiator node 206 is determined to have the file in step 904, because only files with fds less than 512 are stored on the initiator; the 'True' and 'False' values are then returned in steps 910-916 exactly as described above.
  • Otherwise, acceptor node 208 is determined to have the file in step 906, because only files with fds of 512 or greater are stored on the acceptor node; a 'False' value is returned in step 916 if the current node is the acceptor, and a 'True' value is returned in step 914 otherwise. A sketch of both variants follows.
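  • Below is a sketch of the FIG. 9 test in both described variants. The parity and threshold rules come from the text; the function names and node labels are illustrative.

```python
INITIATOR, ACCEPTOR = 'initiator', 'acceptor'

def fd_home(fd, scheme='parity', threshold=512):
    """Which node's file system holds the file behind this descriptor?"""
    if scheme == 'parity':
        return INITIATOR if fd % 2 == 0 else ACCEPTOR  # even fds: initiator
    return INITIATOR if fd < threshold else ACCEPTOR   # low fds: initiator

def interacts_with_remote_pinned(fd, current_node, scheme='parity'):
    """Step 655: 'True' routes the system call to the other node."""
    return fd_home(fd, scheme) != current_node

assert interacts_with_remote_pinned(4, ACCEPTOR)            # even fd, on acceptor
assert not interacts_with_remote_pinned(4, INITIATOR)       # handled locally
assert interacts_with_remote_pinned(600, INITIATOR, 'threshold')
```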
  • Certain embodiments described above involve a hardware abstraction layer on top of a host computer.
  • The hardware abstraction layer allows multiple contexts to share the hardware resources. These contexts are isolated from each other, each having at least a user application program running therein.
  • The hardware abstraction layer thus provides the benefits of resource isolation and allocation among the contexts.
  • In the foregoing embodiments, virtual machines are used as an example of the contexts, and hypervisors as an example of the hardware abstraction layer.
  • Each virtual machine includes a guest operating system in which at least one application program runs.
  • Other embodiments use containers that do not include a guest operating system, referred to herein as 'OS-less containers' (see, e.g., www.docker.com).
  • OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer.
  • The abstraction layer supports multiple OS-less containers, each including an application program and its dependencies.
  • Each OS-less container runs as an isolated process in userspace on the host operating system and shares the kernel with other containers.
  • The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces, and to completely isolate the application program's view of the operating environment.
  • By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces.
  • Multiple containers can share the same kernel, but each container can be constrained to use only a defined amount of resources, such as CPU, memory, and I/O.
  • Certain embodiments may be implemented in a host computer without a hardware abstraction layer or an OS-less container.
  • For example, certain embodiments may be implemented in a host computer running a Linux® or Windows® operating system.
  • One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer-readable media.
  • As used herein, the term 'computer-readable medium' refers to any data storage device that can store data that can thereafter be input to a computer system.
  • Computer-readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer.
  • Examples of a computer-readable medium include a hard drive, network-attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc) such as a CD-ROM, CD-R, or CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices.

Abstract

A method for allocating and using file descriptors for an application executing over a plurality of nodes, each having a file system, includes receiving a system call from the application running on a first node to access a file in a file system, determining whether the file resides in the file system of the first node or of a second node, and, upon determining that the file resides on the second node, sending the system call and the arguments of the system call to the second node for execution on the second node and returning a result of the system call executed on the second node to the application on the first node.

Description

    CROSS-REFERENCE TO RELATED APPLICATION(S)
  • This application claims the benefit of U.S. Provisional Application No. 63/164,955, filed on Mar. 23, 2021, which is incorporated by reference herein.
  • BACKGROUND
  • Data volume is increasing due to artificial intelligence (AI) and deep learning applications. This increase in data volume requires a commensurate increase in compute power. However, microprocessors cannot supply the needed compute power. Consequently, specialized architectures, such as accelerators and coprocessors, are taking over many of the compute tasks. These specialized architectures need to share access to large portions of system memory to achieve significant performance improvement.
  • Using specialized architectures creates new problems to be solved. Virtualizing specialized architectures is difficult, requiring high investment and strong vendor support because the architectures are usually proprietary.
  • One solution is intercepting the programming interfaces for the architecture, i.e., the application programming interfaces (APIs). In this solution, the intercepted APIs are sent to a node, on which a particular specialized architecture (such as graphics processing units (GPUs) of a particular vendor) is installed and executed on that node. The execution relies on distributed shared memory (DSM) between central processing units (CPUs) and the GPUs. When tight memory coherence is needed between the CPUs and GPUs, remote procedure calls (RPCs) are used, which requires high traffic between nodes and highly detailed knowledge of the API semantics and the GPUs.
  • A better solution is needed, i.e., one that can handle specialized architectures of not just one but many different vendors on the same node without requiring specialized knowledge of the specialized architecture.
  • SUMMARY
  • One embodiment provides a method for allocating and using file descriptors for an application executing over a plurality of nodes, including a first node and a second node, each having a file system. The method includes executing a system call from the application running on the first node to access a file in a file system and determining whether the file resides in a file system of the first node or of the second node. The method further includes, if the file resides on the second node, sending the system call and arguments of the system call to the second node for execution on the second node, receiving a result from the system call that is executed on the second node, and returning the result to the application on the first node.
  • Further embodiments include a device configured to carry out one or more aspects of the above method and a computer system configured to carry out one or more aspects of the above method.
  • BRIEF DESCRIPTION OF DRAWINGS
  • FIG. 1 depicts an arrangement for accessing banks of GPUs in the prior art.
  • FIG. 2 depicts an arrangement for accessing banks of accelerators, according to an embodiment.
  • FIG. 3 depicts a representative system in which embodiments may operate.
  • FIG. 4A depicts a flow of operations for an initiator node setup, according to an embodiment.
  • FIG. 4B depicts a flow of operations for an acceptor node setup, according to an embodiment.
  • FIG. 4C depicts a flow of operations for loading an application, according to an embodiment.
  • FIG. 4D depicts a flow of operations for creating threads for an application, according to an embodiment.
  • FIG. 5A depicts a flow of operations for running the initiator node, according to an embodiment.
  • FIG. 5B depicts a flow of operations for running an acceptor node, according to an embodiment.
  • FIG. 6A depicts a flow of operations for implementing a system call on the initiator node, according to an embodiment.
  • FIG. 6B depicts a flow of operations for implementing a system call on the acceptor node, according to an embodiment.
  • FIG. 6C depicts a flow of operations for implementing a Detect Local function, according to an embodiment.
  • FIG. 7 depicts a flow of operations for loading a program file and a dynamic linker, according to an embodiment.
  • FIG. 8A depicts components in an initiator node and an acceptor node involved in setting up the initiator and acceptor nodes, according to an embodiment.
  • FIG. 8B depicts a flow of operations between initiator and acceptor nodes during address space synchronization, according to an embodiment.
  • FIG. 8C depicts a flow of operations between initiator and acceptor nodes during the creation of a coherent application, according to an embodiment.
  • FIG. 8D depicts a flow of operations between initiator and acceptor nodes during the establishment of runtimes, according to an embodiment.
  • FIG. 9 depicts a flow of operations for accessing a file, according to an embodiment.
  • DETAILED DESCRIPTION
  • In the embodiments, an application is co-executed among a plurality of nodes, where each node has installed thereon a plurality of specialized architecture coprocessors, including those for artificial intelligence (AI) and machine learning (ML) workloads. Such applications have their own runtimes, and these runtimes offer a way of capturing these workloads by virtualizing the runtimes. New architectures are easier to handle because of the virtualized runtime, and coherence among nodes is improved because the code for a specialized architecture runs locally to the specialized architecture. An application monitor is established on each of the nodes on which the application is co-executed. The application monitors maintain the needed coherence among the nodes to virtualize the runtime and engage semantic-aware hooks to reduce unnecessary synchronization in the maintenance of the coherence.
  • FIG. 1 depicts an arrangement for accessing banks of GPUs in the prior art. In the arrangement depicted, users 102 interact through a virtualized cluster of hosts 104, which is connected via a network 112 to nodes 106, 108, 110, each containing a bank of GPUs of a particular vendor. Each node 106, 108, and 110 is a server with a hardware platform and an operating system. Each node is configured with the GPUs of the particular vendor. Compute nodes in virtualized cluster of hosts 104 send API calls, which are specific to the GPUs, to nodes 106, 108, 110 for execution on the GPUs.
  • FIG. 2 depicts an arrangement for accessing banks of accelerators, according to an embodiment. In the arrangement depicted, users 102 interact through a virtualized cluster of hosts 104, which is connected via a network 112 to nodes 206, 208, 210, where each node is a server-type architecture having a hardware platform, an operating system, and possibly a virtualization layer. The hardware platform includes CPUs, RAM, network interface controllers, and storage controllers. The operating system may be a Linux® operating system or Windows® operating system. A virtualization layer may be present, and the above operating systems may run above the virtualization layer. In addition, in the figure, each node contains banks of heterogeneous accelerators. That is, each node 206, 208, 210 can contain many different types of accelerators, including ones from different vendors. Compute nodes in virtualized cluster of hosts 104 send requests to nodes 206, 208, 210 to run portions of applications installed in the compute nodes on a runtime installed on nodes 206, 208, 210.
  • In an alternative embodiment, nodes 206, 208, 210 are nodes with large amounts of memory, and portions of a large database or other application are installed on nodes 206, 208, 210 to run thereon, taking advantage of the nodes' large amounts of memory. Portions of the application are targeted for execution on nodes having large amounts of memory instead of on specific accelerators.
  • Languages often used for programming the specialized architectures or accelerators include Python®. In the Python language, the source code is parsed and compiled to byte code, which is encapsulated in Python code objects. The code objects are then executed by a Python virtual machine that interprets the code objects. The Python virtual machine is a stack-oriented machine whose instructions are executed by a number of co-operating threads. The Python language is often supplemented with platforms or interfaces that provide a set of tools, libraries, and resources for easing the programming task. One such platform is TensorFlow®, in which the basic unit of computation is a computation graph. The computation graph includes nodes and edges, where each node represents an operation, and each edge describes a tensor that gets transferred between the nodes. The computation graph in TensorFlow is a static graph that can be optimized. Another such platform is PyTorch®, which is an open-source machine-learning library. PyTorch also employs computational graphs, but the graphs are dynamic instead of static. Because computation graphs provide a standardized representation of computation, they can become modules deployable for computation over a plurality of nodes.
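  • The compile-to-code-object pipeline described above can be observed directly with the standard library:

```python
import dis

source = "def mul(a, b):\n    return a * b\n"
code_obj = compile(source, '<demo>', 'exec')  # parse + compile to byte code
print(type(code_obj))                         # <class 'code'>
dis.dis(code_obj)                             # byte code the Python VM interprets
```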
  • In the embodiments, an application is co-executed among a plurality of nodes. To enable such co-execution, runtime and application monitors are established in each of the nodes. The runtimes are virtual machines that run a compiled version of the code of the application, and the application monitors co-ordinate the activity of the runtimes on each of the nodes.
  • FIG. 3 depicts a representative system in which embodiments may operate. The system includes two nodes: an initiator node 206, which starts up the system and thereafter operates as a peer node, and one or more acceptor nodes 208 (only one of which is depicted). Initiator node 206 and acceptor node 208 each include a process container 302, 308 containing an application 314, a runtime 316, 338, an application monitor 318, 340, one or more threads of execution 320, 346, data pages 324, 348, and code pages 322, 350 for the threads. Process containers 302, 308 run in userspace. In one embodiment, process containers 302, 308 are Docker® containers, runtimes 316, 338 are Python virtual machines, application 314 is a Python program with libraries such as TensorFlow or PyTorch, and threads 320, 346 correspond to the threads of the Python virtual machine. Application monitor 340 on initiator node 206 includes a dynamic linker (DL) 344 and a configuration file 342 for configuring the participating nodes. In general, a dynamic linker is the part of an OS that loads and links libraries and other modules needed by executable code while the code is executing. Alternatively, the initiator node sets up an acceptor node to have an application monitor with a DL and configuration file, and the application program is loaded onto the acceptor node.
  • Each node 206, 208 further includes an operating system 304, 310, and a hardware platform 306, 312. Operating system 304, 310, such as the Linux® operating system or Windows® operating system, provides the services to run process containers 302, 308. In some embodiments, operating system 304, 310 runs on hardware platform 306, 312. In other embodiments, operating system 304, 310 is a guest operating system running on a virtual hardware platform of a virtual machine that is provisioned by a hypervisor from hardware platform 306, 312. In addition, operating system 304, 310 provides a file system 364, 366, which contains files and associated file descriptors, each of which is an integer identifying a file.
  • Hardware platform 306, 312 on the nodes respectively includes one or more CPUs 326, 352, system memory, e.g., random access memory (RAM) 328, 354, one or more network interface controllers (NICs) 330, 356, a storage controller 332, 358, and a bank of heterogeneous accelerators 334, 360. The nodes are interconnected by network 112, such as Ethernet®, InfiniBand, or Fibre Channel.
  • Before running an application over a plurality of nodes, the nodes are set up. Setup of the initiator node 206 and acceptor node 208 includes establishing the application monitor and runtimes on each of the nodes on which libraries or other deployable modules are to run, the coherent memory spaces in which the application, libraries or other deployable modules are located, and the initial thread of execution of each runtime. With the setup complete, the application monitors and runtimes in each node co-operate to execute the application among the plurality of nodes.
  • FIGS. 4A-4D depict a flow of operations for an initiator node 206 setup and an acceptor node 208 setup, according to an embodiment. Specifically, FIG. 4A depicts a flow of operations for an initiator node setup, according to an embodiment. FIG. 4B depicts a flow of operations for an acceptor node setup, according to an embodiment. FIG. 4C depicts a flow of operations for loading an application, according to an embodiment. FIG. 4D depicts a flow of operations for creating threads for an application, according to an embodiment.
  • Referring to FIG. 4A, on start-up, initiator node 206 establishes a connection to acceptor node 208 in step 402. In step 404, initiator node 206 establishes an application monitor and a runtime on initiator node 206 and sends a message requesting that acceptor node 208 establish an application monitor and runtime thereon. Initiator node 206 then performs a coherent load of an application binary (step 406, further described with reference to FIG. 4C). In step 408, initiator node 206 may load a library if needed. In step 412, further described with reference to FIG. 4D, a thread is started using the stack created during the coherent load, with the entry point being the application's 'main' function.
  • Referring to FIG. 4B, on start-up, acceptor node 208 receives a message to establish application monitor 318 and runtime 316 in step 420. In step 422, acceptor node 208 receives the library or other deployable module from initiator node 206, and in response, loads the received code for the library or other deployable module. In step 424, acceptor node 208 receives the request to create memory space from initiator node 206 and, in response, creates the memory space at the specified location. In step 426, acceptor node 208 receives a request to create the stack address space from initiator node 206 and, in response, creates and locates the requested stack address space. Acceptor node 208 then receives, in step 428, a command from initiator node 206 to form a dual (shadow) thread based on the execution thread in initiator node 206 and, in response, establishes the requested dual thread.
  • Referring to FIG. 4C, in step 432, initiator node 206 synchronizes address spaces. In step 434, initiator node 206 establishes a virtualization boundary. Establishing the boundary includes creating a sub-process (called VProcess below) that shares an address space with its parent process and can have its system calls traced by the parent. The parent process detects the sub-process interactions with the operating system and ensures that these interactions are made coherently with the other node or nodes. In step 436, initiator node 206 loads the application binary and an ELF (Executable and Linkable Format) interpreter binary into the address space inside the virtualization boundary. The parent process detects this address space manipulation through tracing and keeps the acceptor node coherent with changes made by the sub-process. In step 438, initiator node 206 populates an initial stack for the ELF interpreter binary inside the virtualization boundary, and in step 440, initiator node 206 starts executing the ELF interpreter binary on its own stack inside the virtualization boundary. Execution inside the virtualization boundary assures that address spaces and execution policies are coherent between the initiator and acceptor nodes and that any changes made by the runtime are intercepted so that consistency of the loaded application is maintained.
  • Executing the ELF interpreter binary inside the virtualization boundary may entail loading a library on the initiator or acceptor node and possibly establishing a migration policy regarding the library (e.g., pinning the library to a node, e.g., the acceptor node). Additionally, the ELF interpreter binary may establish additional coherent memory spaces, including stack spaces needed by the application.
  • In an alternative embodiment, instead of loading the application binary on initiator 206 in step 436, initiator 206 sends to acceptor 208 a command which contains instructions about how to load the application binary, and acceptor 208 processes these instructions to load the application binary on itself.
  • Referring to FIG. 4D, coherent execution threads are established by starting an execution thread using the just-created stack. In step 484, a command to form a dual execution thread corresponding to the execution thread on the local node is sent to acceptor node 208. In step 486, the thread information is returned. The dual thread is paused or parked, awaiting a control transfer request from the local node. When execution moves from one node to another, the register state of the local thread is recorded and sent to the other node as the local thread is parked. The other node receives the register state and uses it to resume the parked dual thread. In this way, the previously active thread becomes the inactive thread, and the inactive thread becomes the currently active thread. The movement of the active thread is further described with respect to FIGS. 6A and 6B.
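  • The following Python sketch illustrates this park/resume handshake under stated assumptions: the 'registers' dict stands in for the real CPU register state, the network transport is omitted, and DualThread and migrate are names invented for this example.

      import threading

      class DualThread:
          """Shadow thread that stays parked until a control transfer arrives."""
          def __init__(self):
              self.register_state = None
              self._wakeup = threading.Event()

          def park(self):
              self._wakeup.wait()          # parked: await a control transfer

          def resume(self, register_state):
              self.register_state = dict(register_state)  # install migrated state
              self._wakeup.set()           # the parked dual becomes active

      def migrate(local_registers, remote_dual):
          # Record the local thread's register state and hand it to the dual
          # thread on the other node; the local thread then parks.
          remote_dual.resume(local_registers)

      dual = DualThread()
      threading.Thread(target=dual.park).start()
      migrate({'pc': 0x401000, 'sp': 0x7fff0000}, dual)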
  • An MSI-coherence protocol applied to pages maintains coherence between memory spaces on the nodes so that the threads of the runtime are operable on any of the nodes. A modified (state ‘M’) memory page in one node is considered invalid (state ‘I’) in another. A shared (state ‘S’) memory page is considered read-only in both nodes. A code or data access to a memory page that is pinned to acceptor node 208 causes execution migration of the thread to acceptor node 208 followed by migration of the page; a data access to a memory page that is migratory triggers a migration of that memory page in a similar manner. In an alternate embodiment, upon a fault caused by an instruction accessing a code or data page on acceptor node 208, only the instruction is executed on the node having the code or data page, and the results of the instruction are transferred over the network to the acceptor node.
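  • A hedged sketch of the per-page MSI bookkeeping described above follows; the real protocol also transfers page contents and triggers thread or page migration on faults, and the class and node names here are illustrative.

      from enum import Enum

      class PageState(Enum):
          MODIFIED = 'M'   # writable on one node, invalid elsewhere
          SHARED = 'S'     # read-only on every node holding a copy
          INVALID = 'I'    # no valid copy on this node

      class CoherentPage:
          def __init__(self):
              # One state per node; initially the initiator owns the page.
              self.state = {'initiator': PageState.MODIFIED,
                            'acceptor': PageState.INVALID}

          def read(self, node):
              if self.state[node] is PageState.INVALID:
                  # Fetching the page demotes all copies to read-only shared.
                  for n in self.state:
                      self.state[n] = PageState.SHARED

          def write(self, node):
              # The writer takes the modified copy; all other copies are invalidated.
              for n in self.state:
                  self.state[n] = PageState.INVALID
              self.state[node] = PageState.MODIFIED

      page = CoherentPage()
      page.read('acceptor')    # both copies become 'S'
      page.write('acceptor')   # acceptor 'M', initiator 'I'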
  • FIGS. 5A-5B describe interactions of running the application on the initiator and acceptor nodes after the setup according to FIGS. 4A-4D is completed. These interactions include, in the course of executing the application on the initiator node, executing a library or other deployable module on the acceptor node. Executing the library or other deployable module involves 'faulting in' the code pages for the library or other deployable module and the data pages of the stack or other memory space, and moving execution back to the initiator node.
  • FIG. 5A depicts a flow of operations for running the initiator node, according to an embodiment. In step 502, acceptor node 208 is optionally pre-provisioned with the stack or memory pages anticipated for executing threads on acceptor node 208, as described below. In step 504, acceptor node 208 is optionally pre-provisioned with the functions of the library or other deployable module that the code is anticipated to need. In step 506, the state of the thread is set to running. In step 508, initiator node 206 executes application 314 using the now-running thread. In step 510, the thread determines whether the execution of a function of a library or other deployable module is needed. If not, then the thread continues execution of its workload. If execution of a library or module function is needed, then in step 512, a message is sent to acceptor node 208 to migrate the workload of the thread to acceptor node 208. In step 514, the state of the local thread is set to a parked state, which means that the thread is paused but runnable on behalf of a dual thread on acceptor node 208. In step 516, initiator node 206 awaits and receives a message to migrate the workload of the thread back to initiator node 206 after acceptor node 208 has finished executing the function of the library or other deployable module.
  • Pre-provisioning of the memory pages or stack pages is performed using DWARF-type (debugging with attributed record formats) debugger data. When initiator node 206 takes a fault on entry to the acceptor-pinned function, it analyzes the DWARF data for the target function, determines that it takes a pointer argument, sends the memory starting at the pointer to acceptor node 208, and sends the current page of the stack to acceptor node 208. The DWARF debugger data contains the addresses and sizes of all functions that can be reached from this point in the call graph, allowing the code pages to be sent to acceptor node 208 prior to being brought in by demand-paging. In this way, acceptor node 208 can pre-provision the memory it needs to perform its function prior to resuming execution.
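  • The following is a hypothetical Python sketch of that pre-provisioning decision. FUNCTION_INFO stands in for what a DWARF reader would produce (pointer-argument positions plus the address ranges of every function reachable from this point in the call graph); parsing real DWARF data, for example with a library such as pyelftools, is not shown, and all names and addresses are illustrative.

      # Illustrative stand-in for DWARF metadata about an acceptor-pinned function.
      FUNCTION_INFO = {
          'accel_kernel': {
              'pointer_args': [0],                       # argument 0 is a pointer
              'reachable_code': [(0x401000, 4096),       # (address, size) pairs of
                                 (0x402000, 8192)],      # reachable functions
          },
      }

      def pages_to_send(func_name, args, current_stack_page):
          """Collect pages to push to the acceptor before resuming execution."""
          pages = [('stack', current_stack_page)]        # current stack page always goes
          info = FUNCTION_INFO[func_name]
          for i in info['pointer_args']:
              pages.append(('memory_at', args[i]))       # memory the pointer refers to
          for addr, size in info['reachable_code']:
              pages.append(('code', addr, size))         # code pages ahead of demand-paging
          return pages

      print(pages_to_send('accel_kernel', [0x7f0000001000], 0x7fff0000))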
  • FIG. 5B depicts a flow of operations for running an acceptor node, according to an embodiment. In step 552, the state of the local thread is initially set to parked. In step 554, one of five events occurs on acceptor node 208. The events are 'migrate to acceptor', 'module fault', 'stack fault', 'application code execution', or 'default'. The module fault and stack fault, though specifically described, are examples of memory faults; other types of memory fault, such as a heap fault or a code fault, are handled in a similar manner and are not separately described.
  • If the event is 'migrate to acceptor', then the state of the local thread is set to running in step 556. Flow continues to step 574, which maintains the thread's current state, and to step 576, where acceptor node 208 determines whether the thread is terminated. If not, control continues to step 554 to await the next event, such as a 'module fault', a 'stack fault', or 'application code execution'.
  • If the event is a ‘module fault’, e.g., a library fault, then the state of the thread is set to parked in step 558, and in step 560, acceptor node 208 requests and receives a code page of the library or other deployable module not yet paged in from initiator node 206. In step 562, acceptor node 208 sets the state of the local thread to running, and the flow continues with the local thread running through steps 574, 576, 554 to await the next event if the thread is not terminated.
  • If the event is a 'stack fault', then the thread's state is set to parked in step 564, and acceptor node 208 requests and receives a stack page not yet paged in from initiator node 206. In step 568, the thread's state is set to running, and the flow continues through steps 574, 576, and 554 to await the next event, assuming no thread termination.
  • If the event is ‘application code execution’, then the state of the local thread is set to parked in step 570, and acceptor node 208 sends a ‘migrate control’ message to initiator node 206 in step 572. Flow continues through steps 574, 576, and 554 to await the next event.
  • If the event is ‘default’ (i.e., any other event), then the thread's state is maintained in step 574, and flow continues through steps 576 and 554 to await the next event.
  • If the thread terminates as determined in step 576, the stack is sent back to initiator node 206 in step 578, and flow continues at step 554, awaiting the next event. If no event occurs, then ‘default’ occurs, which loops via steps 574 and 554 to maintain the thread's current state.
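  • A condensed, hedged Python sketch of this acceptor-side event loop follows; the event names mirror the five cases in the text, and the stub functions stand in for node-to-node messages (they are not the patent's API).

      def request_page(kind):
          print(f'requesting {kind} page from initiator')   # steps 560/566

      def send_migrate_control():
          print('migrating control back to initiator')      # step 572

      def acceptor_loop(events):
          state = 'parked'                                  # step 552
          for event in events:                              # step 554
              if event == 'migrate to acceptor':
                  state = 'running'                         # step 556
              elif event in ('module fault', 'stack fault'):
                  state = 'parked'                          # steps 558/564
                  request_page(event)
                  state = 'running'                         # steps 562/568
              elif event == 'application code execution':
                  state = 'parked'                          # step 570
                  send_migrate_control()
              # any other event ('default') keeps the current state (step 574)
          return state

      acceptor_loop(['migrate to acceptor', 'module fault',
                     'application code execution'])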
  • Often in the course of execution of the application, operating system services are needed. The application, via the runtime on a particular node, makes system calls to the operating system to obtain these services. However, the particular node making the system call may not have the resources for executing the system call. In these cases, the execution of the system call is moved to a node having the resources. FIGS. 6A-6C depict the flow of operations to execute and possibly move execution of a system call. Specifically, FIG. 6A depicts a flow of operations for implementing a system call on the initiator node, according to an embodiment. FIG. 6B depicts a flow of operations for implementing a system call on the acceptor node, according to an embodiment. FIG. 6C depicts a flow of operations for implementing a Detect Local function, according to an embodiment.
  • Referring to FIG. 6A, in step 602, a thread running in the local node makes a system call. In step 604, the application monitor on the local node receives the system call via a program that is responsible for manipulating interactions with the virtualization boundary (called VpExit below). In step 606, the application monitor determines whether the arguments involve local or remote resources. In step 608, if the system call involves remote resources ('No' branch), then the running thread is parked, and in step 610, the application monitor sends the system call and its arguments to the application monitor on the remote node that is to handle the system call. In step 612, the application monitor on the local node awaits completion and results of the system call, and in step 614, the running thread receives the results of the system call (via VpExit) and is made active again. In step 608, if the system call involves only local resources ('Yes' branch), then the local node handles the system call in step 616.
  • Referring now to FIG. 6B, in step 632, the application monitor on the remote node receives the system call and its arguments. In step 634, the state of the parked thread is set to active (i.e., running) and the remote node handles the system call in step 636. In step 638, the results of the system call are returned to the thread that made the call, which provides in step 640 the results to the application monitor, after which in step 642, the state of the thread is set back to the parked state. In step 644, the application monitor sends the completion and results back to the local node.
  • Referring now to FIG. 6C, the flow of operations depicted in the figure occurs in response to executing step 606. In step 652, the function gets all of the system call arguments and in step 654 determines for system calls, other than a file access, whether the arguments interact with a resource pinned on another node, which is either a different acceptor node or the initiator node. If so, then the function returns ‘True’ in step 656. Otherwise, the function returns ‘False’ in step 658. If the system call is a file access, then the flow executes step 655, which is further described with reference to FIG. 9.
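  • A hedged Python sketch of this routing decision follows; the pinned-resource table and function names are illustrative stand-ins, and the file-access branch defers to the FIG. 9 test described below.

      PINNED = {'/dev/accel0': 'acceptor'}    # resource -> node that pins it

      def file_is_remote(arg, current_node):
          return False                        # placeholder for the FIG. 9 fd test

      def execute_locally(syscall, args):
          return ('local', syscall, args)     # stand-in for the 'Yes' branch

      def forward_to_remote(syscall, args):
          return ('remote', syscall, args)    # stand-in: park, send, await result

      def detect_remote(syscall, args, current_node):
          # Step 654: for calls other than a file access, check whether any
          # argument interacts with a resource pinned on another node.
          if syscall in ('read', 'write', 'lseek'):          # file access: step 655
              return file_is_remote(args[0], current_node)
          return any(PINNED.get(a, current_node) != current_node for a in args)

      def handle_syscall(syscall, args, current_node):
          if detect_remote(syscall, args, current_node):     # 'True' branch
              return forward_to_remote(syscall, args)
          return execute_locally(syscall, args)              # 'False' branch

      print(handle_syscall('ioctl', ['/dev/accel0'], 'initiator'))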
  • FIG. 7 depicts a flow of operations for loading a program file and a dynamic linker, according to an embodiment. The flow of operations of FIG. 7 describes in more detail the step of loading the application according to step 436 of FIG. 4C, where the loading is performed by the operating system, the application monitor, and the dynamic linker.
  • In step 702, application monitor 340 loads the ELF program file and gets a file system path for the ELF interpreter binary. In step 706, application monitor 340 prepares an initial stack frame for a binary of application program 314 (hereinafter referred to as “primary binary”). In step 708, application monitor 340 acquires the primary binary using the ELF interpreter, informs the binary of the initial stack frame, and starts DL 344, which was loaded by operating system 310. In step 710, DL 344 runs, and in step 712, DL 344 relocates the primary binary and DL 344 to executable locations, which are locations in system memory from which code execution is allowed by the OS. In step 714, DL 344 loads the program dependencies (of the library or other deployable module) and alters the system call table to intercept all system calls made by the primary binary. Some system calls are allowed through unchanged, while others are altered when DL 344 interacts with operating system 310. In step 716, DL 344 causes the relocated primary binary of application program 314 to run at the executable location. As a result, both application program 314 and DL 344 run in userspace. Running in userspace allows loading of the library or other deployable module to occur within the virtualization boundary.
  • DL 344 can replace certain function calls that go through the library or other deployable modules with customized versions to add functional augmentation based on known semantics. In allocating address space using 'mmap' or 'sbrk', DL 344 assures, via the application monitor, that threads see a consistent view of the address space, so execution of threads may migrate over the nodes. In addition, a 'ptrace' system call is used to track the execution of DL 344 to find how it interacts with operating system 310. Interactions are then rewritten so that they run coherently between initiator node 206 and acceptor node 208. Ultimately, all interactions with operating system 310 go through symbols defined by DL 344 or resolved through DL 344.
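  • The interposition idea can be sketched in Python as a call table whose entries the DL swaps for augmented versions; the names below are invented for this example, and the real DL patches ELF symbol bindings and the system call path rather than a dictionary.

      import mmap

      def notify_monitor(region):
          # Stand-in for the coherence message: in the real system the
          # application monitor propagates the new mapping to the other nodes.
          pass

      def coherent_mmap(length):
          # Augmented allocation: create the mapping, then tell the monitor so
          # every node sees a consistent view of the address space.
          region = mmap.mmap(-1, length)
          notify_monitor(region)
          return region

      CALL_TABLE = {'mmap': coherent_mmap}    # entries the DL has rewritten

      buf = CALL_TABLE['mmap'](4096)          # callers resolve through the table
      buf.write(b'hello')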
  • FIGS. 8A-8D describe the components and operations in more detail during the setup of the initiator node and acceptor node corresponding to steps 404, 442, 446, 450, 464, 466 of FIGS. 4A-4D. Specifically, FIG. 8A depicts components in an initiator node and an acceptor node involved in setting up the initiator and acceptor nodes, according to an embodiment. FIG. 8B depicts a flow of operations between the initiator and acceptor nodes during address space synchronization, according to an embodiment. FIG. 8C depicts a flow of operations between the initiator and acceptor nodes during the creation of a coherent application, according to an embodiment. FIG. 8D depicts a flow of operations between the initiator and acceptor nodes during the establishment of runtimes, according to an embodiment.
  • Referring to FIG. 8A, initiator node 206 includes a VProcess 802, a Runtime module 804, a Bootstrap module 806, and a VpExit module 808. Acceptor node 208 includes similar components 822, 824, 826, 828 as on initiator node 206, along with an Init module 830. VpExit modules 808 and 828 are responsible for manipulating VProcess 802 and 822 interactions across their respective virtualization boundaries.
  • Referring now to FIG. 8B, in step 832, the acceptor Init module 830 receives a ‘hello function’ designating the address space from initiator Runtime 804. In step 834, acceptor Init module 830 sends a ‘create VpExit’ message to acceptor Bootstrap module 826. In step 836, acceptor Init module 830 sends an acknowledgment regarding the address space message back to initiator node 206. At this point, a synchronized address space is established between initiator 206 and acceptor 208.
  • Referring to FIG. 8C, in step 838, initiator node 206 sends a ‘create VpExit’ message to initiator Bootstrap module 806. In step 840, initiator node 206 sends a ‘create’ message to VProcess 802 of initiator node 206, which receives a ‘load VpExit’ message in step 842 from initiator 206. At this point, VProcess 802 is created outside of the Remote Procedure Call (RPC) layer, and the resources that VProcess 802 uses are virtualized. In step 844, VProcess 802 sends a ‘Mmap’ message to VpExit module 808 of initiator node 206, which sends a ‘mmap’ message in step 845 to initiator 206 and an ‘update the address map’ message in step 846 to Bootstrap module 826 of acceptor node 208. In step 848, Bootstrap module 826 of acceptor node 208 sends an acknowledgment (‘ok’) back to initiator node 206, which relays the message in step 850 to VpExit module 808, which relays the message to VProcess 802 in step 852. At this point, the address map of the application on the initiator is made coherent with the acceptor node.
  • Referring to FIG. 8D, in step 854, initiator VProcess 802 sends a ‘VpExit(Enter, hook_page)’ message to VpExit module 808. In step 856, VpExit module 808 sends an ‘Enter(hook_page)’ message to initiator Bootstrap module 806. In step 858, initiator Bootstrap module 806 sends a ‘create(VpExit)’ message to initiator Runtime 804. In step 860, initiator Bootstrap module 806 sends a ‘bootstrap(Runtime, hook_page)’ message to acceptor Bootstrap module 826, which sends in step 862 an ‘install(VpExit, hook_page)’ message to acceptor Runtime module 824. In step 864, acceptor Runtime module 824 sends an ‘install(VpExit)’ message to acceptor VProcess 822. In step 866, acceptor Bootstrap module 826 sends a ‘Runtime’ message to initiator Bootstrap module 806, which returns in step 868 to VpExit module 808, which returns in step 870 to VProcess 802. At this point, initiator node 206 and acceptor node 208 have both created runtimes for VProcess 802 and VProcess 822, and the memory and address spaces for VProcess 802 and 822 are coherent.
  • During bootstrap, initiator node 206, in one embodiment, uses the system ‘ptrace’ facility to intercept system calls generated by the virtual process. The application monitor runs in the same address space as the virtual process, which means that the application monitor is in the same physical process as the virtual process. In one embodiment, Linux's clone(2) system call allows the virtual process to be traced. The virtual process issues SIGSTOP to itself, which pauses execution of the virtual process before allocating any virtual process resources. The application monitor attaches to the virtual process via ‘ptrace’, which allows it to continue execution (using SIGCONT) from the point at which the virtual process entered SIGSTOP. Using ‘ptrace’, the application monitor can intercept and manipulate any system calls issued by the virtual process to preserve the virtualization boundary. After bootstrap, VProcess interactions with the operating system are detected by the syscall intercept library.
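  • The stop/continue handshake can be demonstrated with the following minimal, Linux-only Python sketch. Real tracing uses ptrace(2) from the application monitor, for which Python's standard library has no binding, so only the SIGSTOP/SIGCONT portion is shown.

      import os
      import signal
      import sys

      pid = os.fork()
      if pid == 0:
          # Child (the virtual process): pause before allocating any resources.
          os.kill(os.getpid(), signal.SIGSTOP)
          print('virtual process resumed inside the virtualization boundary')
          sys.exit(0)

      # Parent (the application monitor): wait until the child has stopped.
      os.waitpid(pid, os.WUNTRACED)
      # ... here the monitor would attach via ptrace before continuing ...
      os.kill(pid, signal.SIGCONT)            # resume from the SIGSTOP point
      os.waitpid(pid, 0)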
  • FIG. 9 depicts a flow of operations for accessing a file, according to an embodiment. As mentioned above, a file system resides on each of the nodes. Access to one or more files in the file systems may be requested by the application during execution by making a system call. If the requested file resides on the node making the system call, the file is available locally. However, if the file resides on a different node (another acceptor node or the initiator node), the system call is remotely executed according to FIGS. 6A-6C. According to step 655 of FIG. 6C, the system call determines whether the arguments of the system call interact with a remote pinned resource, which is a file that is not local to the node receiving the system call. The steps of FIG. 9 depict the use of the file descriptor, returned during a previous system call in which the file was opened, to determine the node on which the system call is to be executed.
  • Referring to FIG. 9, in step 900, the flow tests the file descriptor against a criterion. In one embodiment, the criterion is whether the file descriptor obtained in step 654 of FIG. 6C (during an open(filename) or other system call which returns the file descriptor fd) is even or not. If the file descriptor is an even integer, as determined in step 902, initiator node 206 is determined to have the file in step 904 because only files with even fds can be stored on the initiator. If the current node is initiator node 206, as determined in step 910, then a ‘False’ value is returned in step 916. The ‘False’ value indicates that the system call arguments do not interact with a remote pinned resource, and the system call is handled locally. If the current node is acceptor node 208 as determined in step 912, then a ‘True’ value is returned in step 914. The ‘True’ value indicates that the system call arguments do interact with a remote pinned resource, and the system call is to be handled remotely.
  • If the file descriptor is an odd integer, then acceptor node 208 is determined to have the file in step 906 because only files with odd fds can be stored on an acceptor node; when there is more than one acceptor node, the remainder fd mod the number of acceptor nodes selects which acceptor node stores the file. If the current node is acceptor node 208, then a ‘False’ value is returned in step 916, indicating that the needed resource is local. Otherwise, a ‘True’ value is returned in step 914, indicating that the needed resource is remote.
  • In an alternative embodiment, the criterion is whether the file descriptor is less than a specified integer, say 512. If so, as determined in step 902, initiator node 206 is determined to have the file in step 904 because only files with fds less than 512 are stored on the initiator. If the current node is initiator node 206, as determined in step 910, then a ‘False’ value is returned in step 916. The ‘False’ value indicates that the system call arguments do not interact with a remote pinned resource, and the system call is handled locally. If the current node is acceptor node 208 as determined in step 912, then a ‘True’ value is returned in step 914. The ‘True’ value indicates that the system call arguments do interact with a remote pinned resource, and the system call is to be handled remotely.
  • If the file descriptor is 512 or greater, then acceptor node 208 is determined to have the file in step 906 because only files with fds of 512 or greater are stored on the acceptor node. If the current node is acceptor node 208, a ‘False’ value is returned in step 916; otherwise, a ‘True’ value is returned in step 914.
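  • Both placement criteria can be captured in a short, hedged Python sketch; the node names, default threshold, and function names are illustrative only.

      def node_for_fd(fd, criterion='parity', threshold=512, num_acceptors=1):
          """Map a file descriptor to the node assumed to store its file."""
          if criterion == 'parity':
              if fd % 2 == 0:
                  return 'initiator'              # even fds live on the initiator
              # Odd fds live on an acceptor; with several acceptors, the
              # remainder selects which one.
              return f'acceptor-{fd % num_acceptors}'
          # Threshold criterion: small fds on the initiator, the rest remote.
          return 'initiator' if fd < threshold else 'acceptor-0'

      def is_remote(fd, current_node, **kwargs):
          # 'True' means the system call must be forwarded (FIG. 6A, step 610).
          return node_for_fd(fd, **kwargs) != current_node

      print(is_remote(7, 'initiator'))   # True: odd fd is pinned to an acceptor
      print(is_remote(8, 'initiator'))   # False: even fd is local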
  • Certain embodiments as described above involve a hardware abstraction layer on top of a host computer. The hardware abstraction layer allows multiple contexts to share the hardware resource. These contexts are isolated from each other in one embodiment, each having at least a user application program running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the contexts. In the foregoing embodiments, virtual machines are used as an example for the contexts and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application program runs. It should be noted that these embodiments may also apply to other examples of contexts, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers, each including an application program and its dependencies. Each OS-less container runs as an isolated process in userspace on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application program's view of the operating environment. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained to use only a defined amount of resources such as CPU, memory, and I/O.
  • Certain embodiments may be implemented in a host computer without a hardware abstraction layer or an OS-less container. For example, certain embodiments may be implemented in a host computer running a Linux® or Windows® operating system.
  • The various embodiments described herein may be practiced with other computer system configurations, including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
  • One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer-readable media. The term computer-readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer-readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer-readable medium include a hard drive, network-attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc) such as a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer-readable medium can also be distributed over a network-coupled computer system so that the computer-readable code is stored and executed in a distributed fashion.
  • Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation unless explicitly stated in the claims.
  • Plural instances may be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).

Claims (20)

What is claimed is:
1. A method for allocating and using file descriptors for an application executing over a plurality of nodes, including a first node and a second node each having a file system, the method comprising:
executing a system call from the application running on the first node to access a file in a file system;
determining whether the file resides in a file system of the first node or the second node;
upon determining that the file resides on the second node, sending the system call and arguments of the system call to the second node for execution on the second node, and returning a result to the application on the first node.
2. The method of claim 1, wherein a file descriptor is obtained by the application performing an open system call using the file name of the file to be accessed, and determining whether the file resides in a file system of the first node or the second node includes testing the file descriptor against a criterion.
3. The method of claim 1, wherein the system call from the application running on the first node is executed by a thread running on the first node, said method further comprising:
setting the thread to a parked state when the system call and arguments are sent to the second node for execution.
4. The method of claim 3, further comprising:
setting the thread to a running state when the result is returned to the application on the first node.
5. The method of claim 4, further comprising:
if the file resides on the first node, handling the system call on the first node and returning the result to the application on the first node.
6. The method of claim 2, wherein only files with a file descriptor meeting the criterion are stored on the first node.
7. The method of claim 2, wherein only files with a file descriptor not meeting the criterion are stored on the second node.
8. A system for allocating and using file descriptors for an application executing over a plurality of nodes, the system comprising:
a first node having a file system installed thereon; and
a second node having a file system installed thereon, wherein the first node is configured to:
in response to a system call to access a file made by the application running on the first node:
determine whether the file resides in a file system of the first node or the second node; and
upon determining that the file resides on the second node, send the system call and arguments thereof to the second node for execution on the second node, and return the result to the application on the first node.
9. The system of claim 8, wherein a file descriptor is obtained by the application performing an open system call using the file name of the file to be accessed, and determining whether the file resides in a file system of the first node or the second node includes testing the file descriptor against a criterion.
10. The system of claim 8, wherein the system call from the application running on the first node is executed by a thread running on the first node, and the first node is further configured to set the thread to a parked state when the system call and arguments are sent to the second node for execution.
11. The system of claim 10, wherein the first node is further configured to set the thread to a running state when the result is returned to the application on the first node.
12. The system of claim 8, wherein the first node is further configured to:
if the file resides on the first node, handle the system call and return the result to the application.
13. The system of claim 9, wherein only files with a file descriptor meeting the criterion are stored on the first node.
14. The system of claim 9, wherein only files with a file descriptor not meeting the criterion are stored on the second node.
15. A non-transitory computer-readable medium comprising instructions,
which, when executed, carry out a method for allocating and using file descriptors for an application executing on a plurality of nodes including a first node and a number of second nodes, the method comprising:
executing a system call from the application running on the first node to access a file in a file system;
determining whether the file resides in a file system of the first node or one of the second nodes;
upon determining that the file resides on one of the second nodes, sending the system call and arguments of the system call to that second node for execution on that second node and returning the result of the system call executed on that second node to the application on the first node.
16. The non-transitory computer-readable medium of claim 15,
wherein a file descriptor is obtained by the application performing an open system call using the file name of the file to be accessed, and determining whether the file resides in a file system of the first node or one of the second nodes includes testing the file descriptor against a criterion.
17. The non-transitory computer-readable medium of claim 15, wherein the system call from the application running on the first node is executed by a thread running on the first node and said method further comprises:
setting the thread to a parked state when the system call and arguments are sent to the second node for execution.
18. The non-transitory computer-readable medium of claim 17, wherein the method further comprises:
setting the thread to a running state when the result is returned to the application on the first node.
19. The non-transitory computer-readable medium of claim 16, wherein only files with a file descriptor meeting the criterion are stored on the first node.
20. The non-transitory computer-readable medium of claim 16, wherein only files with a file descriptor not meeting the criterion are stored on the second node.
US17/493,794 2021-03-23 2021-10-04 Allocating and using file descriptors for an application executing on a plurality of nodes Pending US20220308940A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/493,794 US20220308940A1 (en) 2021-03-23 2021-10-04 Allocating and using file descriptors for an application executing on a plurality of nodes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163164955P 2021-03-23 2021-03-23
US17/493,794 US20220308940A1 (en) 2021-03-23 2021-10-04 Allocating and using file descriptors for an application executing on a plurality of nodes

Publications (1)

Publication Number Publication Date
US20220308940A1 true US20220308940A1 (en) 2022-09-29

Family

ID=83364574

Family Applications (4)

Application Number Title Priority Date Filing Date
US17/493,781 Active 2042-04-14 US11762672B2 (en) 2021-03-23 2021-10-04 Dynamic linker for loading and running an application over a plurality of nodes
US17/493,741 Pending US20220308936A1 (en) 2021-03-23 2021-10-04 Application-level virtualization
US17/493,783 Pending US20220308950A1 (en) 2021-03-23 2021-10-04 Handling system calls during execution of an application over a plurality of nodes
US17/493,794 Pending US20220308940A1 (en) 2021-03-23 2021-10-04 Allocating and using file descriptors for an application executing on a plurality of nodes

Family Applications Before (3)

Application Number Title Priority Date Filing Date
US17/493,781 Active 2042-04-14 US11762672B2 (en) 2021-03-23 2021-10-04 Dynamic linker for loading and running an application over a plurality of nodes
US17/493,741 Pending US20220308936A1 (en) 2021-03-23 2021-10-04 Application-level virtualization
US17/493,783 Pending US20220308950A1 (en) 2021-03-23 2021-10-04 Handling system calls during execution of an application over a plurality of nodes

Country Status (1)

Country Link
US (4) US11762672B2 (en)

Also Published As

Publication number Publication date
US20220308936A1 (en) 2022-09-29
US20220308950A1 (en) 2022-09-29
US11762672B2 (en) 2023-09-19
US20220308898A1 (en) 2022-09-29
