CN108932154B - Distributed virtual machine manager - Google Patents

Distributed virtual machine manager

Info

Publication number
CN108932154B
Authority
CN
China
Prior art keywords
distributed
memory access
node
operating system
virtual machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810811512.4A
Other languages
Chinese (zh)
Other versions
CN108932154A (en)
Inventor
陈育彬
丁卓成
张晋
戚正伟
管海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810811512.4A priority Critical patent/CN108932154B/en
Publication of CN108932154A publication Critical patent/CN108932154A/en
Application granted granted Critical
Publication of CN108932154B publication Critical patent/CN108932154B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45541 Bare-metal, i.e. hypervisor runs directly on hardware
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a distributed virtual machine manager, comprising: a virtual machine management module, which comprises a distributed shared bus module and a distributed non-uniform memory access module running on each physical machine node, and which, through these two modules, abstracts a unified interface over massive resources into one virtual machine provided to an upper-layer guest operating system; and a guest operating system, used by the distributed non-uniform memory access module to implement the dNUMA-TSO model and NUMA affinity settings. Through a virtualization aggregation technology, the distributed virtual machine manager abstracts the heterogeneous resources on a plurality of physical machines into one virtual machine and provides massive resources to the guest operating system running on the upper layer, thereby meeting application scenarios with extremely high resource and performance requirements.

Description

Distributed virtual machine manager
Technical Field
The invention relates to the technical field of computer virtualization and distributed architectures, and in particular to a distributed virtual machine manager.
Background
Existing virtual machine managers all run on a single machine.
With the development of technologies such as machine learning and the end of Moore's law, the resources on a single physical machine can no longer meet demand. There are two existing ways to use massive resources. One is to add more hardware to a single machine, as in a supercomputer. The other is to run a distributed system such as Spark or Hadoop on a physical cluster consisting of multiple machines. The former, however, is very expensive, while the latter requires program developers to rewrite their code, and programming models such as Map-Reduce are not very friendly.
Virtualization technology has many meanings; what is referred to herein is system virtualization, i.e., virtualization software provides virtualization of a hardware Instruction Set Architecture (ISA) to the client, and the client software is a complete operating system such as Linux or Windows. Existing system virtualization solutions, such as Hyper-V, VMware and VirtualBox, can only let one physical machine provide virtual resources to multiple guest operating systems, i.e., "one virtual to many".
Disclosure of Invention
In view of the foregoing disadvantages in the prior art, an object of the present invention is to provide a Distributed Virtual Machine Manager (dVMM), which abstracts a consistent interface over massive heterogeneous resources into one virtual machine and provides it to an upper-layer guest operating system. Specifically, the distributed virtual machine manager abstracts heterogeneous resources on a plurality of physical machines into one virtual machine through a virtualization aggregation technology, providing massive virtualized CPU/GPU/FPGA, memory and I/O resources to the guest operating system running on the upper layer, thereby meeting application scenarios with extremely high resource and performance requirements.
The dVMM consists of two modules: a Distributed Virtual Bus (dVB) and Distributed Non-Uniform Memory Access (dNUMA). dVB emulates CPU and I/O resources using interrupt forwarding, while dNUMA virtualizes memory resources using distributed shared memory technology. After the guest operating system begins to run, most instructions complete directly on the local machine. When the guest operating system encounters an operation that cannot be handled locally, it exits to the dVMM, which calls dVB or dNUMA depending on the type of operation; after the processing finishes, the guest operating system resumes running.
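By way of non-limiting illustration, this exit-dispatch loop can be sketched in C as follows; all names used here (exit_reason_t, run_guest_until_exit, dvb_handle_io, dnuma_handle_fault) are illustrative assumptions rather than the actual dVMM interface:

    struct vcpu;                                    /* opaque guest vCPU handle */
    typedef enum { EXIT_NONE, EXIT_IO, EXIT_EPT_FAULT } exit_reason_t;

    exit_reason_t run_guest_until_exit(struct vcpu *v); /* enter non-root mode  */
    void dvb_handle_io(struct vcpu *v);             /* CPU / I-O path (dVB)     */
    void dnuma_handle_fault(struct vcpu *v);        /* memory path (dNUMA)      */

    void dvmm_run_vcpu(struct vcpu *v)
    {
        for (;;) {
            switch (run_guest_until_exit(v)) {
            case EXIT_NONE:                /* instruction completed locally     */
                break;
            case EXIT_IO:                  /* interrupt or I/O operation: dVB   */
                dvb_handle_io(v);
                break;
            case EXIT_EPT_FAULT:           /* page-permission violation: dNUMA  */
                dnuma_handle_fault(v);
                break;
            }
            /* after handling, the guest operating system resumes running */
        }
    }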
The invention is realized by the following technical scheme.
A distributed virtual machine manager (dVMM), comprising:
the virtual machine management module comprises a distributed shared bus module and a distributed non-uniform memory access module which run on each physical machine node, and the uniform interface of mass resources is abstracted into a virtual machine through the distributed shared bus module and the distributed non-uniform memory access module and is provided for an upper-layer client operating system; wherein:
the distributed shared bus module (dVB) enables CPUs between different physical machine nodes to communicate with each other and enables different I/O devices to be mounted on different physical machine nodes, providing abstraction of virtual CPUs and virtual I/O devices;
the distributed non-uniform memory access module (dNUMA) enables memory resources on different physical machine nodes to be shared and provides an abstraction of distributed shared memory;
a guest operating system, used by the distributed non-uniform memory access module to build the dNUMA-TSO model and to perform NUMA affinity setting.
Preferably, the distributed shared bus module is provided with an interrupt chip, and the interrupt chip performs interrupt forwarding on interrupts which cannot be processed locally according to an instruction of the distributed shared bus module.
Preferably, the distributed shared bus module further provides abstraction of virtual heterogeneous devices; wherein:
the distributed shared bus module maintains a global page table for heterogeneous devices, used to maintain the mapping of the physical addresses of the heterogeneous devices on any two physical machines; the physical addresses corresponding to the two physical machines involved in an access are looked up through the global page table, and interrupt forwarding is then performed to complete the virtualization of the heterogeneous device.
Preferably, the distributed non-uniform memory access module uses a distributed shared memory protocol to synchronize data between the physical machine nodes, and weakens the degree of consistency through the dNUMA-TSO model; wherein:
the dNUMA-TSO model regards a local page of a physical machine node as a write cache; when merging writes, dNUMA uses the Lamport logical clock to determine the order of pages on each physical machine, and then merges the pages according to this order.
Preferably, the method by which the distributed non-uniform memory access module synchronizes data between physical machine nodes using a distributed shared memory protocol includes the following steps:
step S1: initializing page table control authority;
step S2: the guest operating system starts to run, and when the guest operating system has a page fault exception, step S3 is executed;
step S3: when the page fault exception occurs, the guest operating system exits to the virtual machine management module, the distributed non-uniform memory access module sets the appropriate permission and reads the appropriate data according to the distributed shared memory protocol, and the process then returns to step S2.
Here, the appropriate permission refers to one of the Modified, Shared and Invalid states of the MSI protocol, and the appropriate data means that the DSM reads the latest data from the Owner according to the MSI protocol.
Preferably, the distributed non-uniform memory access module uses a high-speed network for network communication, and reduces network traffic by using compression optimization.
Preferably, the compression optimization algorithm comprises the following steps:
when node p_a transmits a page P to node p_b, if the data of P thereafter no longer changes, i.e. its state is not readable-and-writable, then p_a records the value of P as P_0, called a twin, and records that the value of this twin is held on p_b;
when node p_a transmits a page P to node p_b, if p_a holds a twin value and this twin value is on p_b, then only the difference between the current page value P_1 on p_a and the twin value P_0 is transmitted, denoted diff(P_0, P_1);
wherein node p_a and node p_b are two physical machine nodes that pass pages to each other.
Preferably, the NUMA affinity setting employs either of the following methods:
in method one, a machine learning algorithm identifies processes that frequently access remote physical machine nodes, and the scheduler of the guest operating system is notified to set a reasonable affinity; the distributed non-uniform memory access module and the guest operating system share one block of memory, the distributed non-uniform memory access module fills the shared memory with the predictions of the real-time machine learning process, and the guest operating system assigns processes that frequently access remote memory to CPUs of the remote physical machine nodes according to these results;
method two adopts the same machine learning algorithm as method one, interrupts a running CPU that frequently accesses remote memory, packs the state of that CPU and sends it to the remote physical machine node; after resuming, the migrated CPU turns remote memory accesses into local memory accesses.
Preferably, the packed CPU state includes a complete context that enables the CPU to run;
the full context that enables the CPU to run includes: register status and/or interrupt chip status.
Preferably, the distributed non-uniform memory access module is further provided with an accelerator; the accelerator adopts an FPGA and is used for offloading critical operations of the virtual machine management module to the FPGA.
Here, the critical operation refers to the following: according to the principle of the page fault exception, if an operation performed by the CPU violates the permission in the page table, the MMU (Memory Management Unit) triggers a page fault exception; this process is now taken over by the FPGA. The problem with the MMU is that a page is 4KB in size, so modifying even one or a few bytes on a page causes the whole page to be transmitted over the network, and thrashing can be severe. The significance of using the FPGA is that the page size can be set to 128B (i.e., a sub-page), which reduces the probability that a modification of one or a few bytes triggers a network transmission.
Compared with the prior art, the invention has the following beneficial effects:
1. a distributed virtual machine manager is provided;
2. applications running on the guest operating system can use massive resources, including computing, memory, and I/O resources;
3. applications do not need to be modified;
4. multiple physical machines are virtualized to provide a unified interface to the guest operating system, so that programs can use massive resources without changing their programming model;
5. a mechanism is provided by which a plurality of physical machines are virtualized into one virtual machine, i.e., "many virtual into one";
6. the system can run on a cheaper cluster of physical machines, and applications can run directly on top of the dVMM without modification.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram of a distributed virtual machine manager (dVMM) architecture provided in the present invention;
FIG. 2 is a schematic diagram of the distributed shared bus module (dVB) architecture of the distributed virtual machine manager provided in the present invention;
FIG. 3 is a schematic diagram of the distributed non-uniform memory access module (dNUMA) architecture of the distributed virtual machine manager provided in the present invention;
FIG. 4 is a state machine diagram of MSI protocol state transitions;
FIG. 5 is a schematic diagram of x86-TSO according to an embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail. The embodiments are implemented on the premise of the technical solution of the invention and give detailed implementation modes and specific operation processes. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of the present invention.
Examples
The present embodiment provides a distributed virtual machine manager, comprising:
the virtual machine management module comprises a distributed shared bus module and a distributed non-uniform memory access module which run on each physical machine node, and the uniform interface of mass resources is abstracted into a virtual machine through the distributed shared bus module and the distributed non-uniform memory access module and is provided for an upper-layer client operating system; wherein:
the distributed shared bus module enables CPUs (central processing units) among different physical machine nodes to communicate with each other, enables different I/O (input/output) equipment to be mounted on different physical machine nodes, and provides abstraction of a virtual CPU (central processing unit) and the virtual I/O equipment;
the distributed non-uniform memory access module enables memory resources on different physical machine nodes to be shared and provides abstraction of distributed shared memory;
a guest operating system, used by the distributed non-uniform memory access module to implement the dNUMA-TSO model and the NUMA affinity settings.
A Distributed Virtual Machine Manager (dVMM) can, through virtualization technology, make the CPUs/GPUs/FPGAs, memory and I/O on multiple physical machines usable by a guest operating system.
The dVMM provides resources on multiple physical machines to the guest operating system. Fig. 1 shows the architecture of the dVMM: each machine runs the dVB software and the dNUMA software, the machines are interconnected via a high-speed network, and each physical machine is loaded with different devices; dVB connects these devices in series, abstracting one virtual machine. Ultimately the dVMM provides a vast amount of usable resources to the upper-layer guest operating system.
dVB is an emulation of the bus structure. Through the devices emulated in QEMU and the interrupt forwarding of dVB, the I/O devices and the computing resources (e.g., CPU/GPU) on multiple physical machine nodes can communicate with each other, presenting a complete virtual machine architecture.
dNUMA uses a Distributed Shared Memory (DSM) protocol to synchronize memory between physical nodes. The DSM protocols that may be employed include, but are not limited to, the MSI protocol. The MSI protocol divides the entire memory address space into multiple pages (e.g., 4KB), each having one of several states: Modified (readable and writable), Shared (read only), and Invalid (unreadable and unwritable). When a program running on the DSM reads or writes any memory, the corresponding operation is carried out according to the state of the current page. Each page in the whole address space has an Owner, the machine node holding the latest data of the page. When an illegal operation occurs, i.e. writing a Shared page or reading or writing an Invalid page, the dNUMA software goes to the Owner to read the latest value. The Owner is managed by an Owner management node. When a page is written, its state changes to Modified through the page fault exception; when a page is read, its state changes to Shared through the page fault exception.
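For illustration only, the page states and the transition taken on a faulting access can be sketched as below; this is a minimal sketch, and the names msi_state_t, msi_access_allowed and msi_after_fault are assumptions, not part of the dVMM source:

    typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state_t;

    /* Is a guest access legal in the current page state? (cf. Table 1) */
    int msi_access_allowed(msi_state_t s, int is_write)
    {
        return is_write ? (s == MSI_MODIFIED)
                        : (s == MSI_MODIFIED || s == MSI_SHARED);
    }

    /* State reached after dNUMA services a faulting access: a write
     * fetches the latest copy from the Owner, invalidates other copies
     * and makes this node the new Owner (Modified); a read fetches the
     * latest copy and leaves the page read-only (Shared). */
    msi_state_t msi_after_fault(int is_write)
    {
        return is_write ? MSI_MODIFIED : MSI_SHARED;
    }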
The dVMM is developed on the basis of QEMU-KVM (other virtualization software with similar functions may be used as well), wherein dVB is developed on the basis of QEMU and dNUMA on the basis of KVM. QEMU-KVM employs a hardware-assisted virtualization scheme based on Intel VT-x (AMD provides the similar AMD-V). Taking Intel as an example, VT-x divides the environment in which instructions run into two parts: root mode and non-root mode. The guest operating system mostly runs in non-root mode and, when emulation is required, triggers a special trap, VMExit, to exit into the virtual machine emulation software running in root mode. VT-x uses a data structure called the EPT to describe the usage rights of pages of the memory address space. When the guest operating system performs an operation violating a page's permissions, VT-x triggers an EPT violation exit, the guest operating system exits to the virtualization software QEMU-KVM, and the dVMM then performs the next operation.
The RDMA high-speed network is a new network transport technology. Its greatest characteristic is that the kernel can be bypassed: the network card copies data directly to the memory space designated by the user, which removes the overhead of the kernel network protocol stack. RDMA latency is about two orders of magnitude lower than that of a typical TCP/IP network.
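As a sketch of what such a one-sided transfer looks like with the libibverbs API (assuming an already-connected queue pair qp, a protection domain pd, and the peer's remote address and rkey exchanged out of band; error handling and completion polling abbreviated):

    #include <infiniband/verbs.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Post a one-sided RDMA write: the remote NIC places the bytes
     * directly into registered memory without involving the remote CPU.
     * In practice the memory registration would be cached, not redone
     * per transfer; deregistration is omitted in this sketch. */
    int rdma_write_buf(struct ibv_pd *pd, struct ibv_qp *qp,
                       void *buf, size_t len, uint64_t raddr, uint32_t rkey)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
            .wr.rdma    = { .remote_addr = raddr, .rkey = rkey },
        };
        struct ibv_send_wr *bad = NULL;
        return ibv_post_send(qp, &wr, &bad); /* completion reaped via the CQ */
    }

Because the write is one-sided, the remote CPU is never interrupted; this is the property that makes RDMA attractive for dNUMA's page transfers.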
The embodiment provides an implementation method of a Distributed Virtual Machine Manager (dVMM) for meeting the virtualization requirement of using massive Distributed resources in a cloud computing environment. The dVMM can share resources (including CPU, memory, I/O and the like) on a plurality of physical machine nodes and abstract the resources into a virtual machine with multiple virtual nodes, thereby providing massive virtualized resources for an operating system and a running program running on the virtual machine.
The dVMM mainly comprises dVB (Distributed Virtual Bus) and dNUMA (Distributed Non-Uniform Memory Access).
dVB allows CPUs on different machine nodes to communicate with each other and allows different I/O devices to be mounted on different machine nodes, providing an abstraction of virtual CPUs and virtual devices. Through dVB, the guest virtual machine can use massive CPU resources and heterogeneous I/O resources such as disks, network cards, GPUs and FPGAs.
dNUMA allows memory resources on different physical nodes to be shared, providing an abstraction of distributed shared memory. dNUMA uses a distributed shared memory protocol to synchronize data between physical nodes. The simplest protocols, such as the MSI protocol, implement a memory model with sequential consistency. The method creatively proposes the dNUMA-TSO model to improve performance by weakening the degree of consistency.
This embodiment presents three methods to improve the performance of the dVMM system. By means of machine learning, dNUMA can learn the characteristics of processes accessing remote memory nodes. The first method is to let the application set affinity reasonably through a para-virtualized interface to dNUMA, thereby reducing the amount of remote memory access by the application. The second method is to send the state of the current node's vCPU to the frequently accessed memory node, so that remote memory accesses become local memory accesses. The third is to add a sub-page function to the MMU (Memory Management Unit) using acceleration hardware such as an FPGA, to reduce the granularity of memory sharing.
The core of dVB lies in interrupt forwarding. When a device generates an interrupt, it writes the interrupt number along with the target CPU number into the interrupt chip; under x86, this interrupt chip is called the IOAPIC. Since in the dVMM the target CPU may well not be local, dVB may need to intercept this interrupt and forward it to the target machine. The CPU has a similar mechanism: when the CPU is ready to issue an inter-core interrupt, it triggers a VMExit and is eventually captured by dVB. As with an I/O interrupt, it is then sent to the target physical node. FIG. 2 shows the architecture of dVB. The solid lines in each physical machine represent interrupts that can be handled locally, including inter-core interrupts and I/O interrupts, while interrupts that cannot be handled locally are sent to the virtual bus for interrupt forwarding.
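A minimal sketch of this forwarding decision follows; the helper names (cpu_to_node, send_to_virtual_bus, etc.) are illustrative assumptions:

    struct irq_msg { int vector; int target_cpu; };

    int  cpu_to_node(int cpu);               /* which machine hosts the vCPU */
    int  local_node_id(void);
    void inject_into_local_ioapic(const struct irq_msg *m);
    void send_to_virtual_bus(int node, const struct irq_msg *m);

    void dvb_deliver_irq(const struct irq_msg *m)
    {
        int node = cpu_to_node(m->target_cpu);
        if (node == local_node_id())
            inject_into_local_ioapic(m);     /* solid-line local path in FIG. 2 */
        else
            send_to_virtual_bus(node, m);    /* forward across the network      */
    }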
The interrupt chip in the virtual environment is completely emulated by software (QEMU). The interrupt chip itself does not know that the target of an interrupt to be received or sent may not be on the local machine; this determination is made by dVB, which issues an interrupt forwarding instruction to the interrupt chip.
Each physical node can be mounted with zero or more devices, including a disk, a network card, a GPU, an FPGA, an NVM, and the like. dVB virtualize these devices and provide a unified interface to the upper guest operating systems.
For virtualization of heterogeneous devices such as GPUs and FPGAs, dVB maintains a global page table to maintain the mapping of physical addresses on any two machines. Assume the GPU is mounted on machine A and machine B needs to access the GPU: dVB first looks up the global page table with the physical address p_A on machine A to find a global address p_G, then uses p_G to find the corresponding physical address p_B on machine B, and then performs interrupt forwarding, which completes the virtualization of the heterogeneous device. Assuming N machines, the size of the global page table is O(N).
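A minimal sketch of this two-step translation follows; the function names are illustrative assumptions, with the O(N) global table hidden behind gpt_to_global and gpt_to_local:

    #include <stdint.h>

    uint64_t gpt_to_global(int machine, uint64_t paddr);  /* p_A -> p_G */
    uint64_t gpt_to_local(int machine, uint64_t gaddr);   /* p_G -> p_B */
    void     forward_interrupt(int machine, uint64_t paddr);

    /* Machine `requester` accesses a device physically mounted on
     * machine `owner` at physical address paddr_on_owner. */
    void dvb_remote_device_access(int owner, int requester,
                                  uint64_t paddr_on_owner)
    {
        uint64_t g  = gpt_to_global(owner, paddr_on_owner);
        uint64_t pb = gpt_to_local(requester, g);
        forward_interrupt(requester, pb);  /* completes the virtualization */
    }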
FIG. 3 shows the architecture of dNUMA. On each physical machine there are EPT entries and the corresponding physical memory (DRAM); each EPT entry has a corresponding state. Different machine nodes communicate through a high-speed network. dNUMA uses the MSI protocol for inter-node memory synchronization. When a page fault exception occurs, dNUMA needs to perform a data synchronization operation. Let the node with the page fault exception be p_x, the Owner management node be p_y, and the Owner be p_z. dNUMA handles the page fault exception in the following steps:
step 1: p_x sends a request to p_y;
step 2: p_y reads the latest page from p_z; if the request is a write, the Owner becomes p_x and p_z sets the appropriate state;
step 3: after p_x reads the page, it fills the page into the memory of the guest operating system and sets the appropriate state.
Table 1 shows the actions of dNUMA for the different page states of the guest operating system: when the guest operating system performs an operation (read or write) on a page, dNUMA performs the corresponding action.
TABLE 1

Page state   Guest read                                      Guest write
Modified     √                                               √
Shared       √                                               fetch ownership from the Owner, invalidate other copies, set Modified
Invalid      fetch the latest page from the Owner, set Shared   fetch the latest page from the Owner, become Owner, set Modified
Where √ denotes that the operation is allowed in the current state and completes locally. FIG. 4 shows a state machine diagram of dNUMA page state transitions.
dNUMA employs a high-speed network for network communication, where the network includes, but is not limited to, RDMA. RDMA is a popular high-speed network solution that can bypass the kernel protocol stack, allowing the network card to copy datagrams directly to the user-specified address space, thereby greatly reducing latency.
dNUMA also uses compression optimization to reduce network traffic. When p_x transmits data to p_y, if p_x has previously stored a copy of the page on p_y, an efficient algorithm can be used to compute the difference between the data on the two machines. Since many operations of the guest operating system on memory modify only very little data on a page (e.g., locking/unlocking), the difference between pages across operations is small, and compression optimization can significantly reduce the data transmitted over the network. The compression optimization algorithm is as follows:
1. when p_x transmits a page P to p_y, if the data of P thereafter no longer changes, i.e. its state is not Modified, then p_x records the value of P as P_0, called a twin, and records that the value of this twin is held on p_y;
2. when p_x transmits a page P to p_y, if p_x holds a twin value and this twin value is on p_y, then only the difference between the current page value P_1 on p_x and the twin value P_0 is transmitted, denoted diff(P_0, P_1). In general, diff(P_0, P_1) is much smaller than P_1, so compression optimization reduces the network transmission load.
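A minimal sketch of computing diff(P_0, P_1) as (offset, length, bytes) runs follows; the encoding is an assumption, as the text does not fix a diff format, and a real implementation would fall back to sending the raw page whenever the encoding is not smaller:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Encode the bytes where P1 differs from the twin P0. The caller
     * provides `out` large enough for the worst case (~1.25x a page). */
    size_t page_diff(const uint8_t *p0 /* twin P_0 */,
                     const uint8_t *p1 /* current P_1 */, uint8_t *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; ) {
            if (p0[i] == p1[i]) { i++; continue; }
            size_t start = i;
            while (i < PAGE_SIZE && p0[i] != p1[i])
                i++;
            uint16_t off = (uint16_t)start;
            uint16_t len = (uint16_t)(i - start);
            memcpy(out + n, &off, sizeof off); n += sizeof off;
            memcpy(out + n, &len, sizeof len); n += sizeof len;
            memcpy(out + n, p1 + start, len);  n += len;
        }
        return n;   /* for small modifications (e.g. a lock word), n << PAGE_SIZE */
    }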
If the MSI protocol is adopted in full, the memory on dNUMA suffers a lot of thrashing, degrading performance. This is because the MSI protocol implements a sequential consistency model, which is too strict for most virtual machine applications, whereas dNUMA-TSO weakens the consistency strength to improve performance. Taking the x86 architecture as an example: x86 conforms to the x86-TSO memory model, and dNUMA-TSO aims to provide the x86-TSO memory model to the upper-layer virtual machine. The core idea of x86-TSO is that there is a write cache (Store Buffer) between the CPU and the cache; unless a special memory mode is specified, a program running on the x86 architecture writes into the write cache first. A diagram of x86-TSO is shown in FIG. 5. Writes sitting in the write cache are not seen by other CPUs, i.e., they have not yet been written to memory. In theory, an arbitrarily long time may pass before a write leaves the write cache for the memory system, unless the application actively issues a memory barrier; a memory barrier forces all writes in the write cache into memory. Under the x86 architecture, write-read reordering is allowed, and dNUMA-TSO likewise allows write-read reordering in the distributed scenario. dNUMA-TSO regards the local pages of a machine node as a write cache. The data in this write cache becomes visible to the CPUs on other machine nodes only in two cases:
1. a period of time (referred to as a time window) has elapsed;
2. the client machine calls the memory barrier.
When either of these two conditions is met, the page is broadcast to all machines. Broadcasting here means that, as with a write in the MSI protocol, the copies on other machines are set to the Invalid state. dNUMA uses the Lamport logical clock to determine the order of pages on each machine and then merges the pages according to this order. In particular, because Intel VT-x does not support exiting when the guest operating system issues a memory barrier, dNUMA requires a slightly modified guest operating system that explicitly issues a hypercall when a memory barrier is called.
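A minimal sketch of the Lamport-clock ordering used for the merge is given below; the names are illustrative and the diff payload is omitted:

    #include <stdint.h>

    static uint64_t lamport_clock;                /* this node's logical clock */

    uint64_t clock_stamp_outgoing(void)           /* on broadcast: tick, stamp */
    {
        return ++lamport_clock;
    }

    void clock_observe_incoming(uint64_t remote)  /* on receive: advance clock */
    {
        if (remote >= lamport_clock)
            lamport_clock = remote + 1;
    }

    struct page_update { uint64_t clock; int node; /* + diff payload */ };

    /* Deterministic total order for merging: clock first, node id as a
     * tie-breaker, so every machine merges updates in the same order. */
    int update_precedes(const struct page_update *a, const struct page_update *b)
    {
        if (a->clock != b->clock)
            return a->clock < b->clock;
        return a->node < b->node;
    }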
In a NUMA (non-uniform memory access) architecture there is an attribute called affinity, which indicates on which node a process prefers to run. Through a machine learning algorithm, such as reinforcement learning, dNUMA can learn which processes will frequently access a remote machine node. When this happens, dNUMA notifies the scheduler of the guest operating system to set a reasonable affinity. Frequent access to remote memory means that, with high probability, a local and a remote process are contending for the same resource, e.g., the same lock. The real-time machine learning process can capture this feature and predict a more rational distribution. dNUMA and the guest operating system share a block of memory: dNUMA fills the shared memory with the predictions of the real-time machine learning process, and the guest operating system assigns the processes that frequently access remote memory to CPUs of the remote machine node according to these results.
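The shared memory can be pictured as a small table of predictions; the layout below is purely an illustrative assumption, not the actual dNUMA para-virtual interface:

    #include <stdint.h>

    #define MAX_HINTS 64

    struct affinity_hint {
        int32_t guest_pid;    /* process predicted to thrash remote memory */
        int32_t target_node;  /* node whose memory it mostly touches       */
    };

    struct dnuma_hint_page {           /* one page shared host <-> guest   */
        volatile uint32_t generation;  /* bumped by dNUMA on each update   */
        uint32_t          count;       /* number of valid entries below    */
        struct affinity_hint hints[MAX_HINTS];
    };

    /* The guest scheduler polls `generation` and, on a change, re-pins
     * each listed process to a CPU of its `target_node`. */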
Another approach is to migrate the CPU. Using the same machine learning algorithm, dNUMA interrupts a running CPU that frequently accesses remote memory and packages the CPU state to the remote node. The packed CPU state includes the complete context needed for the CPU to run, such as the register state and the interrupt chip (LAPIC) state. After resuming, the migrated CPU turns remote memory accesses into local memory accesses. The packed CPU state is typically about 5KB, only slightly larger than one page (4KB), so the packing overhead is negligible.
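The packed context can be pictured as a flat structure like the sketch below; the exact field layout is an assumption, chosen only to be consistent with the roughly 5KB figure stated above:

    #include <stdint.h>

    struct packed_vcpu {
        uint64_t gpr[16];        /* general-purpose registers              */
        uint64_t rip, rflags;
        uint64_t cr0, cr3, cr4;  /* control registers                      */
        uint8_t  fpu[512];       /* FXSAVE area                            */
        uint8_t  lapic[4096];    /* virtual LAPIC state (one page)         */
        /* segment registers, MSRs, etc. omitted in this sketch            */
    };

    /* sizeof(struct packed_vcpu) is a little under 5KB, matching the
     * observation that migration costs only slightly more than moving
     * a single 4KB page. */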
For typical applications, a page sharing granularity of 4KB is large and introduces considerable false sharing: a modification to any single byte of a 4KB page causes a network transmission. dNUMA uses an FPGA as an accelerator, offloading critical operations of the dVMM to dedicated hardware (the FPGA) to optimize system performance. For example, by extending the MMU functionality, 128B sub-page logic can be implemented so that only modifications to that one 128B sub-page cause network traffic, reducing the volume of network transmission.
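A minimal sketch of the 128B sub-page bookkeeping follows; the FPGA-side fault trigger is abstracted away and the names are illustrative:

    #include <stdint.h>

    #define SUBPAGE_SIZE 128
    #define SUBPAGES     (4096 / SUBPAGE_SIZE)   /* 32 sub-pages per page */

    struct subpage_track { uint32_t dirty; };    /* bit i = sub-page i dirty */

    static inline void subpage_mark_write(struct subpage_track *t,
                                          uint32_t page_offset)
    {
        t->dirty |= 1u << (page_offset / SUBPAGE_SIZE);
    }

    /* On synchronization, only the dirty 128B chunks are transmitted
     * instead of the whole 4KB page, shrinking false-sharing traffic. */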
In this embodiment:
the dVMM can share heterogeneous resources (including CPU/GPU/FPGA, memory, I/O and the like) on a plurality of physical machine nodes and abstract the heterogeneous resources into a virtual machine with multiple virtual nodes, and a guest operating system and application software can run on the virtual machine without modification. Thereby providing a large amount of virtualized resources for guest operating systems and applications running on the virtual machine. The dVMM includes two parts, dVB and dUMA respectively, the former providing sharing of the computing resources CPU/GPU/FPGA and I/O devices and the latter providing sharing of distributed memory resources. dVB and dUMA can be implemented based on, but not limited to, QEMU and KVM, respectively.
The components of the dVMM are as follows:
A. a plurality of physical machines, each providing a portion of the computing, memory, and I/O resources;
B. the dVMM virtual machine management module: dVB and dNUMA run on each physical node;
C. the dVMM guest operating system: dNUMA relies on slight modifications to the guest operating system implementation (for performance acceleration) to achieve dNUMA-TSO and the NUMA affinity settings.
dVB proposes a method for abstracting physical computing resources and I/O devices into one virtual machine through a virtual bus, so that the upper-layer guest operating system obtains a consistent view of the virtualized bus. A CPU may send Inter-Processor Interrupts (IPIs) to other CPUs through dVB, and I/O interrupts are likewise forwarded through dVB, so that I/O devices may be distributed across machines on different physical nodes. It should be noted that the I/O devices here include, but are not limited to, conventional devices such as disks and network cards, heterogeneous acceleration devices such as FPGAs and GPUs, and new devices such as RDMA NICs and NVM.
dNUMA provides an implementation method for a distributed NUMA architecture: based on a distributed shared memory protocol, it realizes memory sharing across a plurality of physical machines so as to provide massive memory resources to the upper-layer virtual machine. NUMA-friendly applications can run efficiently on such a virtual machine.
The core modules of the dVMM, dVB and dNUMA, use a high-speed network for network communication. Network communication here includes, but is not limited to, RDMA. dNUMA also uses compression optimization to reduce network traffic: when physical node p_x transmits data to physical node p_y, if p_x has previously stored a copy of the page on p_y, an efficient algorithm can be used to compute the difference between the data on the two machines.
dNUMA proposes an optimized DSM protocol implementation, called dNUMA-TSO. The conventional MSI protocol implements a sequential consistency model, which is too constrained for most virtual machine applications and degrades performance; dNUMA-TSO weakens the memory consistency strength. dNUMA-TSO regards the local pages of a machine node as a write cache, whose data is broadcast to all machines only under certain circumstances. dNUMA uses the Lamport logical clock to determine the order of pages on each machine and then merges the pages according to this order.
dNUMA provides a method for extending resource affinity in a NUMA architecture across physical nodes: through machine learning, dNUMA can learn which processes will frequently access a remote machine node, and when this happens, dNUMA notifies the scheduler of the guest operating system to set a reasonable affinity.
dNUMA also proposes an optimization method based on CPU live migration. When a process running on a vCPU frequently accesses remote memory, dNUMA packages the state of the vCPU and sends it to the remote machine node. The packed state includes the registers, the interrupt chip (LAPIC), and so on.
The dVMM further presents an optimization method based on hardware acceleration using an FPGA. The FPGA enhances the function of the MMU and provides dNUMA's distributed shared memory with a smaller sharing granularity at which to maintain consistency.
The embodiment is implemented on the premise of the technical solution and architecture of the invention, and a detailed implementation manner and a specific operation process are given below, but the applicable platforms are not limited to the following examples.
The specific deployment example is a cluster consisting of eight common servers, each with 64GB of memory resources. Each server is equipped with a network card supporting InfiniBand, and the servers are connected to a central InfiniBand switch by optical fiber. The invention is not limited by the type and number of servers and can be extended to clusters of more than eight servers.
Each server runs Ubuntu Server 18.04.1 LTS 64-bit and is provided with two CPUs with 56 cores and 128GB of memory. Six of the eight machines are respectively equipped with a disk, a network card, a GPU, an FPGA, an RDMA NIC and an NVM. When starting the dVMM, the dVMM software on the eight machines is started in sequence, and the whole virtual machine begins to run after the dVMM software on all machines has started.
The specific development of this embodiment is based on the source code of QEMU 2.8.1.1 and Linux kernel 4.8.10 as an illustration; the method is equally applicable to other versions of QEMU and the Linux kernel. A slightly modified operating system, Ubuntu Server 18.04.1 LTS 64-bit, can be run on QEMU-KVM, and IDE devices, SCSI devices, network devices, GPU devices, FPGA devices, etc. can be run on the guest operating system. Test programs that perform well on the dVMM should be NUMA-friendly, that is, programmers should write code according to the characteristics of the NUMA architecture so that it has good locality. Graph computation, map-reduce, and the like are all good application scenarios.
The foregoing has described the preferred embodiments of the invention in detail. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, all technical solutions available to those skilled in the art through logical analysis, reasoning or limited experiments based on the prior art and the concept of the present invention shall be within the scope of protection defined by the claims.

Claims (1)

1. A distributed virtual machine manager, comprising:
the virtual machine management module comprises a distributed shared bus module and a distributed non-uniform memory access module which run on each physical machine node, and the uniform interface of mass resources is abstracted into a virtual machine through the distributed shared bus module and the distributed non-uniform memory access module and is provided for an upper-layer client operating system; wherein:
the distributed shared bus module enables CPUs (central processing units) among different physical machine nodes to communicate with each other, enables different I/O (input/output) equipment to be mounted on different physical machine nodes, and provides abstraction of a virtual CPU (central processing unit) and the virtual I/O equipment;
the distributed non-uniform memory access module enables memory resources on different physical machine nodes to be shared and provides an abstraction of distributed shared memory;
a guest operating system, used by the distributed non-uniform memory access module to construct a dNUMA-TSO model and to perform NUMA affinity setting;
the distributed non-uniform memory access module synchronizes data among all physical machine nodes using a distributed shared memory protocol, and weakens the degree of consistency through the dNUMA-TSO model; wherein:
the dNUMA-TSO model regards a local page of a physical machine node as a write cache; when merging writes, dNUMA uses the Lamport logical clock to determine the order of pages on each physical machine, and then merges the pages according to this order;
the NUMA affinity is set by either of the following methods:
in method one, a machine learning algorithm identifies processes that frequently access remote physical machine nodes, and the scheduler of the guest operating system is notified to set a reasonable affinity; the distributed non-uniform memory access module and the guest operating system share one block of memory, the distributed non-uniform memory access module fills the shared memory with the predictions of the real-time machine learning process, and the guest operating system assigns processes that frequently access remote memory to CPUs of the remote physical machine nodes according to these results;
method two adopts the same machine learning algorithm as method one, interrupts a running CPU that frequently accesses remote memory, packs the state of that CPU and sends it to the remote physical machine node; after resuming, the migrated CPU turns remote memory accesses into local memory accesses;
the distributed shared bus module is provided with an interrupt chip, and the interrupt chip interrupts and forwards interrupts which cannot be processed locally according to the instruction of the distributed shared bus module;
the distributed shared bus module also provides abstraction of the virtual heterogeneous devices; wherein:
the distributed shared bus module maintains a global page table for heterogeneous devices, used to maintain the mapping of the physical addresses of the heterogeneous devices on any two physical machines; the physical addresses corresponding to the two physical machines involved in an access are looked up through the global page table, and interrupt forwarding is then performed to complete the virtualization of the heterogeneous device;
the method for synchronizing data between the physical machine nodes by using the distributed non-uniform memory access module through the distributed shared memory protocol comprises the following steps:
step S1: initializing page table control authority;
step S2: the guest operating system starts to run, and when the guest operating system has a page fault exception, step S3 is executed;
step S3: when the page fault exception occurs, the guest operating system exits to the virtual machine management module, the distributed non-uniform memory access module sets the permission and reads the data according to the distributed shared memory protocol, and the process then returns to step S2;
the distributed non-uniform memory access module adopts a high-speed network for network communication and reduces network flow by using compression optimization;
the packed CPU state includes a complete context that enables the CPU to run;
the full context that enables the CPU to run includes: register state and/or interrupt chip state;
the distributed non-uniform memory access module is also provided with an accelerator; the accelerator adopts an FPGA and is used for offloading critical operations of the virtual machine management module to the FPGA;
the compression optimization algorithm comprises the following steps:
when node p_a transmits a page P to node p_b, if the data of P thereafter no longer changes, i.e. its state is not readable-and-writable, then p_a records the value of P as P_0, called a twin, and records that the value of this twin is held on p_b;
when node p_a transmits a page P to node p_b, if p_a holds a twin value and this twin value is on p_b, then only the difference between the current page value P_1 on p_a and the twin value P_0 is transmitted, denoted diff(P_0, P_1);
wherein node p_a and node p_b are two physical machine nodes that pass pages to each other.
CN201810811512.4A 2018-07-23 2018-07-23 Distributed virtual machine manager Active CN108932154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810811512.4A CN108932154B (en) 2018-07-23 2018-07-23 Distributed virtual machine manager

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810811512.4A CN108932154B (en) 2018-07-23 2018-07-23 Distributed virtual machine manager

Publications (2)

Publication Number Publication Date
CN108932154A CN108932154A (en) 2018-12-04
CN108932154B (en) 2022-05-27

Family

ID=64444142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810811512.4A Active CN108932154B (en) 2018-07-23 2018-07-23 Distributed virtual machine manager

Country Status (1)

Country Link
CN (1) CN108932154B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069440A (en) * 2019-03-11 2019-07-30 胡友彬 Meteorological ocean data Processing Algorithm Hardware system and method based on heterogeneous polynuclear
CN110569105B (en) * 2019-08-14 2023-05-26 上海交通大学 Self-adaptive memory consistency protocol of distributed virtual machine, design method thereof and terminal
CN111090531B (en) * 2019-12-11 2023-08-04 杭州海康威视系统技术有限公司 Method for realizing distributed virtualization of graphic processor and distributed system
CN111273860B (en) * 2020-01-15 2022-07-08 华东师范大学 Distributed memory management method based on network and page granularity management
CN112214302B (en) * 2020-10-30 2023-07-21 中国科学院计算技术研究所 Process scheduling method
CN114281529B (en) * 2021-12-10 2024-06-04 上海交通大学 Method, system and terminal for dispatching optimization of distributed virtualized client operating system
CN115150458A (en) * 2022-05-20 2022-10-04 阿里云计算有限公司 Device management system and method
CN117112466B (en) * 2023-10-25 2024-02-09 浪潮(北京)电子信息产业有限公司 Data processing method, device, equipment, storage medium and distributed cluster

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680987B1 (en) * 2006-03-29 2010-03-16 Emc Corporation Sub-page-granular cache coherency using shared virtual memory mechanism
CN107491340A (en) * 2017-07-31 2017-12-19 上海交通大学 Across the huge virtual machine realization method of physical machine
CN107967180A (en) * 2017-12-19 2018-04-27 上海交通大学 Based on resource overall situation affinity network optimized approach and system under NUMA virtualized environments
CN108021429A (en) * 2017-12-12 2018-05-11 上海交通大学 A kind of virutal machine memory and network interface card resource affinity computational methods based on NUMA architecture

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680987B1 (en) * 2006-03-29 2010-03-16 Emc Corporation Sub-page-granular cache coherency using shared virtual memory mechanism
CN107491340A (en) * 2017-07-31 2017-12-19 上海交通大学 Across the huge virtual machine realization method of physical machine
CN108021429A (en) * 2017-12-12 2018-05-11 上海交通大学 A kind of virutal machine memory and network interface card resource affinity computational methods based on NUMA architecture
CN107967180A (en) * 2017-12-19 2018-04-27 上海交通大学 Based on resource overall situation affinity network optimized approach and system under NUMA virtualized environments

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems; Peter J. Keleher, Alan L. Cox, Willy Zwaenepoel; USENIX Winter; 1994-01-31; pp. 23-26 *
x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors; Peter Sewell, Susmit Sarkar, Scott Owens, et al.; Communications of the ACM; 2010-07-31; pp. 89-97 *

Also Published As

Publication number Publication date
CN108932154A (en) 2018-12-04

Similar Documents

Publication Publication Date Title
CN108932154B (en) Distributed virtual machine manager
Aguilera et al. Remote memory in the age of fast networks
US9619270B2 (en) Remote-direct-memory-access-based virtual machine live migration
Chen et al. Enabling FPGAs in the cloud
US20190004836A1 (en) Parallel hardware hypervisor for virtualizing application-specific supercomputers
EP3084612B1 (en) A memory appliance for accessing memory
CN100504789C (en) Method for controlling virtual machines
US7254676B2 (en) Processor cache memory as RAM for execution of boot code
US10176007B2 (en) Guest code emulation by virtual machine function
Macdonell Shared-memory optimizations for virtual machines
US10019276B2 (en) Dynamic non-uniform memory architecture (NUMA) locality for remote direct memory access (RDMA) applications
US20210117244A1 (en) Resource manager access control
US11301142B2 (en) Non-blocking flow control in multi-processing-entity systems
US11983555B2 (en) Storage snapshots for nested virtual machines
US11748136B2 (en) Event notification support for nested virtual machines
US11900142B2 (en) Improving memory access handling for nested virtual machines
US11860792B2 (en) Memory access handling for peripheral component interconnect devices
US11526358B2 (en) Deterministic execution replay for multicore systems
Landgraf et al. Reconfigurable Virtual Memory for FPGA-Driven I/O
US20210303326A1 (en) Transparent huge pages support for encrypted virtual machines
US11003488B2 (en) Memory-fabric-based processor context switching system
US20240220294A1 (en) VM Migration Using Memory Pointers
US20230418509A1 (en) Switching memory consistency models in accordance with execution privilege level
Shan Distributing and Disaggregating Hardware Resources in Data Centers
US11886351B2 (en) Memory efficient virtual address management for system calls

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant