CN108932154A - A distributed virtual machine manager - Google Patents

A distributed virtual machine manager

Info

Publication number
CN108932154A
CN108932154A (application CN201810811512.4A)
Authority
CN
China
Prior art keywords
distributed
virtual machine
node
operating system
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810811512.4A
Other languages
Chinese (zh)
Other versions
CN108932154B (en)
Inventor
陈育彬
丁卓成
张晋
戚正伟
管海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810811512.4A priority Critical patent/CN108932154B/en
Publication of CN108932154A publication Critical patent/CN108932154A/en
Application granted granted Critical
Publication of CN108932154B publication Critical patent/CN108932154B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45541 Bare-metal, i.e. hypervisor runs directly on hardware
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/4557 Distribution of virtual machine instances; Migration and load balancing

Abstract

The present invention provides a distributed virtual machine manager, comprising: a virtual machine management module, including a distributed virtual bus module and a distributed non-uniform memory access module running on each physical machine node, which together abstract massive resources into a single virtual machine with a unified interface and provide it to the upper-layer guest operating system; and a guest operating system, which implements the dNUMA-TSO model and NUMA affinity settings for the distributed non-uniform memory access module. Through virtualization-based aggregation, the distributed virtual machine manager of the present invention abstracts the heterogeneous resources of multiple physical machines into one virtual machine, providing the guest operating system running on top with massive resources and thereby satisfying application scenarios with high resource and performance requirements.

Description

A distributed virtual machine manager
Technical field
The present invention relates to the technical fields of computer virtualization and distributed architecture, and in particular to a distributed virtual machine manager.
Background technique
Existing virtual machine managers all run on a single machine.
With the development of technologies such as machine learning and the end of Moore's Law, the resources of a single physical machine can no longer satisfy demand. There are two existing ways to use massive resources. One is to add more hardware to a single machine, as in a supercomputer. The other is to use distributed systems such as Spark and Hadoop, which run on a cluster of multiple physical machines. However, the former is very expensive, and the latter requires developers to rewrite their code against programming models such as MapReduce, which is not very friendly.
Virtualization technology has many meanings; here it refers to system virtualization, i.e., virtualization software provides a virtual hardware instruction set (ISA) to the guest software, and the guest software is a complete operating system such as Linux or Windows. Existing system virtualization solutions, such as Hyper-V, VMware, and VirtualBox, can only provide the virtual resources of one physical machine to multiple guest operating systems, i.e., "one-to-many".
Summary of the invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide a distributed virtual machine manager (dVMM, Distributed Virtual Machine Manager) that, for virtual machines requiring massive heterogeneous resources, abstracts those resources into a single virtual machine with a unified interface and provides it to the upper-layer guest operating system. Specifically, the distributed virtual machine manager aggregates the heterogeneous resources of multiple physical machines through virtualization, abstracting them into one virtual machine and providing the guest operating system running on top with massive virtualized CPU/GPU/FPGA, memory, and I/O resources, thereby satisfying application scenarios with high resource and performance requirements.
The dVMM consists of two modules: the distributed virtual bus (dVB, Distributed Virtual Bus) and distributed non-uniform memory access (dNUMA, Distributed Non-Uniform Memory Access). dVB simulates CPU and I/O resources through interrupt forwarding, while dNUMA virtualizes memory resources using distributed shared memory technology. After the guest operating system starts running, the vast majority of instructions complete directly on the local machine. When the guest operating system encounters an operation that cannot be handled locally, it exits into the dVMM, which dispatches the operation to dVB or dNUMA according to its type. After the operation has been handled, the guest operating system resumes execution.
The present invention is achieved by the following technical solutions.
A distributed virtual machine manager (dVMM), comprising:
a virtual machine management module, comprising a distributed virtual bus module and a distributed non-uniform memory access module running on each physical machine node; through the distributed virtual bus module and the distributed non-uniform memory access module, massive resources are abstracted into a single virtual machine with a unified interface and provided to the upper-layer guest operating system; wherein:
the distributed virtual bus module (dVB) allows CPUs on different physical machine nodes to communicate with each other and allows different I/O devices to be attached to different physical machine nodes, providing the abstraction of virtual CPUs and virtual I/O devices;
the distributed non-uniform memory access module (dNUMA) shares the memory resources of the different physical machines, providing the abstraction of distributed shared memory;
and a guest operating system, which builds the dNUMA-TSO model and performs NUMA affinity settings for the distributed non-uniform memory access module.
Preferably, the distributed virtual bus module is equipped with an interrupt chip, and the interrupt chip forwards interrupts that cannot be handled locally according to the instructions of the distributed virtual bus module.
Preferably, the distributed virtual bus module also provides the abstraction of virtual heterogeneous devices; wherein:
the distributed virtual bus module maintains a global page table for heterogeneous devices, recording the mapping between the physical addresses of heterogeneous devices on any two physical machines; the corresponding physical addresses on the two physical machines involved in an access are looked up through the global page table and interrupts are forwarded accordingly, completing the virtualization of heterogeneous devices.
Preferably, the distributed non-uniform memory access module synchronizes the data between physical machine nodes using a distributed shared memory protocol, and weakens the consistency level through the dNUMA-TSO model; wherein:
the dNUMA-TSO model treats the local pages of a physical machine node as a write buffer; dNUMA uses Lamport logical clocks to determine the order of pages on each physical machine, and then merges these pages in that order.
Preferably, the method by which the distributed non-uniform memory access module synchronizes data between physical machine nodes using a distributed shared memory protocol comprises the following steps:
Step S1: initialize the page table access permissions;
Step S2: the guest operating system starts running; when a page fault occurs in the guest operating system, execute step S3;
Step S3: upon a page fault, the guest operating system exits into the virtual machine management module; the distributed non-uniform memory access module sets a suitable permission and reads the appropriate data according to the distributed shared memory protocol, then returns to step S2.
Here, a suitable permission refers to the Modified, Shared, or Invalid state of the MSI protocol; the appropriate data means that the DSM reads the newest data from the Owner according to the MSI protocol.
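The S1–S3 loop above can be sketched as a minimal MSI-style fault handler. This is an illustrative sketch, not the patent's implementation: the `Page` class, the `read_from_owner` callback standing in for the network round trip to the Owner node, and the 4 KB page size are assumptions drawn from the surrounding text.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"   # readable and writable
    SHARED = "S"     # read-only
    INVALID = "I"    # neither readable nor writable

class Page:
    def __init__(self):
        self.state = State.INVALID   # step S1: pages start without permissions
        self.data = bytes(4096)

def handle_fault(page, is_write, read_from_owner):
    """Step S3: fetch the newest data and set a suitable permission.

    `read_from_owner` stands in for the round trip to the Owner node
    that holds the latest copy of the page under the MSI protocol.
    """
    page.data = read_from_owner()
    # A write fault leaves the page Modified; a read fault leaves it Shared.
    page.state = State.MODIFIED if is_write else State.SHARED
    return page

page = Page()
handle_fault(page, is_write=False, read_from_owner=lambda: b"x" * 4096)
assert page.state is State.SHARED
handle_fault(page, is_write=True, read_from_owner=lambda: b"y" * 4096)
assert page.state is State.MODIFIED
```

After step S3 the guest resumes at step S2 and retries the faulting access, which now succeeds under the new permission.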
Preferably, the distributed non-uniform memory access module communicates over a high-speed network and uses compression optimization to reduce network traffic.
Preferably, the compression optimization algorithm comprises the following steps:
when node p_a transmits a page P to node p_b, if the data on p_b will no longer change afterwards, i.e., its state is not readable-writable, then p_a records the value of P, denoted P_0 and called the twin, and records that this twin value belongs to p_b;
when node p_a transmits a page P to node p_b, if p_a holds a twin value and that twin value belongs to p_b, then only the difference between the current page value P_1 on p_a and the twin value P_0 is transmitted, denoted diff(P_0, P_1);
where nodes p_a and p_b are two physical machine nodes that transmit pages to each other.
Preferably, the NUMA affinity setting uses either of the following methods:
Method one: a machine learning algorithm identifies processes that frequently access a remote physical machine node and notifies the scheduler of the guest operating system to set a reasonable affinity. Since the distributed non-uniform memory access module and the guest operating system share a block of memory, the distributed non-uniform memory access module writes the prediction results of a real-time machine learning process into this shared memory, and according to these results the guest operating system assigns the processes that frequently access remote memory to CPUs on the remote physical machine node.
Method two: using the same machine learning algorithm as method one, interrupt CPUs that are running and frequently accessing remote memory, and transmit the CPU state to the remote physical machine node; after the migrated CPU resumes execution, its remote memory accesses become local memory accesses.
Preferably, the packed CPU state comprises the complete context needed to run the CPU;
the complete context needed to run the CPU comprises the register state and/or the interrupt chip state.
Preferably, the distributed non-uniform memory access module is further equipped with an accelerator; the accelerator uses an FPGA to offload key operations of the virtual machine management module to the FPGA.
Here, the key operations are as follows: by the principle of page faults, if an operation performed by the CPU violates the permissions in the page table, the MMU (memory management unit) triggers a page fault; this process is taken over by the FPGA. The problem with the MMU is that the page size is 4 KB, so a modification of even one or a few bytes of a page causes the entire page to be transmitted over the network, and thrashing can be fairly severe. The point of using the FPGA is that the page size can be set to 128 B (i.e., a subpage), so that the probability that a modification of one or a few bytes triggers a network transfer is reduced.
Compared with the prior art, the present invention has the following beneficial effects:
1. a distributed virtual machine manager is provided;
2. applications running on the guest operating system can use massive resources, including compute, memory, and I/O resources;
3. applications can be used without modification;
4. multiple physical machines are virtualized and a unified interface is provided to the guest operating system, so that programs can use massive resources without changing their programming model;
5. a mechanism is provided for turning multiple physical machines into one virtual machine, i.e., "many-to-one";
6. the system can run on relatively inexpensive clusters of physical machines, and applications can run directly on the dVMM without modification.
Detailed description of the invention
Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is an architecture diagram of the distributed virtual machine manager (dVMM) provided by the present invention;
Fig. 2 is an architecture diagram of the distributed virtual bus module (dVB) of the distributed virtual machine manager provided by the present invention;
Fig. 3 is an architecture diagram of the distributed non-uniform memory access module (dNUMA) of the distributed virtual machine manager provided by the present invention;
Fig. 4 is the state transition diagram of the MSI protocol;
Fig. 5 is a schematic diagram of x86-TSO according to one embodiment of the present invention.
Specific embodiment
The embodiments of the present invention are elaborated below. The present embodiment is implemented on the premise of the technical solution of the present invention, and detailed implementation methods and specific operation processes are given. It should be pointed out that those skilled in the art can make various modifications and improvements without departing from the inventive concept, and these all belong to the protection scope of the present invention.
Embodiment
The present embodiment provides a distributed virtual machine manager,
comprising:
a virtual machine management module, comprising a distributed virtual bus module and a distributed non-uniform memory access module running on each physical machine node; through the distributed virtual bus module and the distributed non-uniform memory access module, massive resources are abstracted into a single virtual machine with a unified interface and provided to the upper-layer guest operating system; wherein:
the distributed virtual bus module allows CPUs on different physical machine nodes to communicate with each other and allows different I/O devices to be attached to different physical machine nodes, providing the abstraction of virtual CPUs and virtual I/O devices;
the distributed non-uniform memory access module shares the memory resources on different physical machine nodes, providing the abstraction of distributed shared memory;
and a guest operating system, which implements the dNUMA-TSO model and NUMA affinity settings for the distributed non-uniform memory access module.
Through virtualization, the distributed virtual machine manager (dVMM, Distributed Virtual Machine Manager) lets the guest operating system use the CPU/GPU/FPGA, memory, and I/O of multiple physical machines.
The dVMM provides the resources of multiple physical machines to the guest operating system. Fig. 1 illustrates the architecture of the dVMM: each machine runs the dVB and dNUMA software, the machines are interconnected by a high-speed network, and each physical machine mounts different devices; dVB chains these devices together and abstracts them into a single virtual machine. The dVMM ultimately provides massive usable resources to the upper-layer guest operating system.
dVB is a simulation of the bus structure. Through the simulated devices and simulated interrupts in QEMU, dVB lets the computing resources (such as CPU/GPU) and I/O devices on multiple physical machine nodes communicate with each other, presenting a complete virtual machine architecture.
dNUMA synchronizes memory between physical nodes using a distributed shared memory (DSM) protocol. Usable DSM protocols include but are not limited to the MSI protocol. The MSI protocol divides the entire memory address space into multiple pages (e.g., 4 KB), each of which has one of several states: Modified (readable and writable), Shared (read-only), or Invalid (neither readable nor writable). When a program running on the DSM reads or writes any block of memory, the operation proceeds according to the state of the current page. Each page of the entire address space has an Owner, the machine node that holds the latest data of that page. When an illegal operation occurs, i.e., writing a Shared page or reading or writing an Invalid page, the dNUMA software fetches the newest value from the Owner. Owners are managed by an Owner management node. A page fault caused by a write makes the state of the current page Modified; a page fault caused by a read makes it Shared.
The dVMM is developed on the basis of QEMU-KVM (other virtualization software with similar functions could also be used), where dVB can be developed on the basis of QEMU and dNUMA on the basis of KVM. QEMU-KVM uses a hardware-assisted virtualization scheme based on Intel VT-x; AMD's scheme is similar. Taking Intel as an example, VT-x divides the instruction execution environment into two parts: root mode and non-root mode. The guest operating system runs in non-root mode most of the time, and when simulation is needed a special kind of trap, VMExit, is triggered, exiting into the virtual machine simulation software; this is root mode. VT-x uses a data structure called the EPT to describe the access permissions of the pages of the memory address space. When the guest operating system performs an operation that violates a page's access permissions, VT-x triggers an EPT Violation (a page fault, which is one kind of VMExit), the guest operating system exits into the virtualization software QEMU-KVM, and the dVMM takes over from there.
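The exit-and-dispatch flow described above can be condensed into a toy run loop. The two exit reasons and the module names below are deliberate simplifications of what VT-x/KVM actually report, chosen only to illustrate how the dVMM routes each VMExit to dNUMA or dVB.

```python
from enum import Enum, auto

class ExitReason(Enum):
    EPT_VIOLATION = auto()   # guest touched a page it lacks permission for
    IO_OR_IPI = auto()       # device access or interrupt that is not local

def run_guest(events):
    """Toy dVMM run loop: the guest runs in non-root mode until a VMExit,
    then the exit is dispatched to dNUMA (memory faults) or dVB
    (interrupt/I/O forwarding), after which the guest resumes."""
    handled = []
    for exit_reason in events:          # each event models one VMExit
        if exit_reason is ExitReason.EPT_VIOLATION:
            handled.append("dNUMA")     # DSM protocol fetches the page
        elif exit_reason is ExitReason.IO_OR_IPI:
            handled.append("dVB")       # forwarded to the target node
    return handled

trace = run_guest([ExitReason.EPT_VIOLATION, ExitReason.IO_OR_IPI])
assert trace == ["dNUMA", "dVB"]
```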
RDMA is a new high-speed network transmission technology. Its most prominent feature is that it can bypass the kernel: the NIC copies data directly into the memory region specified by the user, reducing the overhead of the kernel network protocol stack. RDMA latency is about two orders of magnitude lower than that of an ordinary TCP/IP network.
For the demand of virtualizing massive distributed resources in cloud computing environments, the present embodiment proposes an implementation of a distributed virtual machine manager, dVMM (Distributed Virtual Machine Manager). The dVMM shares the resources (including CPU, memory, I/O, etc.) of multiple physical machine nodes and abstracts them into one "many-to-one" virtual machine, providing massive virtualized resources to the operating system and application programs running on the virtual machine.
The dVMM mainly comprises two components: dVB (Distributed Virtual Bus) and dNUMA (Distributed Non-Uniform Memory Access).
dVB lets the CPUs of different machine nodes communicate with each other and lets different I/O devices be attached to different machine nodes, providing the abstraction of virtual CPUs and virtual devices. Through dVB, the guest virtual machine can use massive CPU resources and heterogeneous I/O resources such as disks, NICs, GPUs, and FPGAs.
dNUMA shares the memory resources on different physical nodes, providing the abstraction of distributed shared memory. dNUMA synchronizes the data between physical nodes using a distributed shared memory protocol. The simplest protocols, such as the MSI protocol, realize a sequentially consistent memory model. This method instead creatively proposes the dNUMA-TSO model, which improves performance by weakening the consistency level.
The present embodiment proposes three methods for improving the performance of the dVMM system. By means of machine learning, dNUMA can learn the pattern of a process accessing a remote memory node. The first method is for dNUMA to set a reasonable affinity for the application through paravirtualization, reducing the number of remote memory accesses. The second method is to send the state of the current node's vCPU to the frequently accessed memory node, turning remote memory accesses into local memory accesses. The third is to use accelerating hardware such as an FPGA to add a subpage function to the MMU (Memory Management Unit), reducing the granularity of memory sharing.
The core component of dVB is interrupt forwarding. When a device generates an interrupt, it writes the interrupt number together with the target CPU number into the interrupt chip; on x86 this interrupt chip is called the IOAPIC. In the dVMM, the target CPU may well not be local, in which case dVB intercepts the interrupt and forwards it to the target machine. CPUs have a similar mechanism: when a CPU is about to send an inter-processor interrupt, it triggers a VMExit, which is ultimately captured by dVB; like an I/O interrupt, it is finally delivered to the target physical node. Fig. 2 illustrates the architecture of dVB. The solid lines in each physical machine indicate interrupts that can be handled locally, including inter-processor interrupts and I/O interrupts; interrupts that cannot be handled locally are sent onto the virtual bus for forwarding.
Under a virtual environment, the interrupt chip is simulated entirely by software (QEMU). The interrupt chip itself does not know that the target of an interrupt to be received or sent may not be on the local machine; this determination is made by dVB, which issues an interrupt-forwarding instruction to the interrupt chip.
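The local-versus-forwarded decision above can be sketched in a few lines. The vCPU-to-node layout, the interrupt vectors, and the `forward` callback standing in for the virtual bus are all hypothetical; the patent does not specify these structures.

```python
def route_interrupt(vector, target_vcpu, cpu_to_node, local_node, forward):
    """Deliver an interrupt locally when the target vCPU lives on this
    physical node; otherwise hand it to dVB for forwarding over the
    virtual bus."""
    node = cpu_to_node[target_vcpu]
    if node == local_node:
        return ("local", vector)        # injected via the emulated IOAPIC/LAPIC
    return forward(node, vector)        # sent over the virtual bus

# Hypothetical layout: vCPUs 0-1 on node A, vCPUs 2-3 on node B.
cpu_to_node = {0: "A", 1: "A", 2: "B", 3: "B"}
sent = []
forward = lambda node, vec: sent.append((node, vec)) or ("forwarded", vec)

assert route_interrupt(0x30, 1, cpu_to_node, "A", forward) == ("local", 0x30)
assert route_interrupt(0x31, 3, cpu_to_node, "A", forward) == ("forwarded", 0x31)
assert sent == [("B", 0x31)]
```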
Each physical node can mount zero or more devices, including disks, NICs, GPUs, FPGAs, NVM, etc. dVB virtualizes these devices and provides a unified interface to the upper-layer guest operating system.
For the virtualization of heterogeneous devices such as GPUs and FPGAs, dVB maintains a global page table recording the mapping between physical addresses on any two machines. Suppose a GPU is mounted on machine A and machine B needs to access it: dVB first goes to the global page table P, finds the global address p_G through the physical address p_A on machine A, then finds the corresponding physical address p_B on machine B through p_G, and then forwards the interrupt; this completes the virtualization of the heterogeneous device. With N machines, the size of this global page table is O(N).
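The two-step translation through the global address (p_A to p_G to p_B) can be sketched with plain dictionaries. The address values and method names are invented for illustration; the patent only fixes the idea of a shared global address space.

```python
class GlobalPageTable:
    """Toy version of dVB's global page table: per-machine physical
    addresses are translated through a shared global address space."""

    def __init__(self):
        self.to_global = {}    # (machine, phys_addr) -> global_addr
        self.from_global = {}  # (global_addr, machine) -> phys_addr

    def map(self, machine, phys_addr, global_addr):
        self.to_global[(machine, phys_addr)] = global_addr
        self.from_global[(global_addr, machine)] = phys_addr

    def translate(self, src_machine, src_addr, dst_machine):
        g = self.to_global[(src_machine, src_addr)]   # p_A -> p_G
        return self.from_global[(g, dst_machine)]     # p_G -> p_B

gpt = GlobalPageTable()
gpt.map("A", 0x1000, 0xF000)   # e.g. a GPU region on machine A
gpt.map("B", 0x8000, 0xF000)   # where machine B sees the same device
assert gpt.translate("A", 0x1000, "B") == 0x8000
```

Because each machine contributes one entry per shared region, the table grows linearly in the number of machines, matching the O(N) size stated above.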
Fig. 3 illustrates the architecture of dNUMA. Each physical machine holds the EPT entries and the corresponding physical memory (DRAM), with the corresponding state recorded on the EPT entries. Machine nodes communicate over a high-speed network. dNUMA uses the MSI protocol to synchronize memory between nodes, which means dNUMA performs a data synchronization operation whenever a page fault occurs. Let the node where the page fault occurs be p_x, the Owner management node be p_y, and the Owner be p_z. dNUMA handles the page fault in the following steps:
Step 1: p_x sends a request to p_y;
Step 2: p_y reads the newest page from p_z; if the request is a write, the Owner becomes p_x, and p_z sets a suitable state;
Step 3: after reading the page, p_x fills it into the memory of the guest operating system and sets a suitable state.
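The three steps above can be condensed into a directory-based sketch: one dictionary plays the Owner management node p_y, another stands in for each node's memory. Node ids and page contents are illustrative assumptions.

```python
# Directory kept by the Owner management node (p_y): page -> owning node.
owners = {"P": "pz"}
# Stand-in for per-node memory: (node, page) -> bytes; pz holds the
# newest copy of page P.
memory = {("pz", "P"): b"latest-data"}

def handle_page_fault(faulting_node, page, is_write):
    owner = owners[page]                         # step 1: ask the manager
    data = memory[(owner, page)]                 # step 2: read from the Owner
    if is_write:
        owners[page] = faulting_node             # ownership moves to p_x
    memory[(faulting_node, page)] = data         # step 3: fill guest memory
    return data

assert handle_page_fault("px", "P", is_write=True) == b"latest-data"
assert owners["P"] == "px"                       # p_x is the new Owner
```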
Table 1 shows how dNUMA handles guest operating system operations (reads and writes) on pages in each state:
Table 1
Page state    Read          Write
Modified      √             √
Shared        √             page fault
Invalid       page fault    page fault
Here √ indicates that the operation is allowed in the current state. Fig. 4 illustrates the state machine of dNUMA page state transitions.
dNUMA communicates over a high-speed network; the network here includes but is not limited to RDMA. RDMA is a currently popular high-speed network solution that bypasses the kernel protocol stack, letting the NIC copy datagrams directly into the user-specified address space and significantly reducing latency.
dNUMA also uses compression optimization to reduce network traffic. When p_x transmits data to p_y, if p_x previously saved a copy of the page on p_y, an efficient algorithm can compute the difference between the data on the two machines. Since many of the guest operating system's memory operations modify only a small amount of data on a page (such as taking or releasing a lock), the difference between the page contents of two successive operations is very small. Compression optimization can thus greatly reduce the data transmitted over the network. The compression optimization algorithm is as follows:
1. When p_x transmits a page P to p_y, if the data on p_y will no longer change afterwards, i.e., its state is not Modified, then p_x records the value of P, denoted P_0 and called the twin, and records that this twin value belongs to p_y;
2. When p_x transmits a page P to p_y, if p_x holds a twin value and that twin belongs to p_y, then only the difference between the current page value P_1 on p_x and the twin value P_0 is transmitted, denoted diff(P_0, P_1). In general, diff(P_0, P_1) is much smaller than P_1, so compression optimization reduces the network transmission load.
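The twin/diff scheme above can be sketched with a byte-level diff. The (offset, byte) encoding is an illustrative choice; the patent does not fix a diff format, only that diff(P_0, P_1) is transmitted instead of the full page.

```python
def make_diff(twin, current):
    """diff(P0, P1): positions and new values of the bytes that changed."""
    return [(i, b) for i, (a, b) in enumerate(zip(twin, current)) if a != b]

def apply_diff(twin, diff):
    """Reconstruct P1 on the receiver from its copy P0 and the diff."""
    page = bytearray(twin)
    for i, b in diff:
        page[i] = b
    return bytes(page)

p0 = bytes(4096)                  # twin recorded at the first transfer
p1 = bytearray(p0)
p1[100] = 0x2A                    # e.g. a lock word flipped on one page
diff = make_diff(p0, bytes(p1))

assert len(diff) == 1             # one changed byte -> tiny payload
assert apply_diff(p0, diff) == bytes(p1)
```

A lock acquire/release that touches a handful of bytes thus costs a few (offset, byte) pairs on the wire instead of a full 4 KB page.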
If the MSI protocol were used in full, the memory managed by dNUMA would thrash heavily, causing performance to decline. This is because the MSI protocol realizes a sequentially consistent memory model, which is too strict for most virtual machine applications. dNUMA-TSO therefore reduces the consistency strength to improve performance. Taking the x86 architecture as an example, x86 satisfies the x86-TSO memory model, and dNUMA-TSO aims to provide the x86-TSO memory model to the upper-layer virtual machine. The core idea of x86-TSO is that there is a write buffer (store buffer) between the CPU and the cache: a write performed by a program on x86 enters the write buffer unless a special memory model is explicitly specified. A schematic of x86-TSO is shown in Fig. 5. Writes in the write buffer cannot be seen by other CPUs; that is, entering the write buffer does not write memory. In theory, an arbitrarily long time can pass before a write moves from the write buffer into the cache (memory system), unless the program actively issues a memory barrier, which forces all writes in the write buffer into memory. The x86 architecture allows write-read reordering, and dNUMA-TSO likewise leads to write-read reordering in the distributed setting. dNUMA-TSO treats the local pages of a machine node as a write buffer. The data in this write buffer can be seen by the CPUs of other machine nodes only in two cases:
1. a period of time (called the time window) has elapsed;
2. the guest machine issues a memory barrier.
When either condition is met, the page is broadcast to all machines. Broadcasting here means, as with a write under the MSI protocol, setting the page to the Invalid state on the other nodes. dNUMA uses Lamport logical clocks to determine the order of pages on each machine, and then merges the pages in that order. In the concrete implementation, since Intel VT-x does not support exiting when the guest operating system issues a memory barrier, dNUMA needs a slightly customized guest operating system that explicitly uses a hypercall when issuing a memory barrier.
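The Lamport-clock ordering used for the merge can be illustrated with a toy flush/receive pair. The tie-break by node name and the tuple encoding are conventional Lamport-clock assumptions, not details fixed by the patent.

```python
class Node:
    """A node that stamps flushed pages with a Lamport logical clock."""
    def __init__(self, name):
        self.name, self.clock = name, 0

    def flush(self, page, data):
        self.clock += 1                        # local event: flush a page
        return (self.clock, self.name, page, data)

    def receive(self, stamped):
        # Lamport receive rule: jump past the sender's timestamp.
        self.clock = max(self.clock, stamped[0]) + 1

a, b = Node("a"), Node("b")
f1 = a.flush("P", "a-v1")
b.receive(f1)
f2 = b.flush("P", "b-v1")                      # causally after f1

# Sorting by (clock, node) yields an order consistent with causality,
# so the merge applies a-v1 before b-v1.
merged = sorted([f2, f1])
assert [d for _, _, _, d in merged] == ["a-v1", "b-v1"]
```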
In NUMA (Non Uniform Memory Access) framework, there is a kind of attribute for being referred to as compatibility, it refers to that a process is more inclined To in operating on which node.By the algorithm of machine learning, such as intensified learning, which process meeting dNUMA is known that The frequently machine node of access distal end.If this phenomenon occurs, dNUMA is informed about the scheduler of client operating system, allows Reasonable compatibility is arranged in it.Frequently access remote memory illustrate local and distal end probably there are two process fight for it is same A resource, such as same lock.Real-time machine study course can capture this feature and predict more reasonable distribution side Formula.DNUMA and client operating system can share one piece of memory.DNUMA inserted in this block shared drive real-time machine learn into The prediction result of journey, and the process of the remote memory frequently accessed can be assigned to distal end according to this result by client operating system On the CPU of machine node.
Another method is migration CPU.Using same machine learning algorithm, dNUMA can interrupt those and be currently running, The frequently CPU of access remote memory, and the state of its CPU is transmitted to distal end.The CPU state of packing includes that can allow The complete context of CPU operation, such as buffer status and interruption chip status (LAPIC).CPU after migration is after resuming operation Remote memory can be accessed and become local memory access.The typical size of the CPU state of packing is about 5KB, only than one Page (4KB) is bigger, so the expense being packaged is negligible.
For typical applications, the 4 KB page sharing granularity is too coarse and introduces substantial false sharing: a modification to any single byte of a 4 KB page triggers a network transfer. dNUMA uses an FPGA as an accelerator, offloading critical dVMM operations onto the specialized hardware to optimize system performance. For example, by extending the functionality of the MMU, 128 B sub-page logic can be implemented, so that only a modification within a 128 B sub-page causes a network transfer for that sub-page, which also reduces the volume of network traffic.
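The sub-page logic can be modeled in software as follows (the FPGA would perform the comparison in hardware); `dirty_subpages` is an illustrative name of ours:

```python
SUBPAGE = 128
PAGE = 4096

def dirty_subpages(old_page, new_page, sub=SUBPAGE):
    """Return (offset, data) for each sub-page that actually changed,
    so only those 128 B chunks need to cross the network."""
    assert len(old_page) == len(new_page)
    out = []
    for off in range(0, len(old_page), sub):
        chunk = new_page[off:off + sub]
        if chunk != old_page[off:off + sub]:
            out.append((off, chunk))
    return out

old = bytearray(PAGE)
new = bytearray(PAGE)
new[200] = 0xFF            # a single-byte write inside the second sub-page
changed = dirty_subpages(bytes(old), bytes(new))
# Only one 128 B sub-page needs to be transmitted instead of the whole
# 4 KB page: a 32x reduction for this single-byte modification.
```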
In the present embodiment:
The dVMM shares the heterogeneous resources (CPU/GPU/FPGA, memory, I/O, etc.) of multiple physical machine nodes and abstracts them into a single "many-to-one" virtual machine; guest operating systems and application software can run on this virtual machine without modification. It thereby provides massive virtualized resources to the guest operating system and applications running on the virtual machine. The dVMM comprises two parts, dVB and dNUMA: the former provides sharing of the compute resources (CPU/GPU/FPGA) and I/O devices, while the latter provides sharing of the distributed memory resources. dVB and dNUMA can be implemented based on, but are not limited to, QEMU and KVM respectively.
The components of the dVMM are as follows:
A. Multiple physical machines. Each physical machine provides part of the computing, memory, and I/O resources;
B. The dVMM virtual machine management module. dVB and dNUMA each run on every physical node;
C. The dVMM guest operating system. dNUMA relies on a slightly modified guest operating system (for performance acceleration) to implement dNUMA-TSO and NUMA affinity setting.
dVB proposes a method for abstracting physical compute resources, I/O devices, and the like into a single virtual machine through a virtual bus, so that the upper-layer guest operating system obtains a consistent virtualized bus view. A CPU can send inter-processor interrupts (IPIs) to other CPUs over dVB, and I/O interrupts can likewise be routed into dVB, so that I/O devices can be distributed across machines on different physical nodes. Note that the I/O devices here include, but are not limited to, traditional devices such as disks and network cards, heterogeneous accelerators such as FPGAs and GPUs, and novel devices such as RDMA and NVM.
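A toy model of IPI routing over the distributed virtual bus is sketched below. The patent does not specify this interface, so `vcpu_map` and `route_ipi` are illustrative names under our own assumptions (local delivery when the target vCPU is on the same node, forwarding over dVB otherwise):

```python
# Which physical node hosts each vCPU of the "many-to-one" virtual machine.
vcpu_map = {0: "node-0", 1: "node-0", 2: "node-1", 3: "node-1"}
local_node = "node-0"
delivered, forwarded = [], []

def route_ipi(src_vcpu, dst_vcpu, vector):
    """Deliver an IPI locally when possible, otherwise forward it on dVB."""
    if vcpu_map[dst_vcpu] == local_node:
        delivered.append((dst_vcpu, vector))     # inject via the local LAPIC
    else:
        # ship (target node, vCPU, vector) over the network bus
        forwarded.append((vcpu_map[dst_vcpu], dst_vcpu, vector))

route_ipi(0, 1, vector=0xF0)   # both vCPUs on node-0: local delivery
route_ipi(0, 3, vector=0xF1)   # vCPU 3 lives on node-1: forwarded on dVB
```

The same dispatch decision applies to I/O interrupts from devices attached to remote nodes.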
dNUMA proposes a distributed NUMA architecture implementation. Based on a distributed shared memory protocol, it shares the memory of multiple physical machines to provide massive memory resources to the upper-layer virtual machine. NUMA-friendly applications can run efficiently on this virtual machine.
The core modules of the dVMM, dVB and dNUMA, communicate over a high-speed network; this includes, but is not limited to, RDMA. dNUMA also applies a compression optimization to reduce network traffic: when physical node px transmits data to physical node py, if px previously kept a copy of the page on py, an efficient algorithm can compute the difference between the data on the two machines.
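One possible byte-granularity sketch of this diff computation is shown below; the patent leaves the exact algorithm unspecified, and the "twin" terminology follows its claim 7. `make_twin`, `diff`, and `apply_diff` are our own names:

```python
def make_twin(page):
    """Record a copy (the 'twin') of the page as last sent to the peer."""
    return bytes(page)

def diff(twin, current):
    """Byte-level difference between the twin and the current page:
    a list of (offset, new_byte) pairs, typically far smaller than the page."""
    return [(i, b) for i, (a, b) in enumerate(zip(twin, current)) if a != b]

def apply_diff(twin, d):
    """Receiver side: reconstruct the current page from its twin + diff."""
    page = bytearray(twin)
    for off, b in d:
        page[off] = b
    return bytes(page)

p0 = make_twin(b"\x00" * 4096)        # twin kept by the sender for the peer
p1 = bytearray(p0); p1[10] = 7; p1[2000] = 9
d = diff(p0, bytes(p1))               # only 2 (offset, byte) pairs to send
restored = apply_diff(p0, d)          # peer reconstructs the full page
```

Sending `diff(P0, P1)` instead of the whole page is profitable whenever only a small fraction of the page has changed since the last transfer.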
dNUMA proposes an optimized DSM protocol implementation, called dNUMA-TSO. The traditional MSI protocol implements a sequential consistency model, a constraint that is too strong for most virtual machine applications and degrades performance. dNUMA-TSO therefore relaxes the memory consistency strength: it treats the local pages of a machine node as a write buffer, and data in the write buffer is broadcast to all machines only under certain conditions. dNUMA uses Lamport logical clocks to determine the order of pages on each machine, and then merges the pages in that order.
dNUMA proposes a method for extending resource affinity across physical nodes in a NUMA architecture. Through machine learning, dNUMA learns which processes frequently access remote machine nodes. When this occurs, dNUMA informs the scheduler of the guest operating system so that it can set a reasonable affinity.
dNUMA proposes a hot CPU migration optimization. When the process running on a vCPU frequently accesses remote memory, dNUMA transfers the state of that vCPU to the remote machine node. The packed state includes the registers, the interrupt chip (LAPIC), and so on.
The dVMM proposes a hardware-based acceleration optimization using an FPGA. The FPGA enhances the functionality of the MMU and provides the dNUMA distributed shared memory with a finer sharing granularity at which to maintain consistency.
This embodiment is implemented on the premise of the technical solution and framework of the present invention and gives a detailed implementation and concrete operating process, but the applicable platforms are not limited to the following examples.
The concrete deployment example is a cluster of eight commodity servers, each with 64 GB of memory. Each server is equipped with a network card that supports InfiniBand, and the servers are connected by optical fiber to a central InfiniBand switch. The invention is not limited by the type or number of servers and can be extended to clusters of more than eight servers.
Each server runs Ubuntu Server 18.04.1 LTS 64-bit and is equipped with dual-socket CPUs totaling 56 cores and 128 GB of memory. Six of the eight machines are respectively equipped with a disk, a network card, a GPU, an FPGA, RDMA, and NVM. To start the dVMM, the dVMM software on the eight machines is started in turn; the whole virtual machine begins to run once the dVMM software on all machines has started.
The concrete development of this embodiment is described based on the source code of QEMU 2.8.1.1 and Linux kernel 4.8.10; other QEMU and Linux kernel versions are equally applicable. A slightly modified operating system, Ubuntu Server 18.04.1 LTS 64-bit, runs on QEMU-KVM. IDE devices, SCSI devices, network devices, GPU devices, FPGA devices, and so on can run on the guest operating system. Test programs that perform well on the dVMM should be NUMA-friendly; that is, programmers should write code according to the characteristics of the NUMA architecture so that the code has good locality. Graph computation, map-reduce, and the like are good application scenarios.
The preferred embodiments of the present invention have been described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative labor. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning, or limited experimentation on the basis of the prior art under the concept of the present invention shall fall within the scope of protection determined by the claims.

Claims (10)

1. A distributed virtual machine manager, characterized by comprising:
a virtual machine management module, including a distributed shared bus module and a distributed non-uniform memory access module running on each physical machine node; through the distributed shared bus module and the distributed non-uniform memory access module, massive resources are abstracted into the unified interface of a single virtual machine and provided to the upper-layer guest operating system; wherein:
the distributed shared bus module enables the CPUs on different physical machine nodes to communicate with each other and the different I/O devices attached to different physical machine nodes to be shared, providing virtual CPU and virtual I/O device abstractions;
the distributed non-uniform memory access module shares the memory resources of the different physical machine nodes, providing a distributed shared memory abstraction;
and a guest operating system, which builds the dNUMA-TSO model with the distributed non-uniform memory access module and performs NUMA affinity setting.
2. The distributed virtual machine manager according to claim 1, wherein the distributed shared bus module is equipped with an interrupt chip, and the interrupt chip forwards interrupts that cannot be handled locally according to the instructions of the distributed shared bus module.
3. The distributed virtual machine manager according to claim 1, wherein the distributed shared bus module further provides a virtual heterogeneous device abstraction; wherein:
the distributed shared bus module maintains a global page table for heterogeneous devices, to maintain the mapping between the physical addresses of heterogeneous devices on any two physical machines; the corresponding physical addresses on the two physical machines to be accessed are looked up through the global page table and interrupt forwarding is performed, completing the virtualization of the heterogeneous device.
4. The distributed virtual machine manager according to claim 1, wherein the distributed non-uniform memory access module synchronizes the data between the physical machine nodes using a distributed shared memory protocol and relaxes the consistency strength through the dNUMA-TSO model; wherein:
the dNUMA-TSO model treats the local pages of a physical machine node as a write buffer; dNUMA uses Lamport logical clocks to determine the order of pages on each physical machine, and then merges the pages in that order.
5. The distributed virtual machine manager according to claim 4, wherein the method by which the distributed non-uniform memory access module synchronizes the data between the physical machine nodes using the distributed shared memory protocol comprises the following steps:
Step S1: initialize the page table access permissions;
Step S2: the guest operating system begins to run; when a page fault occurs in the guest operating system, execute step S3;
Step S3: on a page fault in the guest operating system, exit to the virtual machine management module; the distributed non-uniform memory access module sets the appropriate permissions and reads the appropriate data according to the distributed shared memory protocol, and then returns to step S2.
6. The distributed virtual machine manager according to claim 1, wherein the distributed non-uniform memory access module communicates over a high-speed network and uses a compression optimization to reduce network traffic.
7. The distributed virtual machine manager according to claim 1, wherein the algorithm of the compression optimization comprises the following steps:
when node pa transmits a page P to node pb, if the data on pb will no longer change afterwards, i.e., its state is not readable-writable, then pa records the value of P, denoted P0 and called the twin, and records that this twin value belongs to pb;
when node pa transmits a page P to node pb, if pa holds a twin value and this twin value belongs to pb, then only the difference between the current page value P1 on pa and the twin value P0 is transmitted, denoted diff(P0, P1);
wherein node pa and node pb are two physical machine nodes that transmit pages to each other.
8. The distributed virtual machine manager according to claim 1, wherein the NUMA affinity setting uses either of the following methods:
Method 1: use a machine-learning algorithm to identify the processes that frequently access remote physical machine nodes, and notify the scheduler of the guest operating system to set a reasonable affinity; since the distributed non-uniform memory access module and the guest operating system share a region of memory, the distributed non-uniform memory access module writes the prediction results of the real-time machine-learning process into this shared memory, and the guest operating system assigns the processes that frequently access remote memory to the CPUs of the remote physical machine node according to these results;
Method 2: use the same machine-learning algorithm as in Method 1 to interrupt the running CPUs that frequently access remote memory, and transmit the CPU state to the remote physical machine node; after the migrated CPU resumes operation, its remote-memory accesses become local memory accesses.
9. The distributed virtual machine manager according to claim 1, wherein the packed CPU state includes the complete context required for the CPU to run;
the complete context required for the CPU to run includes: register state and/or interrupt-chip state.
10. The distributed virtual machine manager according to claim 1, wherein the distributed non-uniform memory access module is further equipped with an accelerator; the accelerator uses an FPGA and is used to offload the critical operations of the virtual machine management module onto the FPGA.
CN201810811512.4A 2018-07-23 2018-07-23 Distributed virtual machine manager Active CN108932154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810811512.4A CN108932154B (en) 2018-07-23 2018-07-23 Distributed virtual machine manager


Publications (2)

Publication Number Publication Date
CN108932154A true CN108932154A (en) 2018-12-04
CN108932154B CN108932154B (en) 2022-05-27

Family

ID=64444142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810811512.4A Active CN108932154B (en) 2018-07-23 2018-07-23 Distributed virtual machine manager

Country Status (1)

Country Link
CN (1) CN108932154B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680987B1 (en) * 2006-03-29 2010-03-16 Emc Corporation Sub-page-granular cache coherency using shared virtual memory mechanism
CN107491340A (en) * 2017-07-31 2017-12-19 上海交通大学 Across the huge virtual machine realization method of physical machine
CN107967180A (en) * 2017-12-19 2018-04-27 上海交通大学 Based on resource overall situation affinity network optimized approach and system under NUMA virtualized environments
CN108021429A (en) * 2017-12-12 2018-05-11 上海交通大学 A kind of virutal machine memory and network interface card resource affinity computational methods based on NUMA architecture

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PETER J. KELEHER, ALAN L. COX, WILLY ZWAENEPOEL: "TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems", USENIX Winter *
PETER SEWELL, SUSMIT SARKAR, SCOTT OWENS, et al.: "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors", Communications of the ACM *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069440A (en) * 2019-03-11 2019-07-30 胡友彬 Meteorological ocean data Processing Algorithm Hardware system and method based on heterogeneous polynuclear
CN110569105A (en) * 2019-08-14 2019-12-13 上海交通大学 Self-adaptive memory consistency protocol of distributed virtual machine, design method and terminal thereof
WO2021027069A1 (en) * 2019-08-14 2021-02-18 上海交通大学 Adaptive memory consistency protocol for distributed virtual machines, design method for same, and terminal
CN110569105B (en) * 2019-08-14 2023-05-26 上海交通大学 Self-adaptive memory consistency protocol of distributed virtual machine, design method thereof and terminal
CN111090531A (en) * 2019-12-11 2020-05-01 杭州海康威视系统技术有限公司 Method for realizing distributed virtualization of graphics processor and distributed system
CN111273860A (en) * 2020-01-15 2020-06-12 华东师范大学 Distributed memory management method based on network and page granularity management
CN111273860B (en) * 2020-01-15 2022-07-08 华东师范大学 Distributed memory management method based on network and page granularity management
CN112214302A (en) * 2020-10-30 2021-01-12 中国科学院计算技术研究所 Process scheduling method
CN112214302B (en) * 2020-10-30 2023-07-21 中国科学院计算技术研究所 Process scheduling method
CN115150458A (en) * 2022-05-20 2022-10-04 阿里云计算有限公司 Device management system and method
CN117112466A (en) * 2023-10-25 2023-11-24 浪潮(北京)电子信息产业有限公司 Data processing method, device, equipment, storage medium and distributed cluster
CN117112466B (en) * 2023-10-25 2024-02-09 浪潮(北京)电子信息产业有限公司 Data processing method, device, equipment, storage medium and distributed cluster

Also Published As

Publication number Publication date
CN108932154B (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN108932154A (en) A kind of distributed virtual machine manager
US11645099B2 (en) Parallel hardware hypervisor for virtualizing application-specific supercomputers
Aguilera et al. Remote memory in the age of fast networks
Aguilera et al. Remote regions: a simple abstraction for remote memory
US9619270B2 (en) Remote-direct-memory-access-based virtual machine live migration
Young et al. The duality of memory and communication in the implementation of a multiprocessor operating system
CN100504789C (en) Method for controlling virtual machines
Iftode et al. Scope consistency: A bridge between release consistency and entry consistency
US7380039B2 (en) Apparatus, method and system for aggregrating computing resources
US9851918B2 (en) Copy-on-write by origin host in virtual machine live migration
US9772971B2 (en) Dynamically erectable computer system
TW201227301A (en) Real address accessing in a coprocessor executing on behalf of an unprivileged process
CN113835685A (en) Network operating system design method based on mimicry database
WO2022139920A1 (en) Resource manager access control
Sun et al. Baoverlay: A block-accessible overlay file system for fast and efficient container storage
US20080140762A1 (en) Job scheduling amongst multiple computers
US10108349B2 (en) Method and system that increase storage-stack throughput
US20230112225A1 (en) Virtual machine remote host memory accesses
Principe et al. A distributed shared memory middleware for speculative parallel discrete event simulation
Ramesh et al. Is it time to rethink distributed shared memory systems?
Tanenbaum A comparison of three microkernels
Shan Distributing and Disaggregating Hardware Resources in Data Centers
Butelle et al. A model for coherent distributed memory for race condition detection
Polze et al. Parallel Computing in a World of Workstations.
Denton et al. Distributed shared memory using reflective memory: The LAM system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant