CN103140830B - System and method for controlling the input/output of a virtual network - Google Patents

System and method for controlling the input/output of a virtual network

Info

Publication number
CN103140830B
CN103140830B CN201180045902.1A
Authority
CN
China
Prior art keywords
virtual machine
packet
hypervisor
guest
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201180045902.1A
Other languages
Chinese (zh)
Other versions
CN103140830A (en)
Inventor
考什克·C·巴德 (Kaushik C. Barde)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority claimed from PCT/US2011/053731 external-priority patent/WO2012044700A1/en
Publication of CN103140830A publication Critical patent/CN103140830A/en
Application granted granted Critical
Publication of CN103140830B publication Critical patent/CN103140830B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

According to an embodiment, a method of running a virtual machine on a server includes: controlling, using a first hypervisor (310) running on the server, data path resources allocated to the virtual machine, where controlling the data path resources includes controlling a data path of a hardware interface device (316) coupled to the server; and controlling, using a second program (304) running on the server, a control path and initialization resources of the hardware interface device, where the second program (304) is separate from the first hypervisor (310).

Description

System and Method for Controlling the Input/Output of a Virtual Network
This application claims priority to U.S. Provisional Application No. 61/389,071, filed October 1, 2010 and entitled "Accelerated Network Path: A New Method for Improving Virtual Network Input/Output Performance Using Multi-Queue NIC Assistance," and to U.S. Application No. 13/247,578, filed September 28, 2011 and entitled "System and Method for Controlling the Input/Output of a Virtual Network," the contents of both of which are incorporated herein by reference.
Technical field
The present invention relates to computer servers and, in particular embodiments, to a system and method for controlling the input/output (IO) of a virtual network.
Background of the Invention
Virtualization technology is known to be a key enabler of the core infrastructure as a service (IaaS) underlying various cloud deployments. Over roughly the past decade, advances in x86 hardware assistance have paved the way for virtualization solutions focused on performance and scalability. A hypervisor, also referred to as a virtual machine monitor (VMM), uses software instruction interception mechanisms to emulate central processing unit (CPU), memory, and I/O resources for an operating system (OS) running as a guest. A correctly written VMM can present the guest OS with a reliable, secure, and accurate view of the system.
A VMM may use paravirtualization to guarantee coherent CPU, memory, and IO responses. For CPU virtualization, before virtualization technology (VT) extensions were implemented, VMMs such as VMware/Xen used a complex basic input/output system (BIOS), CPU scheduler, ring handling, and instruction emulation to "trap and execute" virtual instructions on the actual CPU cores. Modern CPUs, however, can handle guest-to-root partition transitions and offload the heavy burden of CPU virtualization from the VMM. This is generally achieved by letting the guest execute batches of instructions directly on the physical CPU. For example, mainstream x86 platforms from manufacturers such as Intel® and AMD® provide secure virtual machine (SVM)/virtual machine extensions (VMX) instructions to carry out VMENTER/VMEXIT operations. In addition, CPU improvements such as the tagged translation lookaside buffer (TLB), pause-loop exiting (PLE), and virtual processor identifiers (VPID) reduce the number of CPU cycles required for guest CPU sharing. Memory optimizations have also focused on shadow page tables, optimizing the guest address translation and overhead caused by VMEXIT in order to reduce guest page walks.
I/O virtualization remains an important area for improving virtualized IO performance and scalability, and network I/O is one of the most closely watched core aspects of network virtualization. Multi-queue applications, Intel virtualization technology for directed device assignment (VT-d), the hardware-based virtualization technology of single-root I/O virtualization (SR-IOV), and software-based virtualization endpoints such as the Xen isolated driver domain (IDD) and Netchannel2 all improve the overall per-guest network interface card (NIC) performance, but considerable room for improvement remains in areas such as CPU load, architecture cost, and information security.
Summary of the invention
According to an embodiment, a method of running a virtual machine on a server includes: controlling, using a first hypervisor running on the server, data path resources allocated to the virtual machine, where controlling the data path resources includes controlling a data path of a hardware interface device coupled to the server; and controlling, using a second program running on the server, a control path and initialization resources of the hardware interface device, where the second program is separate from the first hypervisor.
According to another embodiment, a method of operating a server system running a plurality of virtual machines includes: loading a control plane program; loading a hypervisor; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the hypervisor; and controlling a control path and initialization resources of the hardware interface device via the control plane program.
According to another embodiment, a data processing system for virtual machines includes: a processor; a memory coupled to the processor; and an interface port for coupling to a hardware network interface device separate from the processor. The processor is configured to run a first program that transfers data via the interface port between a packet queue associated with a virtual machine and a packet queue in the hardware network interface. The processor is further configured to run a second program that controls the configuration of the hardware network interface device, where the second program is separate from the first program.
According to another embodiment, a non-transitory computer-readable medium stores an executable program. The program instructs a processor to perform the following steps: loading a control plane program; loading a hypervisor; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the hypervisor; and controlling a control path and initialization resources of the hardware interface device via the control plane program.
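The split between a hypervisor-owned fast data path and a separate slow control path, as summarized above, can be illustrated with a small dispatch model. The following Python sketch is illustrative only; the register names and the `Hypervisor`, `ControlPlane`, and `route_access` names are assumptions of this sketch, not part of the disclosed embodiments.

```python
# Toy model of the split device view: the hypervisor handles fast-path
# (data) register accesses, while a separate control-plane program handles
# slow-path (configuration/initialization) accesses. All names are
# illustrative, not from the patent.

DATA_REGS = {"tx_tail", "rx_tail", "tx_head", "rx_head"}
CTRL_REGS = {"pci_config", "reset", "link_setup", "irq_route"}

class Hypervisor:
    def __init__(self):
        self.handled = []
    def emulate(self, reg, value):
        self.handled.append((reg, value))
        return "fast-path"

class ControlPlane:  # e.g. a QEMU-like process, separate from the hypervisor
    def __init__(self):
        self.handled = []
    def emulate(self, reg, value):
        self.handled.append((reg, value))
        return "slow-path"

def route_access(reg, value, hyp, ctl):
    """Dispatch a trapped guest device access to the right component."""
    if reg in DATA_REGS:
        return hyp.emulate(reg, value)
    if reg in CTRL_REGS:
        return ctl.emulate(reg, value)
    raise ValueError(f"unknown register {reg}")

hyp, ctl = Hypervisor(), ControlPlane()
assert route_access("tx_tail", 42, hyp, ctl) == "fast-path"
assert route_access("pci_config", 0x10, hyp, ctl) == "slow-path"
assert hyp.handled == [("tx_tail", 42)]
```

The point of the split is that only the rare, configuration-time accesses pay the cost of the slower control-plane component, while per-packet accesses stay on the hypervisor fast path.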
The foregoing has outlined, rather broadly, features of embodiments of the present invention so that the detailed description that follows may be better understood. Additional features and advantages of embodiments of the invention, which form the subject of the claims of the present invention, are described below. Those skilled in the art will appreciate that the disclosed concepts and specific embodiments may readily be used as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. Those skilled in the art will also realize that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
Brief Description of the Drawings
For a more complete understanding of the present invention and the advantages thereof, reference is made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Fig. 1 illustrates the architecture of the prior art Xen Netchannel-2;
Fig. 2 illustrates an embodiment multi-queue system architecture;
Figs. 3a to 3c illustrate block diagrams of embodiment transmit, receive, and control paths;
Fig. 4 illustrates an embodiment network device; and
Fig. 5 illustrates an embodiment processing system.
Detailed description of the invention
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
Some embodiment methods are alternatives to existing paravirtualization schemes and include unique building blocks that exploit the multi-queue capability of NIC hardware while reducing CPU context switches. Embodiment systems and methods provide an I/O virtualization alternative that works with various hypervisor environments, including, but not limited to, the Xen hypervisor environment.
Some embodiment systems and methods (1) remove redundant context switches for packet transmit (TX)/receive (RX) operations; (2) exploit multi-queue NICs; (3) are applicable to PV or hardware-assisted virtualization (HVM) domain setups; and (4) approach the performance of VT-d-style direct device assignment, thereby forming a scalable, low-latency network I/O channel with better power management. In some embodiments, for example, an Intel® multi-queue NIC may be used. Alternatively, other NIC devices can be used.
Some embodiment systems include a split network device and IO (Split Network Device & IO) view that separates a slow control path from a fast data path on an x86 system. This is also applicable to other devices, such as storage devices. Embodiment systems and methods further include methods of exposing I/O device register handling and NIC queue mapping in a hypervisor environment, and/or of reducing CPU context switches relative to existing paravirtualized methods. Other systems can also improve per-CPU-core/thread virtual machine (VM) I/O affinity and VM page cache locality.
Some embodiment systems can be implemented by modifying and extending the Xen framework. Such solutions are also compatible with the CPU VT extensions. In one example embodiment, Xen 4.0 with a 2.6.32 kernel is used as domain 0 (Dom0) operating with 64-bit and 32/64-bit Linux guest environments. Alternatively, other environments can be used.
Environments such as Xen use the quick emulator device model (QEMU) or paravirtualization to emulate network IO. One cause of performance degradation is the cost of the virtualized I/O devices of each VM. In recent years, hardware developments in support of virtualization have improved the overall CPU and memory performance of virtualized guests, so that a virtualized guest can achieve nearly native performance. The network device and its operation are core performance-sensitive indicators. Where TX/RX data is concerned, any operating system, from servers to desktops and notebooks, and the applications running on it rely on nearly "lossless" performance. Under virtualization, network I/O performance becomes even more critical. Embodiment methods therefore focus on improving network I/O. Although some embodiments described herein use Xen as an example hypervisor system to describe embodiment systems, the embodiment systems and methods are hypervisor-agnostic and will generally work with other x86 hypervisors, including, but not limited to, the kernel-based virtual machine (KVM).
In one embodiment, the QEMU device model and paravirtualization both use software IO virtualization techniques. QEMU is the original open-source software IO emulator with PIIX3/PIIX4, the pre-ICH southbridge chipset, and can emulate a BIOS, ACPI, an IDE controller (floppy, CD-ROM, hard disk), framebuffer devices, and network controllers acting as master controllers (RealTek, e1000). In addition, more custom extensions have been added over time. Such embodiment setups allow I/O channel operations from the guest (TX/RX via memory-mapped IO (MMIO) and direct memory access (DMA)) on each physical device in the privileged guest to undergo gpfn-to-mfn translation, interrupt handling, and complex page table moves, operating in a "trap and forward" mode. These allow the emulator to present a "real" device map. The main problems of the QEMU-based device model include performance, scalability, and power management. These measures are directly related to the "context switch" cost between the kernel/user modes of each hypervisor-based software layer.
Frequent switching between guest and host user mode, kernel mode, and the hypervisor and back causes extra instructions to be loaded on the CPU, potentially resulting in performance delays, reduced scalability, longer instruction execution latency, and increased power consumption. One original PV approach is to "shear off" the QEMU-based emulation used to handle I/O channel operations (DMA/MMIO) and have the physical device driver attach TX/RX buffers directly to the physical device. A modified guest/Dom0/Xen, including custom backend/frontend drivers, sets up shared memory pages (grant tables) and maps guest TX/RX buffers to the physical device TX/RX. The context switching overhead in the original PV method is very high, because TX/RX activity passes through at least four context switches, for example, from guest to user/kernel, to Dom0, to the hypervisor, resulting in longer per-packet cycles and higher TLB overhead. Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, and Ian Pratt propose an alternative method in HP Labs technical report HPL-2008-39, "Bridging the Gap between Software and Hardware Techniques for I/O Virtualization," which is hereby incorporated herein by reference. That method uses an IDD rather than Dom0 as the device IO channel host, to improve guest I/O channel performance.
One problem with this solution is that the number of "intercepting" elements involved in context switching remains unchanged. In addition, because more VMs introduce delay, throughput drops significantly. The method also suffers from having too many VM TX/RX queues and from unfair queue subscription. Most NICs have a single TX and RX queue shared by the VMs, and this queue suffers from over-consumption. For example, a VM running a highly "chatty" application can let that VM's TX/RX buffers occupy the physical device queue and "starve" the other VMs of I/O channel operations.
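The starvation behavior described above can be illustrated with a small simulation. The greedy fill policy and all names below are assumptions for illustration only; real NIC queue arbitration is more involved.

```python
from collections import deque

def fill_shared_queue(vm_offers, capacity):
    """Fill one shared TX queue greedily in VM order: a 'chatty' VM that
    offers many buffers can occupy the entire queue."""
    queue, admitted = deque(), {vm: 0 for vm in vm_offers}
    for vm, count in vm_offers.items():
        for _ in range(count):
            if len(queue) == capacity:
                return queue, admitted
            queue.append(vm)
            admitted[vm] += 1
    return queue, admitted

# VM "a" is chatty; with one shared 8-slot queue it starves "b" and "c".
_, shared = fill_shared_queue({"a": 16, "b": 4, "c": 4}, capacity=8)
assert shared == {"a": 8, "b": 0, "c": 0}

# With a per-VM queue (as on a multi-queue NIC), each VM has its own capacity.
per_vm = {vm: min(n, 8) for vm, n in {"a": 16, "b": 4, "c": 4}.items()}
assert per_vm == {"a": 8, "b": 4, "c": 4}
```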
Hardware is evolving rapidly to support hardware-based virtualization assists for hypervisors. NIC hardware vendors have produced SR-IOV-based cards and cards based on multi-queue capability, thereby "offloading" the software-based TX/RX ring buffers. SR-IOV-based cards use a modified PCI configuration and a PF/VF-based master and slave device driver model, so that NIC IO channels map directly to the VMs as separate "physical devices," eliminating the dependency on the hypervisor.
Fig. 1 illustrates Xen Netchannel-2 100 with an IDD supporting multi-queue devices, which is an existing I/O channel architecture and flow. The Xen PV architecture includes backend/frontend drivers, memory page sharing, and hypervisor-supported data structures, for example, event channels for interrupt notification and callbacks, hypercalls for inter-domain communication, and grant tables. The guest's view of the network IO channel depends on the guest type, i.e., whether the guest operates in PV or HVM mode. Traditional HVM mode refers to the case in which the guest OS runs without any modification to its kernel or user-mode components or its lifecycle.
The Xen hypervisor 110 is a small hypervisor that handles memory management, VCPU scheduling, and interrupt scheduling. Dom0, with its PV setup including grant tables, event channels, and shared memory, supports all IO to control and represent the guests. Dom0 provides a virtualized IO setup via the QEMU device model or a paravirtualized setup via the netfront/netback driver mechanism. Moreover, fully virtualized or paravirtualized DomU guests receive all of their I/O services from Dom0. Because fully virtualized IO is much more expensive in terms of CPU cycles, the drawback of that approach is slower system operation. Although paravirtualized IO throughput is good when the number of guests is small, it drops off rapidly as the number of guests increases. In both cases, I/O latency does not scale with the increasing number of guests. This presents a scalability difficulty for IaaS cloud VM IO.
With regard to current I/O latency, PV uses shared memory and a "grant" mechanism, involving explicit memory copies, to move TX/RX data between the guest and Dom0. The event channel between the guest and Dom0 is used to activate TX/RX triggers. The events actually interrupt Dom0 or the guest, depending on the direction of the data. Moreover, delivering an interrupt requires a context switch into the hypervisor, followed by another context switch into the destination domain. The PV method therefore adds context switches, which adds latency and in turn harms VM scalability.
For the PV guest network I/O flow, in the PV case the guest virtual physical address (VPA) space is modified to map Xen memory (Xen-u). Typically, the Xen memory includes about 12 MB of mapped memory, and the Xen VPA interacts directly with the frontend driver, which includes the I/O channel handlers. Dom0 holds the backend driver, the native NIC driver, and the IO event channel, which is the Dom0 shared-memory data structure that drives the guest TX/RX buffer/packet flows. Device setup is generally completed during OS boot. The OS probes the underlying hardware, including the NIC card, and sets up the virtual PCI configuration space parameters and interrupt routing. General network IO includes network buffers, denoted TX/RX buffers. Virtual networking uses the same concepts; however, because virtual networking focuses on guest network I/O performance, the CPU cost of virtual network IO on virtual machine nodes is higher. Old PV methods suffer from shortcomings such as a suboptimal network bridge implementation and higher CPU saturation. CPU saturation also means that fewer cycles are available to VM applications, making the method less viable for cloud IaaS. Netchannel2 is also suited to exploiting multi-queue NICs and/or SR-IOV NICs.
In one embodiment of the present invention, a modified PV framework supports multiple HW queues by reducing context switches; the modified PV framework improves IO scalability and reduces latency using VMDq. This is achieved by using a dual view of the NIC device. In one embodiment, the NIC driver in Dom0 performs the control operations, including control of the PCI configuration space and interrupt routing; this control activity is mostly MMIO. The fast-path queue memory is mapped into the hypervisor driver handling. Both drivers can see the device register sets, control structures, and so on. In one embodiment, RX/TX DMA transfers involve at most three context switches, for example, Xen, kernel 0, and user copy.
Fig. 2 illustrates a multi-queue (VMQ) system 200 according to an embodiment of the present invention. It should be appreciated that, although a VMQ system is described for purposes of explanation, the embodiment concepts are also applicable to alternative systems. The VMQ system 200 uses an improved Netchannel-2 structure having components that support multi-queue TX/RX data flow. In one embodiment, relative to the existing Netchannel-2 structure, the following are added and/or modified: IGB netU driver 210, IGBX emulator 222, IGBX emulated PCI configuration 230, IGBU-1 driver, IGBU-n driver 232, IGB0 NIC driver 226, netUFP block 218, and fast path data processor 220.
In one embodiment, the MMIO control flow runs from the IGB NetU driver 210 in the guest kernel space of guest DomU to the IGBX emulator 222 in the privileged guest Dom0. The control flow then continues through the IGBU-1 224 driver and the IGB0 NIC driver in Dom0 kernel space, and finally reaches the NIC card 212 in hardware. The data flow, on the other hand, runs from the IGB NetU driver 210 in the guest kernel space, through the RX buffer 216 and TX buffer 214, to the NetUFP block 218 in the hypervisor 208. The data flow continues from the NetUFP block 218 to the fast path data processor 220, also in the hypervisor 208, and on to the NIC 212.
By removing Dom0 as a copying bottleneck, this embodiment method can offset the higher CPU cost caused by the guest grant copies performed in the netback driver. For example, a typical packet RX path of Netchannel2 for a VM has the following steps: a user copy; a data copy from the guest kernel to the user buffer; guest kernel code; Xen code executed in the guest context; a grant copy; kernel code in Dom0; and Xen code executed in the Dom0 kernel context. All of these steps together comprise seven context switches in the PV code. Netchannel2 optimizes the grant copy stage by moving this function into the guest, improving RX performance by placing the packet in the guest CPU cache context and thereby offloading the Dom0 CPU cache. This eliminates the Dom0 copy stage. Netchannel2 is also used to map RX descriptors directly into guest memory through guest cache mapping. Returning to Fig. 1, the netfront driver 102 in the guest domain 104 grants the I/O channel access to the I/O buffers, to allow the multi-queue device driver to use the netfront driver. The driver domain 108 holds the netback 106 and the grant mechanism. The driver domain is responsible for the direct pinning and safety of the page descriptors involved in TX/RX activity.
Netchannel2 with direct queue mapping of the per-guest RX path performs user-mode guest copies, in which the packet is copied to local guest memory and the user-mode guest copies are mapped to the RX queue memory descriptors, thereby reducing the grant copies across the buffer copy time, guest kernel mode, Dom0 kernel mode, and Xen code. With the grant copy improved, RX performance necessarily improves. For example, physical CPU consumption drops from 50% to 21%. However, the number of context switches is reduced by only one, namely, the Dom0 user-mode grant copy step of the original PV method. The TX path has a similar number of context switches. The cost of context switching affects the overall throughput of the system, and ultimately its scalability. Furthermore, the context switches between guest/Dom0/Xen can also consume thousands of CPU cycles, affecting overall power. Embodiment methods therefore exploit multiple queues in a manner similar to Netchannel2, but reduce the number of context switches from six to three, eliminating the grant copy and the Dom0 user and kernel context switches. One embodiment method relies on MSI, and assigns an MSI to each guest queue IRQ for buffer/skb availability.
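The context-switch accounting in the preceding paragraphs can be restated as a small worked comparison. The step names below paraphrase the description; only the counts are taken from the text, and the exact three-step fast path is an assumption based on the "Xen, kernel 0, and user copy" example.

```python
# Seven-step Netchannel2 RX path from the description, versus the
# embodiment's three-switch fast path. Step names paraphrase the text.

NETCHANNEL2_RX = [
    "user copy",
    "guest kernel -> user buffer copy",
    "guest kernel code",
    "Xen code in guest context",
    "grant copy",
    "Dom0 kernel code",
    "Xen code in Dom0 kernel context",
]

# Netchannel2's grant-copy optimization removes only one step (the Dom0
# user-mode grant copy), leaving six context switches.
NETCHANNEL2_OPTIMIZED = [s for s in NETCHANNEL2_RX if s != "grant copy"]

# Embodiment fast path: no Dom0 involvement on the data path.
EMBODIMENT_RX = ["Xen", "kernel", "user copy"]

assert len(NETCHANNEL2_RX) == 7
assert len(NETCHANNEL2_OPTIMIZED) == 6
assert len(EMBODIMENT_RX) == 3
```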
In one embodiment, the architecture described above is divided into two parts: the TX/RX data path, referred to as the fast path; and the device initialization/control path, referred to as the slow path. Both paths are presented to the guest as emulated devices, as two independent PCI functions of the same PCI device. The slow-path device is an ordinary emulated device whose functions are all handled by QEMU. The fast-path device is a PCI function whose PCI configuration space is still emulated by QEMU, but whose register-set emulation is handled by the hypervisor. The fast path also uses queues to set up DMA operations. These queues map directly into the guest address space, but their contents are translated on the fly by the hypervisor while it emulates the fast-path register set. In one embodiment, MSI-X support is present, with each MSI-X vector further exposed to the guest, for example for RX buffer availability. Flow interrupts are forwarded directly to the guest by the hypervisor.
In one embodiment, the queue-related low-level architecture is implemented by hardware as a per-guest architecture. Interrupts can also be per-guest resources, so message-signaled interrupts (MSI-X) can be used. In one embodiment, each queue uses one interrupt vector, and another interrupt vector is used for control. Alternatively, each guest can use a single interrupt vector. In one embodiment, MSI-X vectors are first converted by the hypervisor into INT-X interrupts before being delivered to the guest. In one embodiment, Dom0 can see the complete NIC, including the PCI configuration space, the register sets, and the MSI-X interrupt capability. The NIC queues can be mapped directly during device setup.
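A per-queue MSI-X vector layout of the kind just described can be sketched as a simple allocation table. Assigning one control vector per guest is an assumption of this sketch; the text leaves the exact control-vector arrangement open, and all names here are illustrative.

```python
def allocate_msix(num_guests, queues_per_guest=1):
    """Assign one MSI-X vector per guest queue plus one control vector per
    guest, mirroring the per-queue interrupt layout described above.
    (Per-guest control vectors are an assumption of this sketch.)"""
    table, vec = {}, 0
    for g in range(num_guests):
        qvecs = []
        for _ in range(queues_per_guest):
            qvecs.append(vec)
            vec += 1
        table[g] = {"queues": qvecs, "control": vec}
        vec += 1
    return table

t = allocate_msix(2, queues_per_guest=2)
assert t[0] == {"queues": [0, 1], "control": 2}
assert t[1] == {"queues": [3, 4], "control": 5}
```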
In one embodiment, the IGB0 NIC driver serves as a side driver that handles all initialization, control, and error operations. If Dom0 also uses the NIC to send and receive network data, then a queue is likewise fully controlled by the Dom0 driver. The driver limits itself to its own per-domain resources (queue registers, interrupt bits, etc.) and does not touch any guest resources.
In one embodiment, IGBU serves as the DomU side driver, to initialize its own portion of the controller. This IGBU handles its own TX/RX queues and related registers, and restricts itself to its own per-domain resources, without touching any other guest's resources. In one embodiment, if the security benefit outweighs the performance impact, this guest self-restriction can be enforced in the hypervisor.
In one embodiment, QEMU presents the new emulated device to the guest and handles all entry points into the device configuration space. The guest sees a QEMU hypercall used to connect the MSI-X interrupt vectors to the correct INT-X. Another hypercall is performed to let the hypervisor know the range of MMIO it has to emulate for the guest network device. One important parameter is the guest pool number, which is the index of any hardware resource of the guest. In one embodiment, QEMU lets the guest see this number through a PCI configuration register.
As for the hypervisor, at domain initialization time an instance of the fast-path device control is created, which controls the NIC hardware resources belonging to the given guest. After QEMU calls the applicable hypercall to set the MMIO emulation range, the hypervisor can intercept and emulate all MMIOs performed by the guest network driver on behalf of that guest. One important aspect of the emulation is the translation of guest buffer addresses. The hypervisor accomplishes this by keeping the guest queue address supplied on the queue-address register write; then, on each write to the queue tail register, it examines all the addresses that the guest has written into the queue and translates them.
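The tail-register translation just described can be modeled in miniature: the hypervisor records the descriptor addresses the guest writes, and on a trapped tail-register write it rewrites every new descriptor from a guest frame number to a machine frame number. The gpfn-to-mfn mapping and all class and method names below are hypothetical.

```python
class FastPathEmu:
    """Miniature model of the hypervisor-side fast-path register emulation:
    guest descriptor addresses (gpfns) are rewritten to machine frames (mfns)
    when the guest writes the queue tail register."""
    def __init__(self, gpfn_to_mfn):
        self.gpfn_to_mfn = gpfn_to_mfn
        self.queue = []   # descriptor ring, as the (modeled) NIC will see it
        self.tail = 0     # index up to which the NIC may consume

    def guest_writes_descriptor(self, gpfn):
        # The guest appends buffer addresses in its own address space.
        self.queue.append(gpfn)

    def guest_writes_tail(self, new_tail):
        # Trapped MMIO: translate every address added since the old tail,
        # then publish the new tail to the (modeled) hardware.
        for i in range(self.tail, new_tail):
            self.queue[i] = self.gpfn_to_mfn[self.queue[i]]
        self.tail = new_tail

emu = FastPathEmu({0x100: 0xA100, 0x101: 0xA200})
emu.guest_writes_descriptor(0x100)
emu.guest_writes_descriptor(0x101)
emu.guest_writes_tail(2)
assert emu.queue == [0xA100, 0xA200]  # the NIC now sees machine addresses
```

The design point this models is that the guest never needs to know machine frame numbers: translation happens once, at the single trapped register write that publishes the descriptors.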
Fig. 3 a show the high level block diagram of embodiment DMA transmitted data path 300.An embodiment In, launch stream and start from passenger plane application-level.Such as, application call DomU304 is Tracking sends packet with mailing to (...) 306.System is called and is mail to the switching of (...) 306 context To the passenger plane network driver NetUDrv308 of driving simulation equipment, therefore, in device register group The each MMIO completed will be captured in management program 310.Management program 310 uses and is subordinated to mould The equipment of block NetUFP312 imitates MMIO and drives hardware, thus uses accordingly at fast path Reason program module 314.
Because the guest driver cannot know machine frame numbers and can only access guest physical frame numbers, the page translation is completed on the fly while the MMIO is emulated. The DMA engine of the NIC 316 is then set to start transmitting the data on the wire. After the packet 318 is sent, the NIC 316 sends back an interrupt. The hypervisor 310 forwards the interrupt to the guest kernel. The guest kernel calls the appropriate handler to perform cleanup.
Fig. 3b shows a high-level block diagram of an embodiment DMA receive data path 320. In one embodiment, the receive flow starts with a packet 324 arriving from the network. After the packet has been fully DMAed into system memory, the NIC 316 sends an interrupt to the hypervisor 310, which then forwards the interrupt to the guest kernel. The interrupt handler in the guest NetUDrv 308 uses MMIO to check the number of filled buffers and then forwards those buffers to the upper network stack.
New buffers are then allocated and set up in the RX queue. The setup procedure ends with an MMIO to the queue tail register, which is trapped by the hypervisor 310. The hypervisor's emulation of this special register in NetUFP 312 also handles translation of all buffer addresses written into the RX queue.
Fig. 3c shows a high-level block diagram of an embodiment MMIO control path 330, which is implemented as a fully emulated device. In one embodiment, this emulation is performed by QEMU 332 in Dom0 302. QEMU 332 goes through the Dom0 NIC driver Net0Drv 334 to control, and to receive control events to and from, the NIC 316. The emulated device is exposed to the guest as a second PCI function of the same emulated network device used for the data path. In one embodiment, a single interrupt vector is allocated to this PCI function, and QEMU 332 uses this interrupt vector to send control events to the guest driver, which handles these operations in NetUDrv 308. MMIOs performed by the guest's NetUDrv 308 are forwarded back to QEMU 332 in Dom0 302. Because control operations are performed infrequently compared with data-path operations, the latency of these operations has little or no impact on data throughput or data latency.
In one embodiment, the server can be implemented with 25 GB of memory and a quad-core Intel 5500-based CPU with a 1 GbE controller having multi-queue (VMD-q) capability; the quad-core CPU runs a 2.6.32 kernel as Dom0 and operates in 64-bit and 32/64-bit Linux guest environments. Measurements are taken using TCP packets with Ethernet packet sizes of 52 bytes and a maximum of 1500 bytes. It should be appreciated that this server configuration is only one example of many server configurations that could implement embodiments of the invention.
Embodiment accelerated network path implementations remove the buffer-copy overhead that occurs in grant-based schemes. For example, the Dom0 overhead during I/O data transfers is zero, and the Dom0 CPU cache is not polluted by guest-domain data. Because the NIC queue memory is mapped directly into the guest, the computational burden on the VCPU scheduler is reduced, which improves the fairness of the credit scheduler. Finally, moving TX/RX data handling into the guest OS lets the guest OS driver allocate buffers after RX operations, giving better control over data buffer alignment during buffer recycling. This also reduces pollution of the DomU CPU cache. Table 1 shows a performance comparison between an existing Netchannel2 implementation and an embodiment accelerated network path implementation. It can be seen that, under Dom0 CPU load, the embodiment accelerated network path implementation has a more linear I/O latency response as the number of guests increases, compared with the Netchannel2 channelized scheme.
Table 1: TX/RX performance comparison
At least some of the described embodiment features and methods may be implemented in a network device or component, for example, an Ethernet or Internet Protocol (IP) node. For example, the features/methods of the invention may be implemented using hardware, firmware, and/or software installed to run on hardware. The network device/component or node may be any device that transports frames through a network, such as an Ethernet or IP network. For example, the network device/component may comprise bridges, switches, routers, or various combinations of such devices. As shown in Fig. 4, the network device/component 400 may comprise a plurality of ingress ports 402 or units for receiving frames from other nodes, logic circuitry 406 for determining which nodes to send the frames to, and a plurality of egress ports 404 or units for transmitting frames to the other nodes.
Fig. 5 shows a processing system 600 that can be used to implement methods of the present invention. In this case, the main processing is performed by processor 602, which can be a microprocessor, digital signal processor, or any other appropriate processing device. In some embodiments, processor 602 can be implemented by a plurality of processors. Program code (e.g., code implementing the algorithms described above) and data can be stored in memory 604. The memory can be local memory such as DRAM, or mass storage such as a hard drive, optical drive, or other storage (which may be local or remote). While the memory is functionally illustrated with a single block, it should be understood that one or more hardware blocks can be used to implement this function.
In one embodiment, the processor can be used to implement some or all of the units shown in Fig. 5. For example, the processor can serve as a specific functional unit at different times to implement the subtasks involved in performing the techniques of the present invention. Alternatively, different hardware blocks (e.g., the same as or different from the processor) can be used to perform different functions. In other embodiments, some subtasks are performed by the processor while others are performed using a separate circuit.
Fig. 5 also illustrates an I/O port 606, which serves as an interface to a network interface device. Network interface device 608 can be implemented as a network interface card (NIC) configured as described above and according to the embodiments described above; the network interface device provides an interface to the network.
According to an embodiment, a method of operating a virtual machine on a server includes: controlling data path resources allocated to the virtual machine using a first supervisory process running on the server, where controlling the data path resources includes controlling a data path of a hardware interface device coupled to the server; and controlling a control path and initialization resources of the hardware interface device using a second program running on the server, the second program being separate from the first program. In one embodiment, the first supervisory process may be a hypervisor and the second program may be a control plane. In some embodiments, controlling the data path of the hardware interface device includes controlling a data path of a network interface card (NIC).
In some embodiments, controlling the data path resources further includes the first supervisory process intercepting and emulating all memory-mapped input/output (MMIO) performed by the virtual machine. The emulating may include translating buffer addresses of the virtual machine. In some embodiments, the first supervisory process monitors the data path resources to prevent security violations.
In one embodiment, the method also includes transmitting and receiving packets between the virtual machine and the hardware interface via the first supervisory process. Transmitting a data packet may include: initiating a system call in a guest user domain to send the data packet; context-switching the data packet from the guest user domain to a guest kernel; context-switching the data packet from the guest kernel to the first supervisory process; and transmitting the data packet from the first supervisory process to the hardware interface device. Furthermore, receiving a data packet may include: transferring the packet from the hardware interface device to the first supervisory process; context-switching the data packet from the first supervisory process to the guest kernel; and context-switching the data packet from the guest kernel to a system call in the user domain.
According to another embodiment, a method of operating a server system running a plurality of virtual machines includes: loading a control plane program; loading a supervisory process; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the supervisory process; and controlling a control path and initialization resources of the hardware interface device via the control plane program. In some embodiments, the supervisory process includes a hypervisor. Controlling the data path resources may include the supervisory process intercepting and emulating all memory-mapped input/output (MMIO) performed by the virtual machine. In one embodiment, the emulating includes translating buffer addresses of the virtual machine. In one embodiment, controlling the data path resources may include mapping data from a queue of the virtual machine to a queue of the hardware interface device.
In one embodiment, the method further includes transmitting and receiving packets between the virtual machine and the hardware interface via the supervisory process. Furthermore, in some embodiments, instantiating the virtual machine includes instantiating a plurality of virtual machines.
According to another embodiment, a data processing system for a virtual machine includes: a processor; a memory coupled to the processor; and an interface port configured to be coupled to a hardware network interface device separate from the processor. The processor is configured to run a first program that transfers data via the interface port between a packet queue associated with the virtual machine and a packet queue of the hardware network interface. The processor is further configured to run a second program that controls the configuration of the hardware network interface device, the second program being separate from the first program. In one embodiment, the processor can be configured to run the virtual machine.
In one embodiment, the first program is a hypervisor, which may include a fast-path data driver coupled to a device that performs memory-mapped input/output (MMIO) and emulation for the virtual machine. In some embodiments, the hardware network interface device is a network interface card (NIC). Some systems may also include the hardware network interface device itself.
According to another embodiment, a non-transitory computer-readable medium stores an executable program. The program instructs a processor to perform the following steps: loading a control plane program; loading a supervisory process; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the supervisory process; and controlling a control path and initialization resources of the hardware interface device via the control plane program. The step of controlling the data path resources may further include the supervisory process performing the steps of intercepting and emulating all memory-mapped input/output (MMIO) performed by the virtual machine. Furthermore, the emulating step may include translating buffer addresses of the virtual machine. In one embodiment, the supervisory process may be a hypervisor.
In some embodiments, the step of controlling the data path resources includes mapping data from a queue of the virtual machine to a queue of the hardware interface device. The program may further instruct the processor to perform the step of transmitting and receiving packets between the virtual machine and the hardware interface via the supervisory process.
Advantages of embodiments include: a reduced number of context switches; reduced CPU cost by using direct I/O page mapping via a multi-queue NIC; improved guest VM I/O latency compared with existing PV implementations, which directly affects scalability; and the ability to operate in x86-64 and x86-32 modes. The side benefit of reduced CPU load also improves VM scalability and allows better power management.
Other advantages include improved I/O latency. For example, compared with other recent PV approaches such as Netchannel2, embodiment I/O mechanisms can improve I/O latency by, for example, about 68%, and by about 100% compared with traditional PV approaches.
Furthermore, compared with current PV implementations such as Netchannel2, exemplary embodiments provide significant benefits while retaining the advantages of current PV, such as guest address space separation; discriminated I/O trapping for accurate control of inter-VM I/O; and protection against faulty drivers that could crash the system. Other advantages include retaining VM I/O scalability regardless of the number of VMs.
While the present invention has been described with reference to illustrative embodiments, this description is not intended to limit the invention. Upon reference to this description, those of ordinary skill in the art will readily appreciate various modifications and combinations of the illustrative embodiments, as well as other embodiments of the present invention. The appended claims are therefore intended to encompass any such modifications or embodiments.

Claims (19)

1. A method of operating a virtual machine on a server, the method comprising:
controlling data path resources allocated to the virtual machine using a first supervisory process running on the server, wherein controlling the data path resources comprises controlling a data path of a hardware interface device coupled to the server;
controlling a control path and initialization resources of the hardware interface device using a second program running on the server, the second program being separate from the first supervisory process; and
transmitting and receiving data packets between the virtual machine and the hardware interface via the first supervisory process, wherein transmitting a data packet comprises:
initiating a system call in a guest user domain to send the data packet;
context-switching the data packet from the guest user domain to a guest kernel;
context-switching the data packet from the guest kernel to the first supervisory process; and
transmitting the data packet from the first supervisory process to the hardware interface device.
2. The method according to claim 1, wherein:
the first supervisory process comprises a hypervisor; and
the second program comprises a control plane.
3. The method according to claim 1, wherein controlling the data path of the hardware interface device comprises controlling a data path of a network interface card.
4. The method according to claim 1, wherein controlling the data path resources further comprises the first supervisory process intercepting and emulating all memory-mapped input/output performed by the virtual machine.
5. The method according to claim 4, wherein the emulating comprises translating buffer addresses of the virtual machine.
6. The method according to claim 1, wherein the first supervisory process monitors the data path resources to prevent security violations.
7. The method according to claim 1, further comprising receiving a data packet, wherein receiving the data packet comprises:
transferring the data packet from the hardware interface device to the first supervisory process;
context-switching the data packet from the first supervisory process to a guest kernel; and
context-switching the data packet from the guest kernel to a system call in a user domain.
8. A method of operating a server system running a plurality of virtual machines, the method comprising:
loading a control plane program;
loading a supervisory process;
instantiating a virtual machine;
controlling data path resources between the virtual machine and a hardware interface device via the supervisory process;
controlling a control path and initialization resources of the hardware interface device via the control plane program; and
transmitting and receiving data packets between the virtual machine and the hardware interface via the supervisory process, wherein transmitting a data packet comprises:
initiating a system call in a guest user domain to send the data packet;
context-switching the data packet from the guest user domain to a guest kernel;
context-switching the data packet from the guest kernel to the supervisory process; and
transmitting the data packet from the supervisory process to the hardware interface device.
9. The method according to claim 8, wherein controlling the data path resources further comprises the supervisory process intercepting and emulating all memory-mapped input/output performed by the virtual machine.
10. The method according to claim 9, wherein the emulating comprises translating buffer addresses of the virtual machine.
11. The method according to claim 8, wherein controlling the data path resources comprises mapping data from a queue of the virtual machine to a queue of the hardware interface device.
12. The method according to claim 8, wherein the supervisory process comprises a hypervisor.
13. The method according to claim 8, wherein instantiating the virtual machine comprises instantiating a plurality of virtual machines.
14. A data processing system for a virtual machine, the data processing system comprising:
a processor;
a memory coupled to the processor; and
an interface port configured to be coupled to a hardware network interface device separate from the processor, wherein the processor is configured to:
run a first supervisory process that transfers data via the interface port between a packet queue associated with the virtual machine and a packet queue of the hardware network interface device; and
run a second program that controls a configuration of the hardware network interface device, the second program being separate from the first supervisory process;
wherein the first supervisory process transmits and receives data packets between the virtual machine and the hardware network interface device, and transmitting a data packet comprises:
initiating a system call in a guest user domain to send the data packet;
context-switching the data packet from the guest user domain to a guest kernel;
context-switching the data packet from the guest kernel to the first supervisory process; and
transmitting the data packet from the first supervisory process to the hardware network interface device.
15. The data processing system according to claim 14, wherein the processor is further configured to run the virtual machine.
16. The data processing system according to claim 14, wherein the first supervisory process is a hypervisor.
17. The data processing system according to claim 16, wherein the hypervisor comprises a fast-path data driver coupled to a device that performs memory-mapped input/output and emulation for the virtual machine.
18. The data processing system according to claim 14, wherein the hardware network interface device is a network interface card.
19. The data processing system according to claim 14, further comprising the hardware network interface device.
CN201180045902.1A 2010-10-01 2011-09-28 System and method for controlling the input/output of a virtualized network Active CN103140830B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US38907110P 2010-10-01 2010-10-01
US61/389,071 2010-10-01
PCT/US2011/053731 WO2012044700A1 (en) 2010-10-01 2011-09-28 System and method for controlling the input/output of a virtualized network
US13/247,578 2011-09-28
US13/247,578 US9213567B2 (en) 2010-10-01 2011-09-28 System and method for controlling the input/output of a virtualized network

Publications (2)

Publication Number Publication Date
CN103140830A CN103140830A (en) 2013-06-05
CN103140830B true CN103140830B (en) 2016-11-30

Family

ID=

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1455566A (en) * 2002-12-20 2003-11-12 中国科学院沈阳自动化研究所 On-the-spot bus scatter control station


Similar Documents

Publication Publication Date Title
US8832688B2 (en) Kernel bus system with a hyberbus and method therefor
US9213567B2 (en) System and method for controlling the input/output of a virtualized network
Dong et al. High performance network virtualization with SR-IOV
US7945436B2 (en) Pass-through and emulation in a virtual machine environment
Dall et al. ARM virtualization: performance and architectural implications
CN102609298B (en) Based on network interface card virtualization system and the method thereof of hardware queue expansion
Nussbaum et al. Linux-based virtualization for HPC clusters
US20120254862A1 (en) Efficent migration of virtual functions to enable high availability and resource rebalance
US10852990B2 (en) Hybrid framework of NVMe-based storage system in cloud computing environment
US9529615B2 (en) Virtual device emulation via hypervisor shared memory
US20200150997A1 (en) Windows live migration with transparent fail over linux kvm
WO2013091221A1 (en) Enabling efficient nested virtualization
Shea et al. Network interface virtualization: challenges and solutions
Mohebbi et al. Zivm: A zero-copy inter-vm communication mechanism for cloud computing
Zhang et al. High-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters
US10164911B2 (en) Shim layer used with a virtual machine virtual NIC and a hardware platform physical NIC
US20210209040A1 (en) Techniques for virtualizing pf-vf mailbox communication in sr-iov devices
CN117389694B (en) Virtual storage IO performance improving method based on virtio-blk technology
Ma et al. InfiniBand virtualization on KVM
CN114397999A (en) Communication method, device and equipment based on nonvolatile memory interface-remote processing message transmission
Schroeder et al. VISAGE: An object-oriented scientific visualization system
Chang et al. Virtualization technology for TCP/IP offload engine
CN103140830B (en) The system and method that the input/output of virtual network is controlled
Mouzakitis et al. Lightweight and generic RDMA engine para-virtualization for the KVM hypervisor
WO2017026931A1 (en) Implementing input/output in a virtualized environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant