CN103140830B - System and method for controlling the input/output of a virtual network - Google Patents
System and method for controlling the input/output of a virtual network
- Publication number
- CN103140830B (application CN201180045902.1A)
- Authority
- CN
- China
- Prior art keywords
- virtual machine
- packet
- hypervisor
- guest
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
According to an embodiment, a method of running a virtual machine on a server includes: using a first hypervisor (310) running on the server to control data path resources allocated to the virtual machine, including controlling a data path of a hardware interface device (316) coupled to the server; and using a second program (304) running on the server to control a control path and initialization resources of the hardware interface device, wherein the second program (304) is separate from the first hypervisor (310).
Description
This application claims priority to U.S. Provisional Application No. 61/389,071, entitled "Accelerated Network Path: A New Method of Using Multi-Queue NIC Assistance to Improve Virtual Network Input/Output Performance," filed on October 1, 2010, and to U.S. Application No. 13/247,578, entitled "System and Method for Controlling the Input/Output of a Virtualized Network," filed on September 28, 2011, the contents of both of which are incorporated herein by reference.
Technical field
The present invention relates to computer servers and, in particular embodiments, to a system and method for controlling the input/output (IO) of a virtualized network.
Background of invention
Virtualization technology is known to be a key enabler of infrastructure as a service (IaaS), the core of various cloud deployments. Over roughly the past decade, advances in x86 hardware assistance have paved the way for virtualization solutions focused on performance and scalability. A hypervisor, also referred to as a virtual machine monitor (VMM), uses software instruction interception mechanisms to emulate the central processing unit (CPU), memory, and IO resources for an operating system (OS) running as a guest. A correctly written VMM can provide a reliable, secure, and accurate system view to the guest OS.
A VMM can use paravirtualization to guarantee coherent CPU, memory, and IO responses. For CPU virtualization, before the virtualization technology (VT) extensions were implemented, VMMs such as VMware/Xen used complex binary translation (BT), CPU schedulers, and ring and instruction emulation to "trap and execute" virtual instructions on the actual CPU core. Modern CPUs, however, can handle guest-to-root partition transitions, offloading the heavy burden of CPU virtualization from the VMM. This is generally achieved by providing CPU instructions that execute "batches" of guest instructions on the physical CPU. For example, the mainstream x86 architectures offered by manufacturers such as Intel® and Advanced Micro Devices provide secure virtual machine (SVM)/virtual machine extensions (VMX) instructions to carry out VMENTER/VMEXIT operations. In addition, CPU improvements such as the tagged translation lookaside buffer (TLB), pause-loop exiting (PLE), and virtual processor identifiers (VPID) reduce the number of CPU cycles required for guest CPU sharing. Memory optimizations have also focused on shadow page tables, reducing guest page walks by optimizing the guest address translation and overhead caused by VMEXITs.
IO virtualization remains an important area for improving virtualized IO performance and scalability, and network IO is one of the core aspects of network virtualization receiving the most attention. Multi-queue applications, hardware-based virtualization technologies such as Intel's virtualization technology for directed IO (VT-d) for direct device attachment and single-root IO virtualization (SR-IOV), and their software-based counterparts, such as the Xen isolated driver domain (IDD) and Netchannel2, have improved the overall per-guest network interface card (NIC) performance, but there is still much room for improvement in areas such as CPU loading, architectural cost, and information security.
Summary of the invention
According to an embodiment, a method of running a virtual machine on a server includes: using a first hypervisor running on the server to control data path resources allocated to the virtual machine, including controlling a data path of a hardware interface device coupled to the server; and using a second program running on the server to control a control path and initialization resources of the hardware interface device, wherein the second program is separate from the first hypervisor.
According to another embodiment, a method of operating a server system running a plurality of virtual machines includes: loading a control plane program; loading a hypervisor; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the hypervisor; and controlling a control path and initialization resources of the hardware interface device via the control plane program.
According to another embodiment, a data processing system for virtual machines includes: a processor; a memory coupled to the processor; and an interface port for coupling to a hardware network interface device separate from the processor. The processor is configured to run a first program that transfers data via the interface port between a packet queue associated with a virtual machine and a packet queue in the hardware network interface. The processor is further configured to run a second program that controls the configuration of the hardware network interface device, the second program being separate from the first program.
According to another embodiment, a non-transitory computer-readable medium stores an executable program. The program instructs a processor to perform the following steps: loading a control plane program; loading a hypervisor; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the hypervisor; and controlling a control path and initialization resources of the hardware interface device via the control plane program.
The foregoing has outlined rather broadly the features of embodiments of the present invention in order that the detailed description that follows may be better understood. Additional features and advantages of embodiments of the invention, which form the subject of the claims of the invention, are described hereinafter. Those skilled in the art will appreciate that the concepts and specific embodiments disclosed may readily be used as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. Those skilled in the art will also realize that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
Brief Description of the Drawings
For a more complete understanding of the present invention and its advantages, reference is made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows the architecture of the prior art Xen Netchannel2;
Fig. 2 shows an embodiment multi-queue system architecture;
Figs. 3a-3c show block diagrams of embodiment transmit, receive, and control paths;
Fig. 4 shows an embodiment network device; and
Fig. 5 shows an embodiment processing system.
Detailed description of the invention
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
Some embodiment methods are alternatives to existing paravirtualization schemes and include unique building blocks that exploit the multi-queue capability of NIC hardware while reducing CPU context switches. Embodiment systems and methods provide an IO virtualization alternative that is usable with various hypervisor environments, including, but not limited to, the Xen hypervisor environment.
Some embodiment systems and methods (1) remove redundant context switches for packet transmit (TX)/receive (RX) operations; (2) utilize multi-queue NICs; (3) are applicable to PV or hardware-assisted virtualization (HVM) domain setups; and (4) exploit VT-d-like direct device attachment performance, thereby forming a scalable, low-latency network IO channel with better power management. In some embodiments, for example, an Intel® multi-queue NIC may be used. Alternatively, other NIC devices may be used.
Some embodiment systems include a split network device and IO (Split Network Device & IO) view for x86 systems, with a slow control path and a fast data path. This is also applicable to other devices, such as storage devices. Embodiment systems and methods also include methods of exposing IO device register handling and NIC queue mapping in the hypervisor environment, and/or reducing CPU context switches compared with existing paravirtualization approaches. Other systems may also improve per-CPU-core/thread virtual machine (VM) IO and VM page cache locality.
Some embodiment systems can be implemented by modifying and extending the Xen framework. Such solutions are also compatible with the CPU VT extensions. In one example embodiment, Xen 4.0 with a 2.6.32 kernel is used as Domain 0 (Dom0), running at 64 bits in a Linux guest environment with 32/64-bit guests. Alternatively, other environments can be used.
Native environments such as Xen use the QEMU emulator device model or paravirtualization to emulate network IO. One reason for performance degradation is the cost of the virtualized IO devices of each VM. In recent years, the development of hardware assists for virtualization has improved the overall CPU and memory performance of virtualized guests, so that a virtualized guest can achieve nearly native performance. The network device and its operation are a core pair of performance-sensitive indicators. Where TX/RX data is concerned, any operating system, from servers to desktop and notebook computers, and the applications running on it rely on nearly "lossless" performance. Under virtualization, network IO performance becomes even more critical. Embodiment methods therefore focus on improving network IO. Although some embodiments described herein use Xen as a sample hypervisor system to describe embodiment systems, other embodiment systems and methods are hypervisor-agnostic and generally work with other x86 hypervisors, including, but not limited to, the kernel virtual machine (KVM).
In one embodiment, the QEMU device model and paravirtualization both use software IO virtualization techniques. QEMU is an original open-source software IO emulator with PIIX3/PIIX4, a pre-ICH southbridge chipset, and can emulate the BIOS, ACPI, an IDE controller (floppy, CD-ROM, hard disk), framebuffer devices, and network controllers acting as host controllers (RealTek, e1000). Many custom extensions have since been added. Such embodiment setups allow IO channel operations from the guest, TX/RX via memory-mapped IO (MMIO) and direct memory access (DMA), to operate in a "trap and forward" mode on each physical device in the privileged guest, through gpfn-to-mfn translation, interrupt handling, and complex page table moves. These mechanisms give the emulator a "real" device map. The main problems of the QEMU-based emulator model are performance, scalability, and power management. These metrics are directly related to the "context switch" cost between the kernel/user modes of each hypervisor-based software layer.
Frequent switching between guest and host user mode, kernel mode, and the hypervisor and back causes a large number of instructions to be loaded on the CPU, potentially leading to performance delays, reduced scalability, increased instruction execution latency, and increased power consumption load. One original PV approach is to "cut out" the QEMU-based emulation used to handle IO channel operations (DMA/MMIO) and attach TX/RX buffers directly to the physical device through the physical device driver. The modified guest/dom0/Xen, including custom backend/frontend drivers, sets up memory-change pages (grant tables) and maps guest TX/RX buffers to the physical device TX/RX. The context switch cost in the original PV approach is high, because TX/RX activity passes through at least four context switches, e.g., from the guest to user/kernel, to Dom0, to the hypervisor, resulting in a longer cycle per packet and higher TLB overhead.
Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, and Ian Pratt propose an alternative approach in HP Labs technical report HPL-2008-39, "Bridging the Gap between Software and Hardware Techniques for I/O Virtualization," which is hereby incorporated herein by reference. That approach uses an IDD rather than Dom0 as the device IO channel host to improve guest IO channel performance. One problem with this solution is that the number of "interception" elements associated with context switching does not change. Furthermore, throughput drops significantly because more VMs cause delays. The approach also has problems with excessive VM TX/RX queues and unfair queue subscription. Most NICs have a single TX and RX queue per VM, and this queue suffers from over-consumption. For example, a VM with a highly "chatty" application can allow that VM's TX/RX buffers to occupy the physical device queues, "starving" other VMs of IO channel operations.
Hardware continues to develop rapidly to support hardware-based virtualization assists for hypervisors. NIC hardware vendors have produced SR-IOV-based cards and cards based on multi-queue capability, "offloading" the software-based TX/RX ring buffers. SR-IOV-based cards use a modified PCI configuration and a PF/VF-based master and slave device driver model, mapping NIC IO channels directly, as separate "physical devices," to VMs, thereby eliminating the dependency on a master hypervisor.
Fig. 1 shows Xen Netchannel2 100, with multi-queue device support and an IDD, which is an existing IO channel architecture and flow. The Xen PV architecture includes backend/frontend drivers, memory page sharing, and hypervisor-supported data structures, such as event channels for interrupt notification and callbacks, hypercalls for inter-domain communication, and grant tables. A guest's network IO channel view depends on the guest type, i.e., whether the guest operates in PV or HVM mode. Traditional HVM mode refers to the case in which the guest OS runs without any modification to its kernel or user mode components over its life cycle.
The Xen hypervisor 110 is a small hypervisor that handles memory management, VCPU scheduling, and interrupt scheduling. The Dom0-to-PV setup, including grant tables, event channels, and shared memory, supports all IO to control and act on behalf of guests. Dom0 obtains virtualized or paravirtualized IO setups to the QEMU device model via the Netfront/Netback driver mechanism. Additionally, fully virtualized or paravirtualized DomU guests receive all of their IO services from Dom0. Because fully virtualized IO is much more expensive in terms of CPU cycles, the pain point of that approach is slower system operation. Although paravirtualized IO throughput is good when the number of guests is small, it drops rapidly as the number of guests increases. In both cases, IO latency does not scale proportionally with the increasing number of guests. This represents a scalability difficulty for IaaS cloud VM IO.
Regarding current IO latency, PV uses shared memory, involving a "grant" mechanism with explicit memory copies to move TX/RX data back and forth between the guest and Dom0. An event channel between the guest and Dom0 is used to trigger TX/RX activity. The event actually interrupts Dom0 or the guest, depending on the direction of the data. Furthermore, delivering the interrupt requires a context switch into the hypervisor, which then context-switches into the intended domain. The PV approach therefore adds context switches, adding latency and in turn hurting VM scalability.
For the PV guest network IO flow, in the PV case, the guest virtual physical address (VPA) space is modified to map Xen memory (Xen-u). Typically, Xen memory includes about 12M of mapped memory, and the Xen VPA interacts directly with the frontend driver, which includes the IO channel handlers. Dom0 holds the backend driver, the native NIC driver, and the IO event channel, which is the Dom0 shared-memory data structure that drives the guest TX/RX buffer/packet flow. Device setup is generally completed at OS boot. The OS probes the underlying hardware, including the NIC card, and sets up the virtual PCI configuration space parameters and interrupt routing. Normal network IO includes network buffers, denoted TX/RX buffers. Virtual networking uses the same concepts; however, because virtual network IO is more focused on guest network IO performance, the CPU cost of virtual network IO on virtual machine nodes is relatively high. Old PV approaches have shortcomings such as suboptimal network bridge implementations and higher CPU saturation. CPU saturation also means that fewer cycles are available to VM applications, making the approach less viable for cloud computing IaaS. Netchannel2 is also adapted to utilize multi-queue NICs and/or SR-IOV NICs.
In one embodiment of the present invention, a modified PV framework supports multiple HW queues by reducing context switches; the modified PV framework improves IO scalability and supports VMDQ. This is achieved by using a dual view of the NIC device. In one embodiment, the NIC driver in Dom0 performs control operations, including control of the PCI configuration space and interrupt routing; the activity in this control operation is mainly MMIO. Fast-path queue memory mapping is handled in the hypervisor driver. Both drivers can see the device register sets, control structures, and so on. In one embodiment, an RX/TX DMA transfer has at most three context switches, e.g., Xen, kernel 0, and user copy.
Fig. 2 shows a multi-queue (VMQ) system 200 according to an embodiment of the present invention. It should be appreciated that although the VMQ system is described for purposes of illustration, embodiment concepts are also applicable to alternative systems. The VMQ system 200 uses an improved Netchannel2 structure having components that work with multi-queue TX/RX data flow. In one embodiment, relative to the existing Netchannel2 structure, the IGB netU driver 210, IGBX emulator 222, IGBX emulated PCI configuration 230, IGBU-1 driver, IGBU-n driver 232, IGB0 NIC driver 226, netUFP block 218, and fast path data processor 220 are added and/or changed.
In one embodiment, the MMIO control flow runs from the IGB NetU driver 210 in the guest kernel space of guest DomU to the IGBX emulator 222 in the privileged guest Dom0. The control flow then continues through the IGBU-1 224 driver and the IGB0 NIC driver in Dom0 kernel space, finally reaching the NIC card 212 in hardware. The data flow, on the other hand, runs from the IGB NetU driver 210 in guest kernel space, through the RX buffer 216 and TX buffer 214, to the NetUFP block 218 in the hypervisor 208. The data flow continues from the NetUFP block 218 to the fast path data processor 220, also in the hypervisor 208, and on to the NIC 212.
By removing Dom0 as a copy bottleneck, this embodiment method can offset the higher CPU cost caused by the guest grant copies completed in the netback driver. For example, a typical packet RX path for a VM in Netchannel2 has the following steps: user copy; data copy from the guest kernel to the user buffer; guest kernel code; Xen code executed in guest context; grant copy; kernel code in Dom0; and Xen code executed in Dom0 kernel context. All these steps comprise seven context switches in the PV code. Netchannel2 optimizes the grant copy stage by moving that function to the guest, improving RX performance by placing packets in the guest CPU cache context, thereby offloading the Dom0 CPU cache. This can eliminate the Dom0 copy stage. Netchannel2 also makes the guest cache mapping of RX descriptors map directly into guest memory. Returning to Fig. 1, the Netfront driver 102 in guest domain 104 grants the IO channel to IO buffers, to allow the multi-queue device driver to use the Netfront driver. Driver domain 108 holds Netback 106 and the grant mechanism. The driver domain is responsible for the direct pinning and safety of the page descriptors involved in TX/RX activity.
Netchannel2 with direct queue mapping of the per-guest RX path has a user-mode guest copy, in which packets are copied to local guest memory, with the user-mode guest copy mapping the memory descriptors to the RX queue, thereby reducing the grant copies in the buffer copy stage, in guest kernel mode, in Dom0 kernel mode, and in Xen code. After the grant copy improvement, RX performance necessarily improves; for example, physical CPU consumption drops from 50% to 21%. However, the number of context switches is reduced by only one, namely, the Dom0 user-mode grant copy step of the original PV approach. The TX path has a similar number of context switches. The cost of context switching affects the overall throughput of the system, and ultimately the system's scalability. Moreover, the number of context switches among guest/Dom0/Xen can also consume thousands of CPU cycles, affecting overall power. Embodiment methods therefore utilize multi-queues similar to Netchannel2, but reduce the number of context switches from six to three, eliminating the grant copy and the Dom0 user and kernel context switches. One embodiment method relies on MSI, distributing an MSI to each guest queue IRQ upon buffer/skb availability.
In one embodiment, the above framework is divided into two parts: the TX/RX data path, referred to as the fast path; and the device initialization/control path, referred to as the slow path. Both paths are presented to the guest as an emulated device, as two standalone functions of the same PCI device. The slow path device is a normal emulated device, all of whose functions are handled by QEMU. The fast path device is a PCI function whose PCI configuration space is still emulated by QEMU, but whose register set emulation is handled by the hypervisor. The fast path also uses queues to set up DMA operations. These queues are mapped directly into the guest address space, but their contents are translated on the fly by the hypervisor while the fast path register set is emulated. In one embodiment, there is MSI-X support, where each MSI-X vector is further exposed to the guest as RX buffer availability. Flow interrupts are forwarded directly to the guest by the hypervisor.
In one embodiment, the queue-related low-level framework is implemented by hardware with a per-guest architecture. Interrupts can also be per-guest resources, so message signaled interrupts (MSI-X) can be used. In one embodiment, each queue uses one interrupt vector, and another interrupt vector is used for control. Alternatively, each guest may use one interrupt vector. In one embodiment, MSI-X vectors are first converted by the hypervisor into INT-X interrupts before being delivered to the guest. In one embodiment, Dom0 can see the entire NIC: the PCI configuration space, the register sets, and the MSI-X interrupt capability. The NIC queues can be mapped directly during device setup.
In one embodiment, the IGB0 NIC driver acts as a side driver, handling all initialization, control, and fault operations. If Dom0 also uses the NIC to send and receive network data, then such a queue is likewise fully controlled by the Dom0 driver. The driver limits itself to its own per-domain resources (queue registers, interrupt bits, etc.) and does not touch any guest resource. In one embodiment, IGBU acts as the DomU counterpart driver, initializing its own part of the controller. The IGBU handles its own TX/RX queues and the related registers, and limits itself to its own per-domain resources, without touching any other guest's resources. In one embodiment, if security is to be strengthened at some cost in performance, this guest self-limiting can be enforced in the hypervisor.
In one embodiment, QEMU presents the new emulated device to the guest and handles all entries into the device configuration space. The guest sees a QEMU hypercall for connecting the MSI-X interrupt vectors to the correct INT-X. Another hypercall is executed to let the hypervisor know the range of MMIO it has to emulate for the guest network device. One important parameter is the guest pool number, which is the guest's index into any hardware resource. In one embodiment, QEMU lets the guest see this number through a PCI configuration register.
As for the hypervisor, at domain initialization time an instance of the fast path device control is created, which controls the NIC hardware resources belonging to the given guest. After QEMU calls the applicable hypercall to set the MMIO emulation range, the hypervisor can intercept and emulate all MMIO performed by the guest network driver. One embodiment aspect of the emulation is the translation of guest buffer addresses. The hypervisor accomplishes this on guest queue addresses written to the registers that hold queue addresses (register writes): the hypervisor checks every address written by the guest and converts it before the write reaches the queue tail register.
Fig. 3a shows a high-level block diagram of an embodiment DMA transmit data path 300. In one embodiment, the transmit flow starts at the guest application level. For example, an application in DomU 304 invokes the sendto(...) system call 306 to send a packet. The sendto(...) system call 306 context-switches into the guest network driver NetUDrv 308, which drives the emulated device; therefore, each MMIO performed on the device register set is trapped into the hypervisor 310. The hypervisor 310 uses the device emulation belonging to module NetUFP 312 to emulate the MMIO and drive the hardware, thereby using the corresponding fast path handler module 314.
Because the guest driver cannot know machine frame numbers and can only access guest physical frame numbers, page translation is completed on the fly while the MMIO is emulated. The DMA engine of NIC 316 is then set up to start transmitting data on the wire. After the packet 318 is sent, NIC 316 sends back an interrupt. The hypervisor 310 forwards the interrupt to the guest kernel. The guest kernel calls the appropriate handler to perform cleanup.
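The transmit sequence above can be traced end to end as a short list of hops. Component names follow Fig. 3a; the event strings themselves are illustrative:

```python
# Minimal trace of the Fig. 3a transmit flow, one appended event per hop.
def transmit(trace):
    trace.append("DomU app: sendto()")                 # guest application level
    trace.append("NetUDrv: MMIO to device registers")  # guest driver, trapped
    trace.append("hypervisor/NetUFP: emulate MMIO, translate pages")
    trace.append("NIC: DMA engine transmits on wire")
    trace.append("NIC interrupt -> hypervisor -> guest kernel cleanup")
    return trace
```

Note that Dom0 appears nowhere in this trace; the data path runs entirely between the guest, the hypervisor, and the NIC.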
Fig. 3b shows a high-level block diagram of an embodiment DMA receive data path 320. In one embodiment, the receive flow starts with a packet 324 arriving from the network. After the packet has been fully DMAed into system memory, NIC 316 sends an interrupt to the hypervisor 310. The hypervisor 310 then forwards the interrupt to the guest kernel. The interrupt handler in the guest NetUDrv 308 uses MMIO to check the number of filled buffers, and then forwards those buffers to the upper network stack. New buffers are then allocated and placed in the RX queue. The setup ends with an MMIO write to the queue tail register. This MMIO is trapped by the hypervisor 310. The hypervisor emulation of this particular register in NetUFP 312 also handles the translation of all buffer addresses written into the RX queue.
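The RX completion-and-refill step can be sketched as below: the interrupt handler walks the ring, delivers filled buffers to the stack, and republishes fresh buffers, finishing with the trapped tail write. The ring representation and function names are illustrative assumptions:

```python
# Sketch of the guest-side RX interrupt handling described above.
def rx_interrupt(ring, alloc_buffer, deliver):
    """ring: list of (addr, filled) slots; returns the number delivered."""
    delivered = 0
    for i, (addr, filled) in enumerate(ring):
        if filled:
            deliver(addr)                      # forward to upper network stack
            ring[i] = (alloc_buffer(), False)  # republish a fresh buffer
            delivered += 1
    # A real driver would now perform the MMIO write to the queue tail
    # register, which the hypervisor traps in order to translate the
    # newly written buffer addresses (as in Fig. 3b).
    return delivered
```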
Fig. 3c shows a high-level block diagram of an embodiment MMIO control path 330, which is implemented as a fully emulated device. In one embodiment, this emulation is performed by QEMU 332 in Dom0 302. QEMU 332 goes through the Dom0 NIC driver Net0Drv 334 to issue and receive control events to and from NIC 316. The emulated device is presented to the guest as a second PCI function of the same emulated network device used for the data path. In one embodiment, a single interrupt vector is allocated to this PCI function, and QEMU 332 uses this interrupt vector to send control events to the guest driver NetUDrv 308, which handles these operations. MMIO performed by the guest NetUDrv 308 is forwarded back to QEMU 332 in Dom0 302. Because control operations are performed infrequently compared with data path operations, the latency of these operations has little or no impact on data throughput or data latency.
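The split between the latency-critical fast path and the infrequent control path can be sketched as a dispatch on the trapped register. The register classification here is an assumption for illustration:

```python
# Sketch: data-path MMIO (queue and tail registers) is handled in the
# hypervisor fast path, while infrequent control-plane accesses are
# forwarded to QEMU in Dom0, as in Fig. 3c.

DATA_PATH_REGS = {"tx_tail", "rx_tail", "queue_addr"}  # illustrative set

def route_mmio(reg, fast_path, qemu):
    """Dispatch a trapped MMIO either to the fast path or to QEMU."""
    if reg in DATA_PATH_REGS:
        return fast_path(reg)  # latency-critical, stays in the hypervisor
    return qemu(reg)           # control event, slow path via Dom0
```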
In one embodiment, a server implementation may use 25 GB of memory and an Intel 5500-based four-core CPU with a multi-queue (VMD-q) capable Intel 1 GbE controller, the four-core CPU running a 2.6.32 kernel as Dom0 and operating in 64-bit and 32/64-bit Linux guest environments. Measurements are taken using TCP packets with Ethernet packet sizes from 52 bytes up to a maximum of 1500 bytes. It should be appreciated that this server configuration is only one example of the many server configurations that can implement embodiments of the invention.
Embodiment accelerated network path implementations remove the buffer copy overhead that occurs in the grant mechanism. For example, the Dom0 overhead during I/O data transfer is zero, and the Dom0 CPU cache is not polluted by guest domain data. Because the NIC queue memory is mapped directly into the guest, the computational burden on the VCPU scheduler is reduced, which improves the fairness of the credit scheduler. Finally, moving the TX/RX data handling into the guest OS lets the guest OS driver allocate buffers after the RX operation, providing better control over data cache alignment during buffer copies. This also reduces pollution of the DomU CPU cache. Table 1 shows a performance comparison between an existing NetChannel2 implementation and an embodiment accelerated network path implementation. Here, it can be seen that, under Dom0 CPU load, the embodiment accelerated network path implementation has a more linear I/O latency response as the number of guests increases, compared with the NetChannel2 scheme.
Table 1: TX/RX performance comparison
At least some of the described embodiment features and methods may be implemented in a network device or component, such as an Ethernet or Internet Protocol (IP) node. For example, the features/methods of the present invention may be implemented using hardware, firmware, and/or software installed to run on hardware. The network device/component or node may be any device that transports frames through a network, such as an Ethernet network or an IP network. For example, the network device/component may include a bridge, a switch, a router, or various combinations of such devices. As shown in Figure 4, a network device/component 400 may comprise: a plurality of ingress ports 402 or units for receiving frames from other nodes; logic circuitry 406 for determining the destination nodes to which frames are sent; and a plurality of egress ports 404 or units for transmitting frames to the other nodes.
Fig. 5 shows a processing system 600 that can be used to implement the methods of the present invention. In this case, the main processing is performed by processor 602, which can be a microprocessor, a digital signal processor, or any other suitable processing device. In some embodiments, the processor 602 can be implemented by multiple processors. Program code (e.g., code implementing the algorithms described above) and data can be stored in memory 604. The memory can be local memory such as DRAM, or mass storage such as a hard disk drive, optical disk drive, or other memory (which can be local or remote). Although the memory is functionally illustrated as a single block, it should be understood that one or more hardware blocks can be used to implement this function.
In one embodiment, the processor can be used to implement some or all of the units shown in Fig. 5. For example, the processor can serve as a specific functional unit at different times to implement the subtasks involved in performing the techniques of the present invention. Alternatively, different hardware blocks (e.g., the same as or different from the processor) can be used to perform different functions. In other embodiments, some subtasks are performed by the processor while other subtasks are performed using separate circuitry.
Fig. 5 also shows an I/O port 606, which serves as an interface to a network interface device. The network interface device 608 can be implemented as a network interface card (NIC) configured as described above and according to the embodiments described above, and the network interface device provides an interface to the network.
According to an embodiment, a method of operating a virtual machine on a server includes: controlling data path resources allocated to the virtual machine using a first supervisory program running on the server, where controlling the data path resources includes controlling a data path to a hardware interface device coupled to the server; and controlling a control path and initialization resources of the hardware interface device using a second program running on the server, where the second program is separate from the first supervisory program. In one embodiment, the first supervisory program can be a hypervisor, and the second program can be a control plane. In some embodiments, controlling the data path to the hardware interface device includes controlling the data path of a network interface card (NIC).
In some embodiments, controlling the data path resources further includes the first supervisory program intercepting and emulating all memory-mapped input/output (MMIO) performed by the virtual machine. Emulating can include translating buffer addresses of the virtual machine. In some embodiments, the first supervisory program monitors the data path resources to prevent security violations.
In one embodiment, the method also includes transmitting and receiving packets between the virtual machine and the hardware interface via the first supervisory program. Transmitting a data packet can include: initiating a system call in a guest user domain to send the packet; context switching the packet from the guest user domain to a guest kernel; context switching the packet from the guest kernel to the first supervisory program; and transmitting the packet from the first supervisory program to the hardware interface device. Furthermore, receiving a data packet can include: transferring the packet from the hardware interface device to the first supervisory program; context switching the packet from the first supervisory program to the guest kernel; and context switching the packet from the guest kernel to a system call in the user domain.
According to another embodiment, a method of operating a server system running a plurality of virtual machines includes: loading a control plane program; loading a supervisory program; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the supervisory program; and controlling a control path and initialization resources of the hardware interface device via the control plane program. In some embodiments, the supervisory program includes a hypervisor. Controlling the data path resources can include the supervisory program intercepting and emulating all memory-mapped input/output (MMIO) performed by the virtual machine. In one embodiment, emulating includes translating buffer addresses of the virtual machine. In one embodiment, controlling the data path resources can include mapping data from a queue of the virtual machine to a queue of the hardware interface device.
In one embodiment, the method further includes transmitting and receiving packets between the virtual machine and the hardware interface via the supervisory program. Furthermore, in some embodiments, instantiating the virtual machine includes instantiating a plurality of virtual machines.
According to a further embodiment, a data processing system for virtual machines includes: a processor; a memory coupled to the processor; and an interface port for coupling to a hardware network interface device separate from the processor. The processor is configured to run a first program that transfers data via the interface port between a packet queue associated with the virtual machine and a packet queue in the hardware network interface. The processor is further configured to run a second program that controls the configuration of the hardware network interface device, where the second program is separate from the first program. In one embodiment, the processor can be used to run the virtual machine.
In one embodiment, the first program is a hypervisor, which can include a fast path data driver that is coupled to memory-mapped input/output (MMIO) and emulates the device of the virtual machine. In some embodiments, the hardware network interface device is a network interface card (NIC). Some systems may also include the hardware network interface device itself.
According to a further embodiment, a non-transitory computer-readable medium stores an executable program. The program instructs a processor to perform the following steps: loading a control plane program; loading a supervisory program; instantiating a virtual machine; controlling data path resources between the virtual machine and a hardware interface device via the supervisory program; and controlling a control path and initialization resources of the hardware interface device via the control plane program. The step of controlling the data path resources can further include the supervisory program performing the step of intercepting and emulating all memory-mapped input/output (MMIO) performed by the virtual machine. In addition, the emulating step can include translating buffer addresses of the virtual machine. In one embodiment, the supervisory program can be a hypervisor.
In some embodiments, the step of controlling the data path resources includes mapping data from a queue of the virtual machine to a queue of the hardware interface device. The program can also further instruct the processor to perform the step of transmitting and receiving packets between the virtual machine and the hardware interface via the supervisory program.
Advantages of embodiments include: a reduced number of context switches; reduced CPU cost through direct I/O page mapping utilizing a multi-queue NIC; improved guest VM I/O latency compared with existing PV implementations, which directly affects scalability; and the ability to operate in x86-64 and x86-32 bit modes. A side effect of the reduced CPU load is improved VM scalability and better power management. Other advantages include improved I/O latency. For example, compared with other recent PV approaches such as NetChannel2, embodiment I/O mechanisms can improve I/O latency by, for example, about 68%, and by about 100% compared with traditional PV approaches.
In addition, compared with current PV implementations such as NetChannel2, exemplary embodiments provide significant benefits while retaining the advantages of current PV, such as: guest address space separation; accurate I/O trapping for fine-grained control of I/O between VMs; and protection against faulty drivers that could cause system crashes. Other advantages include retaining VM I/O scalability regardless of the number of VMs.
Although the present invention has been described with reference to illustrative embodiments, this description is not meant to limit the invention. Those of ordinary skill in the art, upon reference to this description, will readily recognize various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention. It is therefore intended that the appended claims encompass any such modifications or embodiments.
Claims (19)
1. A method of operating a virtual machine on a server, the method comprising:
using a first supervisory program running on the server to control data path resources allocated to the virtual machine, wherein controlling the data path resources comprises controlling a data path to a hardware interface device coupled to the server;
using a second program running on the server to control a control path and initialization resources of the hardware interface device, the second program being separate from the first supervisory program; and
transmitting and receiving data packets between the virtual machine and the hardware interface via the first supervisory program, wherein transmitting a data packet comprises:
initiating a system call in a guest user domain to send the data packet;
context switching the data packet from the guest user domain to a guest kernel;
context switching the data packet from the guest kernel to the first supervisory program; and
transmitting the data packet from the first supervisory program to the hardware interface device.
2. The method according to claim 1, wherein:
the first supervisory program comprises a hypervisor; and
the second program comprises a control plane.
3. The method according to claim 1, wherein controlling the data path to the hardware interface device comprises controlling a data path of a network interface card.
4. The method according to claim 1, wherein controlling the data path resources further comprises the first supervisory program intercepting and emulating all memory-mapped input/output performed by the virtual machine.
5. The method according to claim 4, wherein emulating comprises translating buffer addresses of the virtual machine.
6. The method according to claim 1, wherein the first supervisory program monitors the data path resources to prevent security violations.
7. The method according to claim 1, further comprising receiving a data packet, wherein receiving the data packet comprises:
transferring the data packet from the hardware interface device to the first supervisory program;
context switching the data packet from the first supervisory program to a guest kernel; and
context switching the data packet from the guest kernel to a system call in a user domain.
8. A method of operating a server system running a plurality of virtual machines, the method comprising:
loading a control plane program;
loading a supervisory program;
instantiating a virtual machine;
controlling data path resources between the virtual machine and a hardware interface device via the supervisory program;
controlling a control path and initialization resources of the hardware interface device via the control plane program; and
transmitting and receiving data packets between the virtual machine and the hardware interface via the supervisory program, wherein transmitting a data packet comprises:
initiating a system call in a guest user domain to send the data packet;
context switching the data packet from the guest user domain to a guest kernel;
context switching the data packet from the guest kernel to the supervisory program; and
transmitting the data packet from the supervisory program to the hardware interface device.
9. The method according to claim 8, wherein controlling the data path resources further comprises the supervisory program intercepting and emulating all memory-mapped input/output performed by the virtual machine.
10. The method according to claim 9, wherein emulating comprises translating buffer addresses of the virtual machine.
11. The method according to claim 8, wherein controlling the data path resources comprises mapping data from a queue of the virtual machine to a queue of the hardware interface device.
12. The method according to claim 8, wherein the supervisory program comprises a hypervisor.
13. The method according to claim 8, wherein instantiating the virtual machine comprises instantiating a plurality of virtual machines.
14. A data processing system for virtual machines, the data processing system comprising:
a processor;
a memory coupled to the processor; and
an interface port for coupling to a hardware network interface device separate from the processor, wherein the processor is configured to:
run a first supervisory program that transfers data via the interface port between a packet queue associated with the virtual machine and a packet queue in the hardware network interface device; and
run a second program that controls a configuration of the hardware network interface device, the second program being separate from the first supervisory program;
wherein the first supervisory program transmits and receives data packets between the virtual machine and the hardware network interface device, and transmitting a data packet comprises:
initiating a system call in a guest user domain to send the data packet;
context switching the data packet from the guest user domain to a guest kernel;
context switching the data packet from the guest kernel to the first supervisory program; and
transmitting the data packet from the first supervisory program to the hardware network interface device.
15. The data processing system according to claim 14, wherein the processor is further configured to run the virtual machine.
16. The data processing system according to claim 14, wherein the first supervisory program is a hypervisor.
17. The data processing system according to claim 16, wherein the hypervisor comprises a fast path data driver that is coupled to memory-mapped input/output and emulates a device of the virtual machine.
18. The data processing system according to claim 14, wherein the hardware network interface device is a network interface card.
19. The data processing system according to claim 14, further comprising the hardware network interface device.
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US38907110P | 2010-10-01 | 2010-10-01 | |
US61/389,071 | 2010-10-01 | ||
PCT/US2011/053731 WO2012044700A1 (en) | 2010-10-01 | 2011-09-28 | System and method for controlling the input/output of a virtualized network |
US13/247,578 | 2011-09-28 | ||
US13/247,578 US9213567B2 (en) | 2010-10-01 | 2011-09-28 | System and method for controlling the input/output of a virtualized network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103140830A CN103140830A (en) | 2013-06-05 |
CN103140830B true CN103140830B (en) | 2016-11-30 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1455566A (en) * | 2002-12-20 | 2003-11-12 | 中国科学院沈阳自动化研究所 | On-the-spot bus scatter control station |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8832688B2 (en) | Kernel bus system with a hyberbus and method therefor | |
US9213567B2 (en) | System and method for controlling the input/output of a virtualized network | |
Dong et al. | High performance network virtualization with SR-IOV | |
US7945436B2 (en) | Pass-through and emulation in a virtual machine environment | |
Dall et al. | ARM virtualization: performance and architectural implications | |
CN102609298B (en) | Based on network interface card virtualization system and the method thereof of hardware queue expansion | |
Nussbaum et al. | Linux-based virtualization for HPC clusters | |
US20120254862A1 (en) | Efficent migration of virtual functions to enable high availability and resource rebalance | |
US10852990B2 (en) | Hybrid framework of NVMe-based storage system in cloud computing environment | |
US9529615B2 (en) | Virtual device emulation via hypervisor shared memory | |
US20200150997A1 (en) | Windows live migration with transparent fail over linux kvm | |
WO2013091221A1 (en) | Enabling efficient nested virtualization | |
Shea et al. | Network interface virtualization: challenges and solutions | |
Mohebbi et al. | Zivm: A zero-copy inter-vm communication mechanism for cloud computing | |
Zhang et al. | High-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters | |
US10164911B2 (en) | Shim layer used with a virtual machine virtual NIC and a hardware platform physical NIC | |
US20210209040A1 (en) | Techniques for virtualizing pf-vf mailbox communication in sr-iov devices | |
CN117389694B (en) | Virtual storage IO performance improving method based on virtio-blk technology | |
Ma et al. | InfiniBand virtualization on KVM | |
CN114397999A (en) | Communication method, device and equipment based on nonvolatile memory interface-remote processing message transmission | |
Schroeder et al. | VISAGE: An object-oriented scientific visualization system | |
Chang et al. | Virtualization technology for TCP/IP offload engine | |
CN103140830B (en) | The system and method that the input/output of virtual network is controlled | |
Mouzakitis et al. | Lightweight and generic RDMA engine para-virtualization for the KVM hypervisor | |
WO2017026931A1 (en) | Implementing input/output in a virtualized environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |