US20150378770A1 - Virtual machine backup - Google Patents

Virtual machine backup

Info

Publication number
US20150378770A1
US20150378770A1 (application US14/727,245)
Authority
US
United States
Prior art keywords
cache
log
processor unit
virtual machine
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/727,245
Inventor
Guy L. Guthrie
Naresh Nayar
Geraint North
William J. Starke
Albert J. Van Norstrand, Jr.
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GlobalFoundries Inc
Original Assignee
GlobalFoundries Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GlobalFoundries Inc filed Critical GlobalFoundries Inc
Priority to US14/727,245
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: NORTH, GERAINT; GUTHRIE, GUY L.; STARKE, WILLIAM J.; VAN NORSTRAND, ALBERT J., JR.; NAYAR, NARESH
Publication of US20150378770A1
Assigned to GLOBALFOUNDRIES U.S. 2 LLC. Assignment of assignors interest (see document for details). Assignors: INTERNATIONAL BUSINESS MACHINES CORPORATION
Assigned to GLOBALFOUNDRIES INC. Assignment of assignors interest (see document for details). Assignors: GLOBALFOUNDRIES U.S. 2 LLC; GLOBALFOUNDRIES U.S. INC.
Assigned to GLOBALFOUNDRIES U.S. INC. Release by secured party (see document for details). Assignors: WILMINGTON TRUST, NATIONAL ASSOCIATION
Status: Abandoned

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0804Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1471Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1405Saving, restoring, recovering or retrying at machine instruction level
    • G06F11/141Saving, restoring, recovering or retrying at machine instruction level for bus or memory accesses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1438Restarting or rejuvenating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1479Generic software techniques for error detection or fault masking
    • G06F11/1482Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F11/1484Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/203Failover techniques using migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2035Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • G06F12/0831Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0833Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means in combination with broadcast means (e.g. for invalidation or updating)
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0855Overlapped cache accessing, e.g. pipeline
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/128Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2046Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share persistent storage
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2097Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45583Memory management, e.g. access or allocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/86Event-based monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/15Use in a specific computing environment
    • G06F2212/151Emulated environment, e.g. virtual machine
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/62Details of cache specific to multiprocessor cache arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/62Details of cache specific to multiprocessor cache arrangements
    • G06F2212/621Coherency control relating to peripheral accessing, e.g. from DMA or I/O device

Definitions

  • Embodiments of the inventive subject matter generally relate to the field of virtual machines and, more particularly, to hypervisors supporting one or more virtual machines.
  • Virtualization is commonly applied on computer systems to improve the robustness of the implemented computing architecture to faults and to increase utilization of the resources of the architecture.
  • In such virtualization, the processor units (for example, processors and/or processor cores) of a computer system host one or more virtual machines (VMs).
  • This concept is an important facilitator of a “high availability” service provided by such a VM.
  • In one approach, a physical host responds to a failure of another physical host by simply rebooting the VM from a shared disk state, for example a shared image of the VM. This, however, increases the risk of disk corruption and of losing the exposed state of the VM altogether.
  • In another approach, all VM memory is periodically marked as read-only to allow changes to the VM memory to be replicated in a copy of the VM memory on another host.
  • a hypervisor is able to trap all writes that a VM makes to memory and maintain a map of pages that have been dirtied since the previous round.
  • the migration process atomically reads and resets this map, and the iterative migration process involves chasing dirty pages until progress can no longer be made.
  • This approach improves failover robustness because a separate up-to-date image of the VM memory is periodically created on a backup host that can simply launch a replica of the VM using this image following a hardware failure of the primary host.
  • a drawback of this approach is that as the VM remains operational during the read-only state of its VM memory, a large number of page faults can be generated.
  • This approach also does not allow for the easy detection of which portion of a page has been altered, so whole pages must be replicated even if only a single bit on a page has changed. This is detrimental to the performance of the overall architecture: for instance, small page sizes have to be used to avoid excessive data traffic between systems, which in turn reduces the performance of the operating system, as the operating system is unable to use large pages.
  • Another failover approach discloses a digital computer memory cache organization implementing efficient selective cache write-back, mapping and transferring of data for the purpose of roll-back and roll-forward of, for example, databases.
  • Write or store operations to cache lines tagged as logged are written through to a log block builder associated with the cache.
  • Non-logged store operations are handled local to the cache, as in a write-back cache.
  • the log block builder combines write operations into data blocks and transfers the data blocks to a log splitter.
  • a log splitter demultiplexes the logged data into separate streams based on address.
  • the above approaches are not without problems.
  • For instance, the first approach is sensitive to page faults because the VM memory is put into a read-only state.
  • large amounts of data may have to be stored for each checkpoint, which causes pressure on the resource utilization of the computing architecture, in particular the data storage facilities of the architecture.
  • Embodiments generally include a method that includes indicating, in a log, updates to memory of a virtual machine when the updates are evicted from a cache of the virtual machine.
  • the method further includes determining a guard band for the log.
  • the guard band indicates a threshold amount of free space for the log.
  • the method further includes determining that the guard band will be, or has been, encroached upon as a result of indicating an update in the log.
  • the method further includes updating a backup image of the virtual machine based, at least in part, on a set of one or more entries of the log.
  • the set of entries is sufficient to comply with the guard band.
  • the method further includes removing the set of entries from the log.
  • Embodiments include a computer system comprising a processor unit arranged to run a hypervisor running one or more virtual machines; a cache connected to the processor unit and comprising a plurality of cache rows, each cache row comprising a memory address, a cache line and an image modification flag; and a memory connected to the cache and arranged to store an image of at least one virtual machine. The processor unit is arranged to define a log in the memory. The cache further comprises a cache controller arranged to: set the image modification flag for a cache line modified by a virtual machine being backed up; periodically check the image modification flags; and write only the memory address of the flagged cache rows in the defined log. The processor unit is further arranged to monitor the free space available in the defined log and to trigger an interrupt if the free space available falls below a specific amount.
  • Embodiments generally include a method of operating a computer system comprising a processor unit arranged to run a hypervisor running one or more virtual machines; a cache connected to the processor unit and comprising a plurality of cache rows, each cache row comprising a memory address, a cache line and an image modification flag; and a memory connected to the cache and arranged to store an image of at least one virtual machine; the method comprising the steps of defining a log in the memory; setting the image modification flag for a cache line modified by a virtual machine being backed up; periodically checking the image modification flags; writing only the memory address of the flagged cache rows in the defined log; monitoring the free space available in the defined log, and triggering an interrupt if the free space available falls below a specific amount.
  • a hypervisor is arranged to host a VM as well as act as a VM image replication manager to create a replica of a VM image in another location, for example in the memory of another computer system.
  • the cache rows include an image modification flag that signals the modification of a cache line by the execution of the VM and, therefore, a change to the VM image. Including an image modification flag in the cache row allows the memory addresses of the modified cache lines to be written to the log without at the same time requiring the cache lines to be flushed from the cache, which reduces the amount of data that needs to be transferred from the cache when updating the log.
  • the image modification flag is only set if the change to a cache line is caused by a virtual machine operation that relates to a virtual machine being backed up. If the change to a cache line is caused by a virtual machine that is not being backed up, or results from the hypervisor operating in privilege mode, then the image modification flag is not set. This reduces the amount of unnecessary data that is backed up at a checkpoint.
  • the log is a circular buffer that contains some unprocessed log entries.
  • the producer core writes new entries to the log, and registers indicate where the unprocessed log entries start and end. When the log entries reach the end of the buffer, they wrap around to the beginning. As the consumer core processes entries, the “unprocessed log entries start here” register is updated. If the consumer core is unable to process the entries with sufficient speed, the producer core's new entries can collide with the unprocessed log entries, and this is the point at which a re-sync or failover must occur.
  • a guard band is a space between the current location to which new log entries are written and the start of the unprocessed entries.
  • the processor unit is arranged to monitor the free space available in the log and to trigger an interrupt if the free space available falls below a specific amount (a guard band). If the head of the log entries moves to within the guard band, an interrupt is triggered.
  • the size of the guard band may be static or dynamic in nature.
  • the guard band should be large enough to contain all the data that might be emitted as part of a checkpoint. This means that when an interrupt is delivered on entry to the guard band, execution of the producer core can be halted and a cache flush initiated. At this point, all of the required log entries are in the circular buffer, and the producer core can be resumed once the consumer core has processed enough log entries to clear the backlog. This avoids the need to do a full memory re-sync or failover in the event that the consumer core is unable to keep up with the producer core.
  • the specific amount of minimum free space available in the log comprises a predetermined amount derived from a sum of the write-back cache sizes, a component representing the number of instructions in the CPU pipeline that have been issued but not yet completed and a component representing the number of new instructions that will be issued in the time taken for an interrupt to be delivered to the processor core.
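  • As an illustration only, the guard-band check described above might look as follows in C. The structure and function names, the 8-byte record size and the guard-band components are assumptions for this sketch, not definitions from the embodiments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the per-processor-unit log registers: a circular
 * buffer of 8-byte log records (names assumed for illustration). */
struct log_regs {
    uint64_t base;      /* base (wrap-around) address of the circular buffer */
    uint64_t producer;  /* next address the producer core will write to      */
    uint64_t consumer;  /* start of the unprocessed log entries              */
    uint64_t size;      /* total size of the buffer in bytes                 */
};

/* Free space between the producer head and the start of the unprocessed
 * entries, accounting for wrap-around. */
static uint64_t log_free_space(const struct log_regs *r)
{
    if (r->producer >= r->consumer)
        return r->size - (r->producer - r->consumer);
    return r->consumer - r->producer;
}

/* Guard band sized as described above: the sum of the write-back cache sizes
 * plus allowances for stores already issued but not completed and for stores
 * issued while the interrupt is being delivered (one 8-byte record each). */
static uint64_t guard_band_bytes(uint64_t writeback_cache_bytes,
                                 uint64_t inflight_stores,
                                 uint64_t stores_until_interrupt)
{
    return writeback_cache_bytes +
           8 * (inflight_stores + stores_until_interrupt);
}

/* True when writing one more record would encroach on the guard band; this
 * is the point at which the interrupt would be raised so that the producer
 * core can be halted and a cache flush initiated. */
static bool guard_band_encroached(const struct log_regs *r, uint64_t guard)
{
    return log_free_space(r) <= guard + sizeof(uint64_t);
}
```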
  • the processor unit is arranged to run multiple execution threads, in a technique commonly referred to as “Simultaneous Multithreading (SMT).”
  • the hypervisor is arranged to maintain a thread mask, flagging those threads that relate to one or more virtual machines being backed up.
  • the hypervisor refers to the thread mask to determine whether to set the image modification flag for the current cache line being modified.
  • Each cache row further comprises a thread ID indicating which execution thread is responsible for modification of the cache line in the respective cache row.
  • A single bitfield register, called a thread mask, is added to each processor unit, with a number of bits equal to the number of hardware threads supported by that unit, and hypervisor-privileged operations are added to set those bits.
  • the hypervisor (which knows which virtual machines are running on which hardware threads) sets the associated bits in the thread mask for the hardware threads that are running virtual machines that require checkpoint-based high-availability protection.
  • A new field, the thread ID, is added alongside the image modification flag on every cache line. The thread ID field is sufficiently large to contain the ID of the hardware thread that issued the store operation (i.e., two bits if four hardware threads are supported).
  • The image modification flag is set in the cache only if the store was not executed when running in the hypervisor privilege mode and if the thread mask bit corresponding to the currently executing hardware thread is set.
  • these store operations can also write the ID of the hardware thread that issued the store to the cache line's thread ID field.
  • the log record is directed to a different log based on the value of the thread ID, with the processor core capable of storing position and size information for multiple logs. When this alternative is used, it is not necessary to write the thread ID field to the log.
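  • A minimal C sketch of the marking rule and log selection just described, assuming a hypothetical 32-bit thread-mask register and illustrative function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* A store marks its cache line as part of a backed-up VM image only if it
 * was not issued while running in hypervisor privilege mode and the
 * thread-mask bit of the issuing hardware thread is set. */
static bool marks_vm_image(uint32_t thread_mask,
                           unsigned hw_thread,
                           bool hypervisor_privilege)
{
    return !hypervisor_privilege && (thread_mask & (1u << hw_thread)) != 0;
}

/* In the per-thread-log variant, the destination log is selected from the
 * hardware thread ID recorded with the cache line, so the thread ID itself
 * need not be written into the log record. */
static unsigned select_log(unsigned hw_thread)
{
    return hw_thread; /* one log per supported hardware thread */
}
```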
  • the above aspects allow multiple virtual machines to execute on a single processor unit concurrently, with any number of them running with checkpoint-based high-availability protection.
  • the presence of the thread ID in the logs, coupled with the hypervisor's record of which virtual machines are currently running on which processor cores and hardware threads, is sufficient to allow the secondary host (the memory location where the backup image is stored) to update the correct virtual machine memory image on receipt of the logs.
  • the cache controller typically is further adapted to write the memory address of a flagged cache line in the defined log upon the eviction of the flagged line from the cache. This captures flagged changes to the VM image that are no longer guaranteed to be present in the cache during the periodic inspection of the image modification tags.
  • the computer system is further arranged to update a backup image of the virtual machine in a different memory location by retrieving the memory addresses from the log; obtaining the modified cache lines using the retrieved memory addresses; and updating the further image with said modified cache lines.
  • the logged memory addresses are used to copy the altered data of the primary image to the copy of the VM image, which copy may for instance be located on another computer system.
  • VM images may be synchronized without incurring additional page faults, and traffic between systems is reduced due to the smaller granularity of the data modification, i.e. cache line size rather than page size. Because the VM is suspended during image replication, no page protection is necessary. This approach is furthermore page-size-agnostic, such that various page sizes can be used. Moreover, the additional hardware cost to the computer system is minimal; only minor changes to the cache controller, for example to the cast-out engine and the snoop-intervention engine of the cache controller, and to the cache rows of the cache are required to ensure that the cache controller periodically writes the memory address of each dirty cache line to the log through periodic inspection of the image modification flag during execution of the VM.
  • the computer system may replicate data from the primary VM image to a copy in push or pull fashion.
  • A processor unit of the same computer system, for example the processor unit running the VM or a different processor unit, may also be responsible, under control of the hypervisor, for updating the copy of the image of the VM in the different memory location, which may be a memory location in the memory of the same computer system or in the memory of a different computer system.
  • a processor unit of a different computer system may be adapted to update the copy of the VM image in the memory location on this different computer system by pulling the memory addresses and associated modified cache lines from the computer system hosting the VM.
  • the cache may include a write-back cache, which may form part of a multi-level cache further including a write-through cache adapted to write cache lines into the write-back cache, wherein only the cache rows of the write-back cache comprise the flag.
  • the log which stores the addresses of changed cache lines is a circular buffer and the system comprises a plurality of registers adapted to store a first pointer to a wrap-around address of the circular buffer, a second pointer to the next available address of the circular buffer, a third pointer to an initial address of the circular buffer, and the size of the circular buffer.
  • the cache controller is adapted to update at least the second pointer following the writing of a memory address in the log.
  • Each processor unit is configured to deduplicate the memory addresses in the log prior to the retrieval of the addresses from the log. This reduces the amount of time required for synchronizing data between the memories by ensuring that the altered data in a logged memory location is copied once only. In this manner, the log is updated with the memory addresses of the modified cache lines without the need to flush the modified cache lines from the cache at the same time.
  • the processor unit typically further performs the step of writing the memory address of a flagged cache line in the defined log upon the eviction of said flagged line from the cache to capture flagged changes to the VM image that no longer are guaranteed to be present in the cache during the periodic inspection of the image modification tags.
  • FIG. 1 schematically depicts a computer system according to an embodiment of the present inventive subject matter
  • FIG. 2 schematically depicts an aspect of a computer system according to an embodiment of the present inventive subject matter in more detail
  • FIG. 3A schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail
  • FIG. 3B schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail
  • FIG. 4A schematically depicts a first portion of a flowchart of an aspect of a method of updating a computer system according to an embodiment of the present inventive subject matter
  • FIG. 4B schematically depicts a second portion of a flowchart of an aspect of a method of updating a computer system according to an embodiment of the present inventive subject matter
  • FIG. 5 schematically depicts a flowchart of another aspect of a method of updating a computer system according to an embodiment of the present inventive subject matter
  • FIG. 6 schematically depicts a flowchart of another aspect of a method of updating a computer system according to another embodiment of the present inventive subject matter
  • FIG. 7 schematically depicts a computer cluster according to an embodiment of the present inventive subject matter.
  • FIG. 8A schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail
  • FIG. 8B schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail.
  • FIG. 8C schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail.
  • FIG. 1 schematically depicts a computer system 100 .
  • the computer system 100 comprises a plurality of processor units 110 a - 110 d for hosting one or more virtual machines.
  • Four processor units 110 a - 110 d are shown by way of example; it should be understood that the computer system 100 may comprise any suitable number of processor units.
  • a processor unit is a unit of hardware that is capable of (pseudo-) autonomous execution of a computer program code, such as a processor, microprocessor or a core of a processor or microprocessor comprising a plurality of such cores.
  • Each processor unit 110 a - 110 d can be arranged to run a hypervisor, which is a software component that enables the provision of the virtual machine(s) to external users.
  • Each processor unit 110 a - 110 d further is connected to and has access to a cache 120 a - 120 d , which comprises a cache controller 122 a - 122 d in addition to a pool of entries 124 a - 124 d , with each entry including a cache line and one or more tags.
  • a cache 120 a - 120 d may reside in any suitable location. For instance, the cache 120 may be located on or in the vicinity of the processor unit 110 to reduce data retrieval latency.
  • each processor unit 110 a - 110 d has access to a dedicated cache 120 a - 120 d .
  • Four caches 120 a - 120 d are shown by way of example, one for each of the respective processor units 110 a - 110 d .
  • any suitable configuration may be chosen, for example a configuration in which a processor unit 110 a - 110 d has access to multiple caches 120 a - 120 d , which may be organized in a hierarchical structure, for example a combination of a level-1, level-2 and level-3 cache.
  • Each processor unit 110 a - 110 d is communicatively coupled to bus architecture 130 through its respective cache 120 a - 120 d , at least at a functional level. This means that any access of data by a processor unit 110 a - 110 d will involve its cache 120 a - 120 d . Any suitable bus architecture 130 may be chosen.
  • the computer system 100 further comprises a memory 140 coupled to the bus architecture 130 , which again may take any suitable form, for example a memory integrated in the computer system or a distributed memory accessible over a network.
  • the memory 140 is connected to the caches 120 a - 120 d .
  • the memory 140 may be volatile memory or non-volatile memory. Many other suitable embodiments of a memory 140 are possible.
  • the computer system 100 may comprise additional components such as one or more network interfaces, input ports, and output ports.
  • the computer system 100 is adapted to host one or more virtual machines on the processor units 110 a - 110 d , through the use of a hypervisor.
  • a VM is a software representation of a computing device capable of hosting anything from a single computer program to a complete operating system, and which may present itself as a separate system to the user of the computer system 100 , such that the user has no awareness of the underlying computer system 100 .
  • The computer system 100 may, for instance, embody a local area network (LAN) server having a plurality of processors, each comprising a number of cores.
  • One of the attractions of virtualization is improved robustness due to the ability to provide failover between VMs, which means that should a VM fail, a backup VM is available that will continue to provide the VM functionality to the user.
  • a copy of a VM is periodically updated so that the copy represents the actual current state of the original VM in case the original VM exhibits a failure and will have to failover to the copy VM.
  • the original VM will be referred to as the primary VM and its copy will be referred to as the secondary VM.
  • Such synchronization between the primary VM and the secondary VM requires the temporary suspension of the primary VM so that its state does not change during the synchronization.
  • the duration of such suspension should be minimized such that the one or more users of the VM are not noticeably affected by the temporary suspension.
  • Differential checkpoints, which capture changes in the state of an entity since the last checkpoint, are created.
  • Such checkpoints may be generated by writing the address and data from a cache line to a secondary memory such as a level-2 cache or the system memory 140 as soon as the data in a cache line is altered.
  • a large amount of data may be unnecessarily communicated during operation of the primary VM. For instance, if a cache line of the cache 120 a - 120 d used by the primary VM is updated multiple times during the operation mode of the primary VM, previous versions of the data in the cache line are unnecessarily written to the secondary memory as this ‘old’ data has become redundant.
  • the data storage part 124 a - 124 d comprises a plurality of cache rows 1210 , with each cache row 1210 including a tag 1212 which is the address of the data in memory 140 , a cache line 1214 and a number of flag bits.
  • the flag bits comprise a valid bit 1215 , which signals if the cache line 1214 is still relevant to the processor unit 110 a - 110 d , a dirty bit 1216 , which signals if the cache line 1214 has been altered such that it needs writing back to the address in memory 140 stored in the tag 1212 , an image modification flag 1217 and a thread ID field 1218 , which are described in more detail below.
  • the cache rows 1210 of a cache 120 a - 120 d capable of containing dirty cache lines 1214 include the VM image modification bit flag 1217 that signals whether the cache line 1214 is modified by a processor unit 110 a - 110 d executing a VM that is being backed up. In other words, this flag signals if the modified cache line 1214 forms part of a VM image for which a checkpoint based backup is operating.
  • the cache controller 122 will set both the dirty bit flag 1216 and the VM image modification flag 1217 to true upon a write access of the cache line 1214 by the processor unit 110 a - 110 d during the execution of a VM that is being backed up. The purpose of this will be explained in more detail below.
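  • For illustration, the cache row 1210 of FIG. 2 might be modelled in C as below; the field names and the 128-byte line size are assumptions for the sketch, while the reference numerals in the comments follow the description above.

```c
#include <stdbool.h>
#include <stdint.h>

struct cache_row {
    uint64_t tag;              /* 1212: address of the data in memory 140      */
    uint8_t  line[128];        /* 1214: the cache line itself (size assumed)   */
    bool     valid;            /* 1215: line still relevant to the processor   */
    bool     dirty;            /* 1216: line altered, needs writing back       */
    bool     image_modified;   /* 1217: line modified by a VM being backed up  */
    uint8_t  thread_id;        /* 1218: hardware thread that modified the line */
};

/* On a write access issued during execution of a VM that is being backed up,
 * the cache controller sets both the dirty bit 1216 and the VM image
 * modification flag 1217, as described above. */
static void on_backed_up_vm_store(struct cache_row *row, uint8_t hw_thread)
{
    row->dirty = true;
    row->image_modified = true;
    row->thread_id = hw_thread;
}
```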
  • the processor unit 110 a - 110 d hosting a primary VM may include a replication manager, which may be included in the design of the hypervisor, and/or which may be realized in hardware, in software, or a combination of hardware and software.
  • the replication manager is adapted to create a log in the system memory 140 for logging the memory addresses of the cache lines 1214 modified during the execution of the VM.
  • The data in the log is only accessible to the replication manager of a processor unit, including the replication managers of other processor units 110 a - 110 d of the computer system 100 or of processor units 110 a - 110 d of another computer system 100 , as will be explained in more detail later.
  • the memory address log in the memory 140 has a defined size and allocation to avoid corruption of the memory 140 .
  • Any suitable implementation of such a log may be chosen.
  • A particularly suitable implementation is shown in FIG. 3A .
  • the log is defined as a circular buffer 200 in the system memory 140 , and has a size 202 defined by the replication manager, which is preferably part of the hypervisor of the processor unit 110 a - 110 d .
  • the log 200 is designed to comprise a plurality of memory addresses in memory locations 204 .
  • a portion 206 is shown to indicate unused memory locations in the log 200 , which comprises the free space available in the defined log 200 .
  • the computer system 100 includes a set of registers 210 including a first register 212 in which the base address of the circular buffer 200 is stored, a second register 214 in which the next available address of the circular buffer is stored, a third register 216 in which the starting point of the circular buffer 200 is stored and a fourth register 218 in which the size 202 of the circular buffer 200 is stored.
  • the set of registers 210 are located on the respective processor unit 110 a - 110 d . In some implementations, the set of registers 210 may form part of the cache controller 122 .
  • the registers 210 also include a thread mask 220 , which contains a flag for each thread being executed by the respective processor unit 110 a - 110 d .
  • the thread mask 220 indicates those threads that relate to a virtual machine that is being backed up.
  • the replication manager of the processor element 110 a - 110 d will populate the registers 212 , 214 , 216 and 218 and the thread mask 220 with the appropriate values after which execution of the VM(s) on the processor unit 110 a - 110 d may start or resume.
  • the hardware architecture of the cache controller 122 is adapted to traverse the cache 120 a - 120 d , inspect the VM image modification bit flags 1217 , write the memory addresses of the cache lines 1214 and the thread ID 1218 to the log 200 of the cache lines 1214 that have a VM image modification flag 1217 set to true, and to clear the VM modifications flags 1217 once the corresponding memory addresses have been written to the log 200 .
  • the cache controller 122 performs these operations upon the temporary suspension of a VM by the hypervisor of its processor unit 110 a - 110 d to facilitate the replication of the VM image and in response to a signal from the processor unit 110 a - 110 d requesting that the memory addresses in the tags 1212 of the modified cache lines 1214 should be made available for replication of the VM image.
  • FIG. 3A shows an arrangement of registers 210 for a processor unit 110 a - 110 d that supports four hardware threads in which log records are emitted to a single log 200 , with each record being tagged with the thread ID 1218 .
  • The per-hardware-thread processor privilege register, which indicates whether a hardware thread is running in hypervisor mode or not, is not shown. Since the address 204 stored in the log 200 is the address of a cache line, any given cache line address can be represented in 64 bits with the least-significant bits spare to contain the thread ID, so a log record can be wholly contained within 64 bits. As described above, cast-outs, snoop interventions and cache clean operations will emit all cache lines with the image modification flag 1217 set to the in-memory log, with the log 200 containing the thread ID and address of the entry.
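  • The 64-bit log record mentioned above might be packed and unpacked as in the following C sketch, which assumes a 128-byte cache line (leaving the low seven address bits spare) and a two-bit thread ID for four hardware threads; the macro and function names are illustrative only.

```c
#include <stdint.h>

#define LINE_SHIFT 7u        /* log2 of the assumed 128-byte cache-line size */
#define TID_MASK   0x3ull    /* two bits: up to four hardware threads        */

/* Pack a cache-line address and the issuing hardware thread ID into one
 * 64-bit log record, using the spare least-significant address bits. */
static uint64_t pack_log_record(uint64_t line_address, unsigned thread_id)
{
    return (line_address & ~((1ull << LINE_SHIFT) - 1)) |
           ((uint64_t)thread_id & TID_MASK);
}

static uint64_t record_address(uint64_t record)
{
    return record & ~((1ull << LINE_SHIFT) - 1);
}

static unsigned record_thread_id(uint64_t record)
{
    return (unsigned)(record & TID_MASK);
}
```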
  • Cache lines that fit the criteria (backup enabled for the hardware thread, and not running in hypervisor privileged mode) will be marked in the cache with the image modification flag 1217 set and the thread ID indicated, and on cast-out, snoop intervention or cache clean will be written out to one of four logs, with the destination in memory identified by first examining the thread ID associated with that cache line, and then writing the cache line address to the address specified by the producer head register of the appropriate hardware thread.
  • any change to the hardware thread-to-VM assignment would require a cache-clean operation to ensure that any image modification flag data for the virtual machine that was previously running on the hardware thread had been pushed out to the log 200 prior to the switch taking place, and the hypervisor should note at which point in the log the virtual machine switched from one to another, so that the processor unit 110 a - 110 d is able to communicate these memory changes to the secondary host in terms of the virtual machine that has undergone modification, rather than the hardware thread that caused the modification.
  • the cache clean operation could be extended to only target specific thread IDs, allowing the operation to selectively clean only the cache lines associated with hardware threads that are being reassigned to another virtual machine. This would reduce the number of unnecessary log entries that were produced if, for example, three hardware threads were running code for virtual machine 0 , and a fourth running code for virtual machine 1 .
  • A reassignment to have the fourth hardware thread run code for virtual machine 2 only requires that the cache lines associated with the fourth hardware thread have been written to the in-memory buffer before it can start executing code for virtual machine 2 .
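  • A possible C sketch of such a selective cache clean: only rows whose thread ID matches the hardware thread being reassigned are emitted to the log and have their image modification flag cleared. The structure and the emit_to_log() helper are assumptions standing in for the cast-out path described earlier.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct clean_row {
    uint64_t tag;             /* address of the cached line       */
    bool     image_modified;  /* image modification flag          */
    uint8_t  thread_id;       /* hardware thread that modified it */
};

/* Stand-in for the cast-out path that writes the line's address to the log
 * associated with the given hardware thread. */
extern void emit_to_log(uint64_t tag, uint8_t thread_id);

static void selective_cache_clean(struct clean_row *rows, size_t nrows,
                                  uint8_t reassigned_thread)
{
    for (size_t i = 0; i < nrows; i++) {
        if (rows[i].image_modified &&
            rows[i].thread_id == reassigned_thread) {
            emit_to_log(rows[i].tag, rows[i].thread_id);
            rows[i].image_modified = false;
        }
    }
}
```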
  • FIG. 4 shows a flowchart of an example embodiment of such an updating method.
  • the replication manager creates the log 200 in the system memory 140 in step 410 and stores the relevant values of the base address, initial address (starting point), next available address and log size in the registers 212 , 214 , 216 and 218 , as previously explained.
  • the thread mask 220 is also populated, indicating which threads being executed by the processor unit 110 a - 110 d relate to virtual machines being backed up.
  • the cache controller 122 subsequently monitors and handles in step 420 accesses to the cache lines in the line memory 124 a - 124 d of the cache 120 a - 120 d by the processor unit 110 a - 110 d (or any other processor unit).
  • the cache controller 122 performs a number of checks in step 420 , which checks have been identified in FIG. 4 as steps 420 ′, 420 ″ and 420 ′′′ respectively.
  • In step 420 ′, the cache controller 122 checks if the cache line access has caused a modification of the accessed cache line, in which case the cache controller sets the flag 1216 signalling the cache line as being dirty, as is well known per se.
  • the method proceeds from step 420 ′ to step 425 , in which the cache controller 122 further checks if such a dirty cache line has been generated during the execution of a VM that is being backed up, via reference to the thread mask 220 .
  • If so, the cache controller 122 also sets the VM image modification flag 1217 in step 430 , signalling the cache line as being a dirty cache line belonging to a VM image to be backed up, before returning to step 420 . Any hypervisor actions in privilege mode also do not result in the image modification flag 1217 being set.
  • If the cache access does not lead to the modification of a cache line but instead causes the eviction of a cache line from the cache 120 a - 120 d , as checked in step 420 ″, the method proceeds from step 420 ″ to step 435 , in which the cache controller 122 checks if the cache line to be evicted from the cache 120 a - 120 d is flagged as being modified by the VM, i.e. checks if the VM image modification flag 1217 of the cache line to be evicted is set to true.
  • If so, the cache controller 122 , for example using the cast-out engine or the snoop-intervention engine, writes the memory address of the evicted cache line to the log 200 in step 440 so that this modification is captured in the log 200 , after which the method returns to step 420 .
  • The cache controller 122 also checks, in step 420 ′′′, if the cache access request is a request to generate a VM checkpoint.
  • Such a request may originate from the replication manager of the processor unit 110 a - 110 d hosting the VM, or may originate from a replication manager of another processor unit responsible for replicating the changes to the primary VM image during the execution of the VM in a secondary VM image.
  • Step 420 ′′′ occurs periodically, at regular intervals such as every 25 ms, so that the secondary VM image is regularly updated. Any suitable checkpoint generation frequency may be chosen.
  • The checks 420 ′, 420 ″ and 420 ′′′ are shown as a sequence of steps for the sake of clarity only. It should be understood that the cache controller 122 does not have to perform each of these checks to decide what course of action should be taken next. It is for instance equally feasible that the cache controller 122 may immediately recognize that a cache line eviction or a VM image replication is required, in which case the cache controller 122 may proceed from step 420 directly to step 435 or step 460 respectively.
  • Upon detecting the checkpoint generation instruction in step 420 ′′′, the cache controller 122 traverses the cache 120 a - 120 d and inspects in step 460 the VM image modification flag 1217 of all cache rows 1210 that comprise such a flag. Upon detection of a VM image modification flag 1217 set to true, the cache controller retrieves the memory address of the associated cache line 1214 from the tag 1212 and writes the retrieved memory address into the log 200 in step 470 . To this end, the cache controller 122 retrieves the pointer to the next available address in the log 200 from the register 214 , for example by fetching this pointer or requesting it from the replication manager of the processor unit 110 a - 110 d.
  • this updating step comprises moving the pointer forward by offsetting the pointer presently stored in the register 214 with the size of the stored memory address and writing this offset value in the register 214 .
  • the cache controller 122 or the replication manager of the processor unit 110 a - 110 d will check if the next available address equals the base address plus the size of the log 200 as this indicates that the boundary of the address range of the log 200 in the system memory 140 has been reached. If this is the case, the cache controller 122 or the replication manager of the processor unit 110 a - 110 d will set, i.e. wrap around, the next available address to the base address.
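  • The checkpoint traversal and pointer handling of steps 460 to 480 might look as follows in C; the one-record-per-address layout, the structure and the function names are assumptions for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ckpt_row {
    uint64_t tag;             /* 1212: memory address of the cache line */
    bool     image_modified;  /* 1217: VM image modification flag       */
};

/* Write one address at the next available location (register 214), advance
 * the pointer by one record, and wrap around to the base address
 * (register 212) when base + size (register 218) is reached. */
static void log_append(uint64_t **next_avail, uint64_t *base,
                       uint64_t log_size_bytes, uint64_t address)
{
    **next_avail = address;
    (*next_avail)++;
    if ((uint64_t)(*next_avail - base) * sizeof(uint64_t) >= log_size_bytes)
        *next_avail = base;
}

/* Traverse the cache rows, log the address of every row whose image
 * modification flag is set, and clear the flag afterwards. */
static void checkpoint_traverse(struct ckpt_row *rows, size_t nrows,
                                uint64_t **next_avail, uint64_t *base,
                                uint64_t log_size_bytes)
{
    for (size_t i = 0; i < nrows; i++) {
        if (rows[i].image_modified) {
            log_append(next_avail, base, log_size_bytes, rows[i].tag);
            rows[i].image_modified = false;
        }
    }
}
```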
  • Step 480 may be executed at any suitable point in time, for example after each write action to the log 200 , or after all write actions to the log 200 have been completed.
  • any suitable cache architecture may be used for the cache 120 a - 120 d .
  • Such architectures may include different types of caches, such as a combination of a write-through cache and one or more write-back caches.
  • a write-through cache retains data in the cache and at the same time, synchronously, pushes the data into a next level of the cache. This provides fast access times for subsequent reads of the cache lines 1214 by the processor unit 110 a - 110 d at the cost of slower write actions, as the writer has to wait for the acknowledgement that the write action has been completed in the (slower) next level cache.
  • a write-through cache does not contain dirty cache lines, as the cache lines are ‘cleaned up’ in one of the next level caches.
  • the VM image modification flags 1217 may be omitted from the write-through cache and may be added to only those caches that can contain dirty cache lines, that is the write-back caches that do not push modified cache lines to a next level cache but are responsible for managing data coherency between caches and memory 140 as a consequence.
  • Step 460 is typically applied to all caches in the cache architecture that have cache rows 1210 containing the VM image modification flag 1217 , therefore all write-back caches.
  • the replication manager may trigger the replication of the VM image in memory 140 to another memory location, such as another memory or cache, by accessing the log 200 , fetching the addresses stored in the log 200 , fetching the cache lines stored at the fetched addresses and updating a copy of the VM image in the other memory location with the fetched cache lines, as previously explained.
  • the replication manager triggering the flush of the cache line addresses and the subsequent update of the secondary image of the VM does not have to be the replication manager of the processor unit 110 a - 110 d running the VM.
  • the replication manager of another processor unit 110 a - 110 d of the computer system 100 may be in charge of this update process.
  • modified cache lines may be pulled from their primary memory location by a processor unit on a separate computer system, such as a processor unit responsible for hosting a secondary version of the VM, i.e. a processor unit to which the VM fails over, for example in case of a hardware failure of the processor unit hosting the primary VM.
  • the processor unit 110 a - 110 d hosting the VM forwards data relevant to the replication of its VM image in memory 140 including the values stored in the registers 212 , 214 , 216 and 218 to the replication manager of another processor unit, for example another processor unit in a different computer system, to allow this further replication manager to retrieve the altered cache lines using the addresses in the log 200 , as will be explained in more detail later.
  • the method may further comprise the optional step of deduplicating addresses in the log 200 to remove multiple instances of the same address in the log 200 . This for instance can occur if the frequency at which memory addresses are written to the log 200 is higher than the frequency at which the memory addresses in the log 200 are used to update a secondary VM image.
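  • One possible way to perform this optional deduplication, sketched in C under the assumption that the unprocessed log entries have first been copied out of the circular buffer into a flat array:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the logged addresses and drop repeats so that each modified cache
 * line is copied to the secondary image only once; returns the number of
 * unique addresses remaining at the front of the array. */
static size_t dedup_log(uint64_t *records, size_t count)
{
    if (count == 0)
        return 0;
    qsort(records, count, sizeof(records[0]), cmp_u64);
    size_t out = 1;
    for (size_t i = 1; i < count; i++)
        if (records[i] != records[out - 1])
            records[out++] = records[i];
    return out;
}
```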
  • a primary VM is hosted by a single processor unit 110 a - 110 d . It is emphasized that this is by way of non-limiting example only. It is for instance equally feasible that a VM is hosted by several processor units 110 a - 110 d , for example several microprocessor cores, in which case several logs 200 (one for each core) may be maintained that track different modifications to the VM image in memory 140 .
  • the optional deduplication step may for instance be performed over all logs 200 such that a memory address occurs only once in the combined logs 200 to reduce the amount of data that needs to be copied to the secondary VM during a differential checkpoint generation.
  • the checkpoint generation may further require synchronization of other relevant states between the primary and secondary VMs, for example the state of the CPU, I/O involving disk(s) and network and so on.
  • the flowchart of FIG. 4 describes an example embodiment of a first operating mode of a processor unit 110 a - 110 d , which may be referred to as a producer mode in which the processor unit 110 a - 110 d produces the relevant data required for the replication of the image of the VM in the memory 140 to a copy of this image in, for example, the memory of another computer system.
  • a processor unit 110 a - 110 d can also operate in a second operating mode, in which it does not host a VM but is instead responsible for replicating the image of a primary VM.
  • This second operating mode may be referred to as a consumer mode, as a processor unit 110 a - 110 d in this mode is adapted to consume the modified cache lines in the VM image produced by a processor unit 110 a - 110 d executing the VM in its first operation mode or producer mode.
  • A further processor unit 110 a - 110 d of the computer system 100 that includes the processor unit 110 a - 110 d hosting the VM may be responsible for updating a replica of the VM image in a further location, for example, a memory of another computer system.
  • the processor unit 110 a - 110 d hosting the VM may switch between operating modes to assume responsibility for updating this replica.
  • a processor unit of another computer system, for example the computer system on which the replica is stored, is responsible for updating this replica of the VM image.
  • the update of the VM image replica ensures that a processor unit 110 a - 110 d of a computer system 100 storing the replica in its memory can take over execution of the VM upon a hardware failure in the computer system 100 hosting the primary VM, leading to the termination of the execution of the primary VM on this system.
  • the second operating mode is not a separate operating mode but forms part of the first operating mode, in which case the processor unit 110 a - 110 d responsible for the execution of the primary VM also is responsible for updating the replica of the VM in the further memory location.
  • processor units 110 a - 110 d may be in producer mode (i.e. VM hosting mode) whilst other processor units 110 a - 110 d are in consumer mode (i.e. in VM image replication mode).
  • a single computer system in such a cluster may comprise processor units 110 a - 110 d in producer mode as well as in consumer mode, as previously explained.
  • the replication manager may control whether a processor unit 110 a - 110 d is in producer mode or consumer mode, for example by setting a hardware flag for the processor unit 110 a - 110 d such that it can be recognized in which mode a processor unit 110 a - 110 d is operating.
  • FIG. 5 depicts a flowchart of the method steps performed during such a second operating mode of a processor unit 110 a - 110 d .
  • a processor unit 110 a - 110 d for example the replication manager of the processor unit 110 a - 110 d , receives the relevant information from the replication manager of the processor unit 110 a - 110 d in producer mode, such as the contents of the registers 212 , 214 , 216 and 218 that will allow the replication manager of the consumer processor unit 110 a - 110 d to access the memory 140 of the computer system 100 including the producer processor unit 110 a - 110 d .
  • the replication manager of the producer processor unit 110 a - 110 d may volunteer the relevant information or may provide the relevant information upon a request thereto by the replication manager of the consumer processor unit 110 a - 110 d .
  • If the processor unit 110 a-110 d hosting the VM also acts as the processor unit responsible for updating the secondary VM image, the above step may be omitted.
  • Upon retrieving the relevant information, the consumer processor unit 110 a-110 d retrieves, in step 510, the memory addresses stored in the log 200 created by the replication manager of the producer processor unit 110 a-110 d hosting the primary VM, and obtains the modified cache lines identified by those memory addresses in step 520. To this end, the consumer processor unit may send a data retrieval request over the bus architecture 130.
  • Such requests are noticed by the cache controllers 122 of the computer system 100 , for example by the snoop-intervention engines of the cache controllers 122 , which will fetch the cache line 1214 from the cache 120 a - 120 d if the memory address in the data retrieval request matches a memory address in one of the tags 1212 of the cache rows 1210 of the cache 120 a - 120 d .
  • the requesting processor unit 110 a - 110 d will typically await the response from a cache controller 122 of a further processor unit 110 a - 110 d for a defined period of time, after which the cache controller 122 of the requesting processor unit 110 a - 110 d will fetch the cache line from the memory 140 , as a non-response from the other cache controllers 122 will mean that the cache line 1214 no longer resides in cache but has been cast from the cache 120 a - 120 d instead.
  • the handling of such data retrieval requests in a computer system 100 comprising multiple processor units 110 a - 110 d and caches 120 a - 120 d may be accomplished using any suitable data retrieval protocol.
  • the consumer processor unit 110 a - 110 d subsequently updates the copy of the VM image accordingly in step 530 by inserting the obtained modified cache line 1214 in the appropriate location of the VM image copy. This process is repeated until all addresses have been retrieved from the log 200 as checked in step 540 , after which other state registers, if any, for example state registers of the CPU as previously explained, may be replicated as shown in step 550 .
  • the consumer processor unit 110 a-110 d may signal the producer processor unit 110 a-110 d hosting the primary VM that replication is complete, upon which the producer processor unit 110 a-110 d hosting the primary VM, for example its hypervisor, will terminate the suspension of the primary VM and reinitialize the log 200, resetting one or more of the registers 212, 214 and 216 in the cache controller 122.
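  • A minimal sketch of the consumer-mode loop of FIG. 5 is given below in C. It assumes the log 200 is visible to the consumer as an array of 64-bit addresses described by the values of registers 212, 214, 216 and 218; the helper functions (read_log_entry, fetch_cache_line, replicate_other_state, signal_replication_complete) are hypothetical stand-ins for the bus and coherence operations described above, and a 128-byte cache line is assumed.

      #include <stdint.h>
      #include <stddef.h>
      #include <string.h>

      #define CACHE_LINE_SIZE 128          /* assumed line size */

      /* Producer-side log description as received by the consumer. */
      struct log_view {
          uint64_t base;   /* register 212: base address of the buffer   */
          uint64_t head;   /* register 214: next address to be written   */
          uint64_t tail;   /* register 216: first unprocessed entry      */
          uint64_t size;   /* register 218: buffer size in bytes         */
      };

      extern uint64_t read_log_entry(uint64_t entry_addr);      /* step 510 */
      extern void fetch_cache_line(uint64_t addr, void *buf);   /* step 520 */
      extern void replicate_other_state(void);                  /* step 550 */
      extern void signal_replication_complete(void);

      void consume_log(struct log_view *lv, uint8_t *secondary_image,
                       uint64_t image_base)
      {
          uint8_t line[CACHE_LINE_SIZE];

          while (lv->tail != lv->head) {                        /* step 540 */
              uint64_t addr = read_log_entry(lv->tail);         /* step 510 */
              fetch_cache_line(addr, line);                     /* step 520 */
              /* Step 530: insert the line at the matching offset of the
               * secondary VM image. */
              memcpy(secondary_image + (addr - image_base), line,
                     CACHE_LINE_SIZE);
              lv->tail += sizeof(uint64_t);
              if (lv->tail == lv->base + lv->size)              /* wrap     */
                  lv->tail = lv->base;
          }
          replicate_other_state();                              /* step 550 */
          signal_replication_complete();
      }
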
  • the consumer processor unit 110 a - 110 d may have permission to deduplicate the addresses in the log 200 of the producer processor unit 110 a - 110 d hosting the primary VM prior to retrieving the memory addresses from the log 200 in step 510 .
  • a processor unit 110 a - 110 d in the second operating mode is adapted to speculatively process the log 200 of a processor unit 110 a - 110 d in the first operating mode, i.e. producer mode.
  • This implementation is for instance useful when the consumer processor unit does not trigger the cache controller 122 of the producer processor unit to write the modified cache line addresses to the log 200 , for example in case the producer processor unit hosting the VM periodically triggers the update of the log 200 .
  • An example flowchart of this implementation is shown in FIG. 6.
  • the consumer processor unit 110 a - 110 d retrieves a memory address from the log 200 of the processor unit 110 a - 110 d hosting the primary VM, retrieves the data from the memory 140 in the computer system 100 of the producer processor unit 110 a - 110 d and updates the secondary VM image as previously explained.
  • the consumer processor unit 110 a - 110 d invokes the update of the initial address value of the log 200 as stored in register 216 associated with the producer processor unit 110 a - 110 d hosting the primary VM. This may be achieved in any suitable way, for example by providing the replication manager of the consumer processor unit 110 a - 110 d with write privileges to update this register or by the consumer processor unit 110 a - 110 d instructing the replication manager of the producer processor element 110 a - 110 d to update this register value accordingly.
  • Step 610 ensures that the available space in the log 200 of the processor unit 110 a - 110 d hosting the primary VM is kept up-to-date, as the addresses already retrieved by the consumer processor unit 110 a - 110 d may be overwritten, as indicated by the change in the initial address stored in the register 216 associated with the producer processor unit 110 a - 110 d hosting the primary VM to the first address in the log 200 not yet processed by the consumer processor unit 110 a - 110 d .
  • the method may proceed to step 550 as previously explained in the detailed description of FIG. 5 .
  • step 610 may be omitted from the process of FIG. 6 , as it is no longer necessary to update the initial address value of the log 200 as stored in register 216 associated with the producer processor unit 110 a - 110 d hosting the primary VM, as no further addresses will be written to the log 200 and the log 200 will be re-initialized prior to the reactivation of the primary VM.
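  • The speculative variant of FIG. 6 differs from the previous sketch only in that, after each consumed entry, the consumer advances the producer's register 216 (step 610) so that the producer regains log space. The write_producer_reg216 helper below is a hypothetical stand-in for either direct write access to that register or a request to the producer's replication manager, as described above.

      #include <stdint.h>

      struct log_view { uint64_t base, head, tail, size; };

      extern uint64_t read_log_entry(uint64_t entry_addr);
      extern void replicate_line(uint64_t addr);         /* fetch + copy    */
      extern void write_producer_reg216(uint64_t value); /* step 610 helper */

      void consume_log_speculative(struct log_view *lv)
      {
          while (lv->tail != lv->head) {
              uint64_t addr = read_log_entry(lv->tail);
              replicate_line(addr);
              lv->tail += sizeof(uint64_t);
              if (lv->tail == lv->base + lv->size)
                  lv->tail = lv->base;
              /* Step 610: publish the new start of the unprocessed entries
               * so that the already-consumed slots may be overwritten. */
              write_producer_reg216(lv->tail);
          }
      }
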
  • FIG. 7 schematically depicts a computer cluster 700 that comprises a plurality of computer systems 100 , which are communicatively coupled to each other via a network 720 .
  • the network 720 may be any suitable data communication network, for example a wired or wireless local area network, a wireless or wired wide area network, the Internet and so on.
  • the computer cluster 700 is typically adapted to host a plurality of virtual machines on the processor units 110 a - 110 d of the various computer systems 100 to be utilized by the users of the computer cluster 700 .
  • the computer cluster 700 benefits from the VM replication principles described above in that multiple up-to-date or mirror images of a VM may be generated in the respective memories 140 of at least some of the various computer systems 100 such that rapid VM failover can be provided with little overhead.
  • the above description describes modifying the cache hardware so that at regular intervals the circular buffer 200 in memory contains a list of all memory locations that have been modified by a given processor core since the last checkpoint. This is achieved by modifying the cast-out engine and snoop-intervention engine to store in the log 200 the memory addresses leaving the cache between checkpoints, and by initiating a cache flush at a checkpoint to ensure that no modified data remains in the cache (thereby ensuring that dirty cache lines pass through the cast-out engine and thus are logged). If the circular buffer 200 becomes full, a full re-sync of memory must occur, or an immediate failover to the secondary system. This problem is addressed by ensuring that there is always sufficient space in the circular buffer 200 to accept any dirty data in the cache.
  • the circular buffer 200 contains some unprocessed log entries
  • the producer core writes new entries to the log 200
  • the registers indicate the location of the start and end of the unprocessed log entries.
  • when the log entries reach the end of the buffer, they wrap around to the beginning.
  • as the consumer core processes entries, the “Unprocessed log entries start here” register is updated, as shown in FIG. 8 b. If the consumer core is unable to process the entries with sufficient speed, the processor core's entries can collide with the unprocessed log entries and this is the point at which a re-sync or failover must occur.
  • a guard band is defined, which is the available space between the current location to which new logs are written and the start of the unprocessed entries, as shown in FIG. 8 c. If the head of the log entries moves to within the guard band, an interrupt is triggered.
  • the size of the guard band may be static or dynamic in nature.
  • the guard band is large enough to contain all the data that might be emitted as part of a checkpoint. This means that when an interrupt is delivered on entry to the guard band, execution of the producer core can be halted and a cache flush initiated. At this point, all of the required log entries are in the circular buffer, and the producer core can be resumed once the consumer core has processed enough log entries to clear the backlog. This avoids the need to do a full memory re-sync or failover in the event that the consumer core is unable to keep up with the producer core.
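  • A hedged sketch of how the producer-side cache might perform this guard-band check each time it appends an address to the circular log is given below; the register names mirror registers 212, 214, 216 and 218, while the guard_band field, raise_guard_band_interrupt and store_log_entry are illustrative assumptions.

      #include <stdint.h>
      #include <stdbool.h>

      struct log_regs {
          uint64_t base;       /* register 212: base address of the buffer   */
          uint64_t producer;   /* register 214: next address to be written   */
          uint64_t consumer;   /* register 216: start of unprocessed entries */
          uint64_t size;       /* register 218: buffer size in bytes         */
          uint64_t guard_band; /* minimum free space before interrupting     */
      };

      /* Free space between the producer head and the unprocessed tail. */
      static uint64_t log_free_space(const struct log_regs *r)
      {
          if (r->producer >= r->consumer)
              return r->size - (r->producer - r->consumer);
          return r->consumer - r->producer;
      }

      extern void store_log_entry(uint64_t where, uint64_t addr);
      extern void raise_guard_band_interrupt(void);  /* halt producer, flush */

      bool append_log_entry(struct log_regs *r, uint64_t cache_line_addr)
      {
          store_log_entry(r->producer, cache_line_addr);
          r->producer += sizeof(uint64_t);
          if (r->producer == r->base + r->size)      /* wrap around          */
              r->producer = r->base;
          if (log_free_space(r) < r->guard_band) {   /* entered guard band   */
              raise_guard_band_interrupt();
              return false;
          }
          return true;
      }
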
  • the guard band can be sized statically based on the worst-case possibility at the point where the guard band is reached: it is assumed that all logged caches are full of dirty data; that all instructions in the CPU pipeline that have been issued but have not yet completed are “store”-type instructions, each of which will push out a dirty cache line (and thus emit a log entry) and create a new dirty cache line; and that, in the time it takes for the interrupt to be delivered from the producer cache to the producer core, a certain number of new instructions will be issued, each of which is also a “store”-type operation that will push out a dirty cache line (and thus emit a log entry) and create a new dirty cache line.
  • the required guard band size is therefore the sum of three components: a CACHE component equal to the total number of cache lines in all logged caches, each of which could be dirty and therefore emit a log entry during the checkpoint cache flush; a PIPELINE component equal to the maximum number of instructions that have been issued but not yet completed, each of which could push out a dirty cache line and thus emit a log entry; and an INTERRUPT component equal to the number of new instructions that can be issued while the interrupt is being delivered to the producer core, each of which could likewise emit a log entry.
  • the cache-size related elements can be computed based on the number of dirty cache lines currently in the cache, rather than the worst-case number. This means that the size of the guard band can vary dynamically based on the number of log entries that would be emitted during the cache flush operation at the checkpoint. This is trivial to maintain within the cache, which is responsible for both tracking the fullness of the cache and also ensuring that the guard band is not reached. The PIPELINE and INTERRUPT portions of the calculation would remain constant.
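  • The static worst-case sizing and the dynamic refinement described above can be expressed as follows; the parameter names are assumptions made for this sketch, and all quantities are counted in log entries (one entry per cache-line address).

      #include <stdint.h>

      struct guard_band_params {
          uint64_t logged_cache_lines;    /* total lines in all logged caches */
          uint64_t dirty_cache_lines;     /* lines currently marked dirty     */
          uint64_t pipeline_depth;        /* issued but uncompleted instrs    */
          uint64_t interrupt_delay_instr; /* instrs issuable during delivery  */
      };

      /* Static sizing: assume every logged cache line is dirty and every
       * in-flight or newly issued instruction is a store that casts out a
       * dirty line. */
      uint64_t guard_band_static(const struct guard_band_params *p)
      {
          return p->logged_cache_lines      /* CACHE component     */
               + p->pipeline_depth          /* PIPELINE component  */
               + p->interrupt_delay_instr;  /* INTERRUPT component */
      }

      /* Dynamic sizing: the cache component is replaced by the number of
       * lines actually dirty right now; the PIPELINE and INTERRUPT
       * components remain constant. */
      uint64_t guard_band_dynamic(const struct guard_band_params *p)
      {
          return p->dirty_cache_lines
               + p->pipeline_depth
               + p->interrupt_delay_instr;
      }
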
  • a computer system is to be interpreted as a device that includes a collection of processor elements that can be utilized in unison. This does not necessarily equate to a single physical entity; it is equally feasible that a computer system is distributed over several physical entities, for example different boxes, or that a single physical entity includes more than one computer system, for example several separate groups of processor units.
  • aspects of the present inventive subject matter may be embodied as a system, method or computer program product. Accordingly, aspects of the present inventive subject matter may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present inventive subject matter may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present inventive subject matter may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.


Abstract

A virtual machine backup method includes utilizing a log to indicate updates to memory of a virtual machine when the updates are evicted from a cache of the virtual machine. A guard band is determined that indicates a threshold amount of free space for the log. A determination is made that the guard band will be or has been encroached upon corresponding to indicating an update in the log. A backup image of the virtual machine is updated based, at least in part, on a set of one or more entries of the log, wherein the set of entries is sufficient to comply with the guard band. The set of entries is removed from the log.

Description

    RELATED APPLICATIONS
  • This application is a Continuation of and claims the priority benefit of U.S. application Ser. No. 14/548,624, filed Nov. 20, 2014, which claims priority under 35 U.S.C. §119 from United Kingdom Patent Application No. 1320537.2, filed on Nov. 21, 2013, which is incorporated by reference in its entirety.
  • FIELD OF THE INVENTION
  • Embodiments of the inventive subject matter generally relate to the field of virtual machines and, more particularly, to hypervisors supporting one or more virtual machines.
  • BACKGROUND
  • Virtualization is commonly applied on computer systems to improve the robustness of the implemented computing architecture to faults and to increase utilization of the resources of the architecture. In a virtualized architecture, one or more processor units, for example processors and/or processor cores, of the computer system act as the physical hosts of virtual machines (VMs), which are seen by the outside world as independent entities. This facilitates robustness of the architecture to hardware failures, as upon a hardware failure, a VM previously hosted by the failed hardware may be passed to another host, without the user of the virtual machine becoming aware of the hardware failure. This concept is an important facilitator of a “high availability” service provided by such a VM.
  • Implementing a switch between two different hardware resources as a result of a failure is not a trivial task, as the VM ideally should be relaunched in a state that is identical to the state of the VM at the point of the hardware failure, in order to avoid inconvenience to the current user of the VM. In one approach, this is provided by running multiple copies of a single VM in lock-step on different entities, for example on different physical servers, such that upon the failure of one entity another entity can take over the responsibility for hosting the VM. A significant drawback of such lock-step arrangements is that processing resources are consumed by a failover copy of a VM, thus reducing the available bandwidth of the system, therefore reducing the total number of VMs that can be hosted by a system. In another approach, a physical host responds to a failure of another physical host by simply rebooting the VM from a shared disk state, for example a shared image of the VM. This however increases the risk of disk corruption and the loss of the exposed state of the VM altogether.
  • In a different failover approach, all VM memory is periodically marked as read only to allow for changes to the VM memory to be replicated in a copy of the VM memory on another host. In this read-only state, a hypervisor is able to trap all writes that a VM makes to memory and maintain a map of pages that have been dirtied since the previous round. Each round, the migration process atomically reads and resets this map, and the iterative migration process involves chasing dirty pages until progress can no longer be made. This approach improves failover robustness because a separate up-to-date image of the VM memory is periodically created on a backup host that can simply launch a replica of the VM using this image following a hardware failure of the primary host.
  • However, a drawback of this approach is that, because the VM remains operational while its memory is in the read-only state, a large number of page faults can be generated. In addition, this approach does not allow for easy detection of which portion of a page has been altered, so whole pages must be replicated even if only a single bit on the page has changed. This is detrimental to the performance of the overall architecture: small page sizes have to be used to avoid excessive data traffic between systems, which in turn reduces the performance of the operating system, as the operating system is unable to use large pages.
  • Another failover approach discloses a digital computer memory cache organization implementing efficient selective cache write-back, mapping and transferring of data for the purpose of roll-back and roll-forward of, for example, databases. Write or store operations to cache lines tagged as logged are written through to a log block builder associated with the cache. Non-logged store operations are handled local to the cache, as in a write-back cache. The log block builder combines write operations into data blocks and transfers the data blocks to a log splitter. A log splitter demultiplexes the logged data into separate streams based on address.
  • In short, the above approaches are not without problems. For instance, during suspension of the VM, the cache is sensitive to page faults as the cache is put into a read-only state. Furthermore, large amounts of data may have to be stored for each checkpoint, which causes pressure on the resource utilization of the computing architecture, in particular the data storage facilities of the architecture.
  • BRIEF SUMMARY OF THE INVENTION
  • Embodiments generally include a method that includes indicating, in a log, updates to memory of a virtual machine when the updates are evicted from a cache of the virtual machine. The method further includes determining a guard band for the log. The guard band indicates a threshold amount of free space for the log. The method further includes determining that the guard band will be or has been encroached upon corresponding to indicating an update in the log. The method further includes updating a backup image of the virtual machine based, at least in part, on a set of one or more entries of the log. The set of entries is sufficient to comply with the guard band. The method further includes removing the set of entries from the log.
  • Embodiments include a computer system comprising a processor unit arranged to run a hypervisor running one or more virtual machines; a cache connected to the processor unit and comprising a plurality of cache rows, each cache row comprising a memory address, a cache line and an image modification flag; and a memory connected to the cache and arranged to store an image of at least one virtual machine; wherein: the processor unit is arranged to define a log in the memory; and the cache further comprises a cache controller arranged to: set the image modification flag for a cache line modified by a virtual machine being backed up; periodically check the image modification flags; and write only the memory address of the flagged cache rows in the defined log; and the processor unit is further arranged to monitor the free space available in the defined log and to trigger an interrupt if the free space available falls below a specific amount.
  • Embodiments generally include a method of operating a computer system comprising a processor unit arranged to run a hypervisor running one or more virtual machines; a cache connected to the processor unit and comprising a plurality of cache rows, each cache row comprising a memory address, a cache line and an image modification flag; and a memory connected to the cache and arranged to store an image of at least one virtual machine; the method comprising the steps of defining a log in the memory; setting the image modification flag for a cache line modified by a virtual machine being backed up; periodically checking the image modification flags; writing only the memory address of the flagged cache rows in the defined log; monitoring the free space available in the defined log, and triggering an interrupt if the free space available falls below a specific amount.
  • In some embodiments, a hypervisor is arranged to host a VM as well as act as a VM image replication manager to create a replica of a VM image in another location, for example in the memory of another computer system. As all changes made to an image of an active VM by the processor unit hosting the VM will travel through its cache, it is possible to simply log the memory address associated with a dirty cache line. To this end, the cache rows include an image modification flag that signal the modification of a cache line by the execution of the VM, and therefore, signal a change to the VM image. Including an image modification flag in the cache row allows the memory addresses of the dirty cache lines to be written to the log without requiring the expulsion of the dirty cache lines from the cache at the same time.
  • Hence, the use of an image modification flag ensures that the memory addresses of the modified cache lines can be written to the log without at the same time requiring the cache lines to be flushed from the cache, which reduces the amount of data that needs to be transferred from the cache when updating the log. However, the image modification flag is only set if the change to a cache line is caused by a virtual machine operation that relates to a virtual machine being backed up. If the change to a cache line is caused by a virtual machine that has not been backed up or as the result of the hypervisor operating in privilege mode, then the image modification flag is not set. This reduces the amount of unnecessary data that is backed up at a checkpoint.
  • The log is a circular buffer that contains some unprocessed log entries. The producer core writes new entries to the log, and registers indicate where the start and end of the unprocessed log entries are. When the log entries reach the end of the buffer, they wrap-around to the beginning. As the consumer core processes entries, the “unprocessed log entries start here” register is updated. If the consumer core is unable to process the entries with sufficient speed, the processor core's entries can collide with the unprocessed log entries and this is the point at which a re-sync or failover must occur. A guard band is a space between the current location to which new logs are written and the start of the unprocessed entries. The processor unit is arranged to monitor the free space available in the log and to trigger an interrupt if the free space available falls below a specific amount (a guard band). If the head of the log entries moves to within the guard band, an interrupt is triggered. The size of the guard band may be static or dynamic in nature. The guard band should be large enough to contain all the data that might be emitted as part of a checkpoint. This means that when an interrupt is delivered on entry to the guard band, execution of the producer core can be halted and a cache flush initiated. At this point, all of the required log entries are in the circular buffer, and the producer core can be resumed once the consumer core has processed enough log entries to clear the backlog. This avoids the need to do a full memory re-sync or failover in the event that the consumer core is unable to keep up with the producer core.
  • The specific amount of minimum free space available in the log (the guard band which triggers the interrupt) comprises a predetermined amount derived from a sum of the write-back cache sizes, a component representing the number of instructions in the CPU pipeline that have been issued but not yet completed and a component representing the number of new instructions that will be issued in the time taken for an interrupt to be delivered to the processor core. This ensures that the space in the log is large enough to hold the worst-case scenario, which is essentially that all existing cache-lines are dirty, all pending instructions will create new dirty cache lines and all new instructions created while the interrupt is being delivered will also create new dirty cache lines.
  • The processor unit is arranged to run multiple execution threads, in a technique commonly referred to as “Simultaneous Multithreading (SMT).” The hypervisor is arranged to maintain a thread mask, flagging those threads that relate to one or more virtual machines being backed up. When setting the image modification flag for a cache line modified by a virtual machine being backed up, the hypervisor refers to the thread mask to determine whether to set the image modification flag for the current cache line being modified. Each cache row further comprises a thread ID indicating which execution thread is responsible for modification of the cache line in the respective cache row.
  • A single bitfield register, called a thread mask, is added to each processor unit, with a number of bits equal to the number of hardware threads supported by that unit, and hypervisor-privileged operations added to set those bits. The hypervisor (which knows which virtual machines are running on which hardware threads) sets the associated bits in the thread mask for the hardware threads that are running virtual machines that require checkpoint-based high-availability protection. A new field, thread ID, is added alongside the image modification flag on every cache line. The thread ID field is sufficiently large to contain the ID of the hardware thread that issued the store operation (i.e., two bits if four hardware threads are supported). When a store is performed, the image modification flag is set in the cache, only if the store was not executed when running in the hypervisor privilege mode and if the thread mask bit corresponding to the currently executing hardware thread is set. As well as setting the image modification flag, these store operations can also write the ID of the hardware thread that issued the store to the cache line's thread ID field. When cache lines are logged during a cast-out, snoop intervention or cache-clean operation, the contents of the thread ID field associated with the cache line are also written to the log. Alternatively, the log record is directed to a different log based on the value of the thread ID, with the processor core capable of storing position and size information for multiple logs. When this alternative is used, it is not necessary to write the thread ID field to the log.
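  • A minimal sketch of the store-side decision just described, assuming a simplified software model of a cache row: the image modification flag 1217 is set only when the store was not executed in hypervisor privilege mode and the thread-mask bit of the issuing hardware thread is set, and the thread ID field 1218 is written alongside it. The structure and function names are illustrative only.

      #include <stdint.h>
      #include <stdbool.h>

      struct cache_row {
          uint64_t tag;            /* memory address of the line */
          bool     valid;
          bool     dirty;
          bool     image_modified; /* flag 1217                  */
          uint8_t  thread_id;      /* field 1218                 */
      };

      void on_store(struct cache_row *row, uint8_t hw_thread,
                    bool hypervisor_mode, uint8_t thread_mask)
      {
          row->dirty = true;                                 /* flag 1216  */
          if (!hypervisor_mode && (thread_mask & (1u << hw_thread))) {
              row->image_modified = true;                    /* flag 1217  */
              row->thread_id = hw_thread;                    /* field 1218 */
          }
      }
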
  • The above aspects allow multiple virtual machines to execute on a single processor unit concurrently, with any number of them running with checkpoint-based high-availability protection. The presence of the thread ID in the logs, coupled with the hypervisor's record of which virtual machines are currently running on which processor cores and hardware threads, is sufficient to allow the secondary host (the memory location where the backup image is stored) to update the correct virtual machine memory image on receipt of the logs.
  • The cache controller typically is further adapted to write the memory address of a flagged cache line in the defined log upon the eviction of the flagged line from the cache. This captures flagged changes to the VM image that are no longer guaranteed to be present in the cache during the periodic inspection of the image modification tags.
  • The computer system is further arranged to update a backup image of the virtual machine in a different memory location by retrieving the memory addresses from the log; obtaining the modified cache lines using the retrieved memory addresses; and updating the further image with said modified cache lines. The logged memory addresses are used to copy the altered data of the primary image to the copy of the VM image, which copy may for instance be located on another computer system.
  • In this manner, VM images may be synchronized without incurring additional page faults, and the traffic between systems is reduced due to the smaller granularity of the data modification, i.e. cache line size rather than page size. Due to the fact that the VM is suspended during image replication, no page protection is necessary. This approach is furthermore page size-agnostic such that various page sizes can be used. Moreover, the additional hardware cost to the computer system is minimal; only minor changes to the cache controller, for example to the cast-out engine and the snoop-intervention engine of the cache controller, and to the cache rows of the cache are required to ensure that the cache controller periodically writes the memory addresses of the dirty cache lines to the log through periodic inspection of the image modification flag during execution of the VM.
  • The computer system may replicate data from the primary VM image to a copy in push or pull fashion. In a push implementation, a processor unit from the same computer system, for example the processor unit running the VM or a different processor unit, may be also responsible, under control of the hypervisor, for updating the copy of the image of the VM in the different memory location, which may be a memory location in the memory of the same computer system or a memory location in the memory of a different computer system. In a pull implementation, a processor unit of a different computer system may be adapted to update the copy of the VM image in the memory location on this different computer system by pulling the memory addresses and associated modified cache lines from the computer system hosting the VM.
  • The cache may include a write-back cache, which may form part of a multi-level cache further including a write-through cache adapted to write cache lines into the write-back cache, wherein only the cache rows of the write-back cache comprise the flag. As by definition the cache lines in a write-through cache cannot get dirty because cache line modifications are also copied to a write-back cache, only the write-back caches need inspecting when periodically writing the memory addresses to the log.
  • As mentioned above the log which stores the addresses of changed cache lines is a circular buffer and the system comprises a plurality of registers adapted to store a first pointer to a wrap-around address of the circular buffer, a second pointer to the next available address of the circular buffer, a third pointer to an initial address of the circular buffer, and the size of the circular buffer. The cache controller is adapted to update at least the second pointer following the writing of a memory address in the log.
  • Each processor unit is configured to deduplicate the memory addresses in the log prior to the retrieval of the addresses from the log. This reduces the amount of time required for synchronizing data between the memories by ensuring that the altered data in a logged memory location is copied once only. In this manner, the log is updated with the memory addresses of the modified cache lines without the need to flush the modified cache lines from the cache at the same time.
  • The processor unit typically further performs the step of writing the memory address of a flagged cache line in the defined log upon the eviction of said flagged line from the cache to capture flagged changes to the VM image that no longer are guaranteed to be present in the cache during the periodic inspection of the image modification tags.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the present inventive subject matter will now be described, by way of example, with reference to the following drawings, in which:
  • FIG. 1 schematically depicts a computer system according to an embodiment of the present inventive subject matter;
  • FIG. 2 schematically depicts an aspect of a computer system according to an embodiment of the present inventive subject matter in more detail;
  • FIG. 3A schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail;
  • FIG. 3B schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail;
  • FIG. 4A schematically depicts a first portion of a flowchart of an aspect of a method of updating a computer system according to an embodiment of the present inventive subject matter;
  • FIG. 4B schematically depicts a second portion of a flowchart of an aspect of a method of updating a computer system according to an embodiment of the present inventive subject matter;
  • FIG. 5 schematically depicts a flowchart of another aspect of a method of updating a computer system according to an embodiment of the present inventive subject matter;
  • FIG. 6 schematically depicts a flowchart of another aspect of a method of updating a computer system according to another embodiment of the present inventive subject matter;
  • FIG. 7 schematically depicts a computer cluster according to an embodiment of the present inventive subject matter; and
  • FIG. 8A schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail;
  • FIG. 8B schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail; and
  • FIG. 8C schematically depicts another aspect of a computer system according to an embodiment of the present inventive subject matter in more detail.
  • DETAILED DESCRIPTION OF THE EMBODIMENTS
  • FIG. 1 schematically depicts a computer system 100. The computer system 100 comprises a plurality of processor units 110 a-110 d for hosting one or more virtual machines. In FIG. 1, four processor units 110 a-110 d are shown by way of example; it should be understood that the computer system 100 may comprise any suitable number of processor units. A processor unit is a unit of hardware that is capable of (pseudo-) autonomous execution of a computer program code, such as a processor, microprocessor or a core of a processor or microprocessor comprising a plurality of such cores. Each processor unit 110 a-110 d can be arranged to run a hypervisor, which is a software component that enables the provision of the virtual machine(s) to external users.
  • Each processor unit 110 a-110 d further is connected to and has access to a cache 120 a-120 d, which comprises a cache controller 122 a-122 d in addition to a pool of entries 124 a-124 d, with each entry including a cache line and one or more tags. Any suitable cache architecture may be used, for example a single cache or several levels of cache, such as a level-1 cache, a level-2 cache and a level-3 cache or suitable subsets thereof. The cache 120 a-120 d may reside in any suitable location. For instance, the cache 120 may be located on or in the vicinity of the processor unit 110 to reduce data retrieval latency.
  • In the embodiment shown in FIG. 1, each processor unit 110 a-110 d has access to a dedicated cache 120 a-120 d. Four caches 120 a-120 d are shown by way of example, one for each of the respective processor units 110 a-110 d. However, it should be understood that any suitable configuration may be chosen, for example a configuration in which a processor unit 110 a-110 d has access to multiple caches 120 a-120 d, which may be organized in a hierarchical structure, for example a combination of a level-1, level-2 and level-3 cache.
  • Each processor unit 110 a-110 d is communicatively coupled to bus architecture 130 through its respective cache 120 a-120 d, at least at a functional level. This means that any access of data by a processor unit 110 a-110 d will involve its cache 120 a-120 d. Any suitable bus architecture 130 may be chosen.
  • The computer system 100 further comprises a memory 140 coupled to the bus architecture 130, which again may take any suitable form, for example a memory integrated in the computer system or a distributed memory accessible over a network. The memory 140 is connected to the caches 120 a-120 d. The memory 140 may be volatile memory or non-volatile memory. Many other suitable embodiments of a memory 140 are possible. Although not shown, the computer system 100 may comprise additional components such as one or more network interfaces, input ports, and output ports.
  • The computer system 100 is adapted to host one or more virtual machines on the processor units 110 a-110 d, through the use of a hypervisor. A VM is a software representation of a computing device capable of hosting anything from a single computer program to a complete operating system, and which may present itself as a separate system to the user of the computer system 100, such that the user has no awareness of the underlying computer system 100. For instance, in the case of the computer system 100 embodying a local area network (LAN) server having a plurality of processors each comprising a number of cores, the user accessing the LAN will be able to engage with the services hosted by the VMs but will be unaware of the underlying server.
  • One of the attractions of virtualization is improved robustness due to the ability to provide failover between VMs, which means that should a VM fail, a backup VM is available that will continue to provide the VM functionality to the user. To this end, a copy of a VM is periodically updated so that the copy represents the actual current state of the original VM in case the original VM exhibits a failure and will have to failover to the copy VM. The original VM will be referred to as the primary VM and its copy will be referred to as the secondary VM.
  • Such synchronization between the primary VM and the secondary VM requires the temporary suspension of the primary VM so that its state does not change during the synchronization. The duration of such suspension should be minimized such that the one or more users of the VM are not noticeably affected by the temporary suspension.
  • Typically, to avoid performance penalties, differential checkpoints which capture changes in the state of an entity since the last checkpoint are created. Such checkpoints may be generated by writing the address and data from a cache line to a secondary memory such as a level-2 cache or the system memory 140 as soon as the data in a cache line is altered. When using such checkpoint generation for VM replication purposes, it has the drawback that a large amount of data may be unnecessarily communicated during operation of the primary VM. For instance, if a cache line of the cache 120 a-120 d used by the primary VM is updated multiple times during the operation mode of the primary VM, previous versions of the data in the cache line are unnecessarily written to the secondary memory as this ‘old’ data has become redundant.
  • An example architecture of the data storage part 124 a-124 d of a cache 120 a-120 d is shown in FIG. 2. The data storage part 124 a-124 d comprises a plurality of cache rows 1210, with each cache row 1210 including a tag 1212 which is the address of the data in memory 140, a cache line 1214 and a number of flag bits. The flag bits comprise a valid bit 1215, which signals if the cache line 1214 is still relevant to the processor unit 110 a-110 d, a dirty bit 1216, which signals if the cache line 1214 has been altered such that it needs writing back to the address in memory 140 stored in the tag 1212, an image modification flag 1217 and a thread ID field 1218, which are described in more detail below.
  • The cache rows 1210 of a cache 120 a-120 d capable of containing dirty cache lines 1214 include the VM image modification bit flag 1217 that signals whether the cache line 1214 is modified by a processor unit 110 a-110 d executing a VM that is being backed up. In other words, this flag signals if the modified cache line 1214 forms part of a VM image for which a checkpoint based backup is operating. The cache controller 122 will set both the dirty bit flag 1216 and the VM image modification flag 1217 to true upon a write access of the cache line 1214 by the processor unit 110 a-110 d during the execution of a VM that is being backed up. The purpose of this will be explained in more detail below.
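  • As an illustrative software view only, the cache row of FIG. 2 might be modelled in C as follows; the field names and the 128-byte line size are assumptions, and the comments map each field to the reference numerals used above.

      #include <stdint.h>
      #include <stdbool.h>

      #define CACHE_LINE_SIZE 128

      struct cache_row {                  /* cache row 1210                  */
          uint64_t tag;                   /* tag 1212: address in memory 140 */
          uint8_t  line[CACHE_LINE_SIZE]; /* cache line 1214                 */
          bool     valid;                 /* valid bit 1215                  */
          bool     dirty;                 /* dirty bit 1216                  */
          bool     image_modified;        /* image modification flag 1217    */
          uint8_t  thread_id;             /* thread ID field 1218            */
      };
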
  • The processor unit 110 a-110 d hosting a primary VM may include a replication manager, which may be included in the design of the hypervisor, and/or which may be realized in hardware, in software, or a combination of hardware and software. The replication manager is adapted to create a log in the system memory 140 for logging the memory addresses of the cache lines 1214 modified during the execution of the VM. The data in the log is accessible only to a replication manager of a processor unit, including the replication managers of other processor units 110 a-110 d of the computer system 100 or of processor units 110 a-110 d of another computer system 100, as will be explained in more detail later.
  • In a preferred embodiment, the memory address log in the memory 140 has a defined size and allocation to avoid corruption of the memory 140. Any suitable implementation of such a log may be chosen. A particularly suitable implementation is shown in FIG. 3 a. In this embodiment, the log is defined as a circular buffer 200 in the system memory 140, and has a size 202 defined by the replication manager, which is preferably part of the hypervisor of the processor unit 110 a-110 d. The log 200 is designed to comprise a plurality of memory addresses in memory locations 204. A portion 206 is shown to indicate unused memory locations in the log 200, which comprises the free space available in the defined log 200.
  • In order to facilitate the management of the log 200 during the execution of a VM on the processor unit 110 a-110 d, the computer system 100 includes a set of registers 210 including a first register 212 in which the base address of the circular buffer 200 is stored, a second register 214 in which the next available address of the circular buffer is stored, a third register 216 in which the starting point of the circular buffer 200 is stored and a fourth register 218 in which the size 202 of the circular buffer 200 is stored. The set of registers 210 are located on the respective processor unit 110 a-110 d. In some implementations, the set of registers 210 may form part of the cache controller 122. The registers 210 also include a thread mask 220, which contains a flag for each thread being executed by the respective processor unit 110 a-110 d. The thread mask 220 indicates those threads that relate to a virtual machine that is being backed up. During initialization of the log 200, the replication manager of the processor element 110 a-110 d will populate the registers 212, 214, 216 and 218 and the thread mask 220 with the appropriate values after which execution of the VM(s) on the processor unit 110 a-110 d may start or resume.
  • The hardware architecture of the cache controller 122 is adapted to traverse the cache 120 a-120 d, inspect the VM image modification bit flags 1217, write to the log 200 the memory addresses and thread IDs 1218 of the cache lines 1214 that have a VM image modification flag 1217 set to true, and clear the VM image modification flags 1217 once the corresponding memory addresses have been written to the log 200. The cache controller 122 performs these operations upon the temporary suspension of a VM by the hypervisor of its processor unit 110 a-110 d to facilitate the replication of the VM image, and in response to a signal from the processor unit 110 a-110 d requesting that the memory addresses in the tags 1212 of the modified cache lines 1214 be made available for replication of the VM image.
  • FIG. 3 a shows an arrangement of registers 210 for a processor unit 110 a-110 d that supports four hardware threads in which log records are emitted to a single log 200, with each record being tagged with the thread ID 1218. The per-hardware-thread processor privilege register, which indicates whether a hardware thread is running in hypervisor mode or not, is not shown. Since the address 204 stored in the log 200 is the address of a cache line, any given cache line address can be represented in 64 bits with the least-significant bits spare to contain the thread ID, so a log record can be wholly contained within 64 bits. As described above, cast-outs, snoop interventions and cache clean operations will emit all cache lines with the image modification flag 1217 set to the in-memory log, with the log 200 containing the thread ID and address of the entry.
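  • The 64-bit log record just described can be pictured with the following illustrative helpers, assuming a 128-byte cache line so that the seven least-significant address bits are spare for the thread ID; the macro and function names are not taken from the description.

      #include <stdint.h>

      #define LINE_SHIFT     7                       /* log2(128-byte line) */
      #define THREAD_ID_MASK ((1ull << LINE_SHIFT) - 1)

      static inline uint64_t pack_log_record(uint64_t line_addr,
                                             unsigned thread_id)
      {
          /* line_addr is assumed to be cache-line aligned. */
          return line_addr | ((uint64_t)thread_id & THREAD_ID_MASK);
      }

      static inline uint64_t record_address(uint64_t record)
      {
          return record & ~THREAD_ID_MASK;
      }

      static inline unsigned record_thread_id(uint64_t record)
      {
          return (unsigned)(record & THREAD_ID_MASK);
      }
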
  • When using an embodiment similar to that shown in FIG. 3 b, in which different hardware threads log to different buffers 200, there will be one set of base, producer head, barrier and size registers for each hardware thread. It is not necessary to use an explicit thread mask register, since a null value (such as a zero size) can be used in the existing registers to indicate that backup is disabled for that hardware thread. Cache lines that fit the criteria (backup enabled for the hardware thread, and not running in hypervisor privileged mode) will be marked in the cache with the image modification flag 1217 set and the thread ID indicated, and on cast-out, snoop intervention or cache clean will be written out to one of four logs, with the destination in memory identified by first examining the thread ID associated with that cache line, and then writing the cache line address to the address specified by the producer head register of the appropriate hardware thread.
  • Under both models, any change to the hardware thread-to-VM assignment (for example scheduling a VM to run on a hardware thread on which it was not previously running) would require a cache-clean operation to ensure that any image modification flag data for the virtual machine that was previously running on the hardware thread had been pushed out to the log 200 prior to the switch taking place, and the hypervisor should note at which point in the log the virtual machine switched from one to another, so that the processor unit 110 a-110 d is able to communicate these memory changes to the secondary host in terms of the virtual machine that has undergone modification, rather than the hardware thread that caused the modification.
  • In some implementations, the cache clean operation could be extended to only target specific thread IDs, allowing the operation to selectively clean only the cache lines associated with hardware threads that are being reassigned to another virtual machine. This would reduce the number of unnecessary log entries that would be produced if, for example, three hardware threads were running code for virtual machine 0, and a fourth running code for virtual machine 1. A reassignment to have the fourth hardware thread run code for virtual machine 2 only requires that the cache lines associated with the fourth hardware thread be written to the in-memory buffer before it can start executing code for virtual machine 2.
  • The process of setting the image modification flag 1217 is explained in more detail with the aid of FIG. 4, which shows a flowchart of an example embodiment of such an updating method. After starting the method, the replication manager creates the log 200 in the system memory 140 in step 410 and stores the relevant values of the base address, initial address (starting point), next available address and log size in the registers 212, 214, 216 and 218, as previously explained. The thread mask 220 is also populated, indicating which threads being executed by the processor unit 110 a-110 d relate to virtual machines being backed up. The cache controller 122 subsequently monitors and handles in step 420 accesses to the cache lines in the line memory 124 a-124 d of the cache 120 a-120 d by the processor unit 110 a-110 d (or any other processor unit).
  • In addition, the cache controller 122 performs a number of checks in step 420, which checks have been identified in FIG. 4 as steps 420′, 420″ and 420′″ respectively. In step 420′, the cache controller 122 checks if the cache line access has caused a modification of the accessed cache line, in which case the cache controller sets the flag 1216 signalling the cache line as being dirty, as is well-known per se. In case of such a modification of a cache line, the method proceeds from step 420′ to step 425, in which the cache controller 122 further checks if such a dirty cache line has been generated during the execution of a VM that is being backed up, via reference to the thread mask 220. If this is the case, the cache controller 122 also sets the VM image modification flag 1217, signalling the cache line as being a dirty cache line belonging to a VM image to be backed up, in step 430 before returning to step 420. Any hypervisor actions in privilege mode also do not result in the image modification flag 1217 being set.
  • If the cache access does not lead to the modification of a cache line but instead causes the eviction of a cache line from the cache 120 a-120 d, as checked in step 420″, the method proceeds from step 420″ to step 435 in which the cache controller 122 checks if a cache line to be evicted from the cache 120 a-120 d is flagged as being modified by the VM, i.e. checks if the VM image modification flag 1217 of the cache line to be evicted is set to true. In case such a modified cache line is evicted from the cache, for example because of a fresh cache line requested by the processor unit 110 a-110 d forcing the eviction of a modified stale cache line from the cache 120 a-120 d or because of a further processor unit 110 a-110 d requesting sole access to a modified cache line residing in the cache 120 a-120 d, the cache controller 122, for example using the cast-out engine or the snoop-intervention engine, writes the memory address of the evicted cache line to the log 200 in step 440 so that this modification is captured in the log 200, after which the method returns to step 420. Obviously, when replacing such a cache line 1214 in the cache 120 a-120 d, its flags 1215, 1216 and 1217 are cleared or reset to the values that are appropriate for the fresh cache line. In case the cache access request does not involve the eviction of a cache line, it is further checked in step 420′″ if the cache access request is a request to generate a VM checkpoint. Such a request may originate from the replication manager of the processor unit 110 a-110 d hosting the VM, or may originate from a replication manager of another processor unit responsible for replicating the changes to the primary VM image during the execution of the VM in a secondary VM image. Step 420′″ occurs periodically, at regular intervals such as every 25 ms, so that the secondary VM image is regularly updated. Any suitable checkpoint generation frequency may be chosen.
  • It is noted that the checks 420′, 420″ and 420′″ are shown as a sequence of steps for the sake of clarity only. It should be understood that the cache controller 122 does not have to perform each of these checks to decide what course of action should be taken next. It is for instance equally feasible that the cache controller 122 may immediately recognize that a cache line eviction or a VM image replication is required, in which case the cache controller 122 may proceed from step 420 directly to step 435 or step 460 respectively.
  • Upon detecting the checkpoint generation instruction in step 420′″, the cache controller 122 traverses the cache 120 a-120 d and inspects in step 460 the VM image modification flag 1217 of all cache rows 1210 that comprise such a flag. Upon detection of a VM image modification flag 1217 set to true, the cache controller retrieves the memory address of the associated cache line 1214 from tag 1212 and writes the retrieved memory address into the log 200 in step 470. To this end, the cache controller 122 retrieves the pointer of the next available address in the log 200 from the register 214, for example by fetching this pointer or requesting this pointer from the replication manager of the processor unit 110 a-110 d.
  • At this point, the pointer in register 214 will need updating to ensure that no memory addresses are overwritten. The pointer is updated by the cache controller 122, by the replication manager or by the hypervisor of the processor unit 110 a-110 d, although the latter implementation may negatively impact the performance of the hypervisor in cases where cache lines are frequently evicted. In some implementations, this updating step comprises moving the pointer forward by offsetting the pointer presently stored in the register 214 by the size of the stored memory address and writing this offset value to the register 214.
  • It is furthermore necessary to check if the next available address in the log 200 to be stored in register 214 should be wrapped around to the base address. In some implementations, the cache controller 122 or the replication manager of the processor unit 110 a-110 d will check if the next available address equals the base address plus the size of the log 200 as this indicates that the boundary of the address range of the log 200 in the system memory 140 has been reached. If this is the case, the cache controller 122 or the replication manager of the processor unit 110 a-110 d will set, i.e. wrap around, the next available address to the base address.
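Taken together, the write to the log in steps 440 and 470 and the pointer update and wrap-around just described amount to a circular-buffer append. The sketch below illustrates this under the assumption that each log entry is a 64-bit memory address and that the register contents are modelled as plain variables; the names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical register block for the log (cf. registers 212, 214, 216 and 218). */
typedef struct {
    uint64_t base;   /* base address of the log            */
    uint64_t next;   /* next available address (write ptr) */
    uint64_t start;  /* start of the unprocessed entries   */
    uint64_t size;   /* log size in bytes                  */
} log_regs_t;

/* Steps 440/470 plus the pointer update: write the memory address of a modified
 * or evicted cache line into the log and advance the next-available pointer,
 * wrapping it back to the base address once it reaches base + size. A collision
 * with regs->start (a full log) is prevented by the guard band discussed later. */
static void log_append(log_regs_t *regs, uint64_t line_addr)
{
    *(uint64_t *)(uintptr_t)regs->next = line_addr;   /* store the entry          */
    regs->next += sizeof(uint64_t);                   /* move the pointer forward */
    if (regs->next == regs->base + regs->size)        /* boundary reached?        */
        regs->next = regs->base;                      /* wrap around              */
}
```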
  • After completing step 470, the cache controller 122 subsequently resets the VM image modification flag to false in step 480. Step 480 may be executed at any suitable point in time, for example after each write action to the log 200, or after all write actions to the log 200 have been completed.
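Steps 460 to 480 then reduce to a walk over the cache rows 1210, logging the tag of every row whose flag 1217 is set and clearing the flag once the address has been written. The sketch below repeats the hypothetical types so that it stands alone; the append logic is the same as above.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {                        /* cache row 1210, as in the earlier sketch     */
    uint64_t tag;                       /* tag 1212                                     */
    int valid, dirty, vm_modified;      /* flags 1215, 1216, 1217                       */
} cache_row_t;

typedef struct { uint64_t base, next, start, size; } log_regs_t;  /* registers 212-218 */

/* Steps 460-480: on a checkpoint request, scan the cache rows, write the address
 * (tag) of every line whose VM image modification flag is set into the log,
 * advancing and wrapping the next-available pointer as before, then reset the flag. */
static void checkpoint_scan(cache_row_t *rows, size_t n_rows, log_regs_t *regs)
{
    for (size_t i = 0; i < n_rows; i++) {
        if (!rows[i].valid || !rows[i].vm_modified)
            continue;
        *(uint64_t *)(uintptr_t)regs->next = rows[i].tag;   /* step 470       */
        regs->next += sizeof(uint64_t);
        if (regs->next == regs->base + regs->size)
            regs->next = regs->base;                        /* wrap around    */
        rows[i].vm_modified = 0;                            /* step 480       */
    }
}
```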
  • At this point, it is reiterated that any suitable cache architecture may be used for the cache 120 a-120 d. Such architectures may include different types of caches, such as a combination of a write-through cache and one or more write-back caches. A write-through cache retains data in the cache and at the same time, synchronously, pushes the data into a next level of the cache. This provides fast access times for subsequent reads of the cache lines 1214 by the processor unit 110 a-110 d at the cost of slower write actions, as the writer has to wait for the acknowledgement that the write action has been completed in the (slower) next level cache. By definition, a write-through cache does not contain dirty cache lines, as the cache lines are ‘cleaned up’ in one of the next level caches. Hence, where an embodiment of the present inventive subject matter includes a cache architecture including a write-through cache, the VM image modification flags 1217 may be omitted from the write-through cache and may be added to only those caches that can contain dirty cache lines, that is the write-back caches that do not push modified cache lines to a next level cache but are responsible for managing data coherency between caches and memory 140 as a consequence. Step 460 is typically applied to all caches in the cache architecture that have cache rows 1210 containing the VM image modification flag 1217, therefore all write-back caches.
  • At this point, the replication manager may trigger the replication of the VM image in memory 140 to another memory location, such as another memory or cache, by accessing the log 200, fetching the addresses stored in the log 200, fetching the cache lines stored at the fetched addresses and updating a copy of the VM image in the other memory location with the fetched cache lines, as previously explained.
  • It should be understood that the replication manager triggering the flush of the cache line addresses and the subsequent update of the secondary image of the VM does not have to be the replication manager of the processor unit 110 a-110 d running the VM. In an embodiment, the replication manager of another processor unit 110 a-110 d of the computer system 100 may be in charge of this update process.
  • Generally, the embodiments in which the processor unit in charge of the VM image update process resides on the same computer system 100 as the processor unit 110 a-110 d running the VM can be seen as embodiments in which the modified cache lines are pushed to another memory location. In some implementations, modified cache lines may instead be pulled from their primary memory location by a processor unit on a separate computer system, such as a processor unit responsible for hosting a secondary version of the VM, i.e. a processor unit to which the VM fails over, for example in case of a hardware failure of the processor unit hosting the primary VM. In such an implementation (as well as in an implementation where a different processor unit of the computer system hosting the VM is in charge of the VM image replication process), the processor unit 110 a-110 d hosting the VM forwards the data relevant to the replication of its VM image in memory 140, including the values stored in the registers 212, 214, 216 and 218, to the replication manager of another processor unit, for example a processor unit in a different computer system, to allow this further replication manager to retrieve the altered cache lines using the addresses in the log 200, as will be explained in more detail later.
  • Upon writing the memory addresses of the modified cache lines 1214 in the log 200 in step 470, the method may further comprise the optional step of deduplicating addresses in the log 200 to remove multiple instances of the same address in the log 200. This can occur, for instance, if the frequency at which memory addresses are written to the log 200 is higher than the frequency at which the memory addresses in the log 200 are used to update a secondary VM image.
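One straightforward way to realise this optional deduplication, shown below purely as an illustration, is to sort a snapshot of the unprocessed log entries and drop adjacent repeats; a hardware or firmware implementation may of course proceed differently.

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Deduplicate a snapshot of the logged addresses in place and return the number
 * of unique entries. The caller is assumed to have copied the unprocessed region
 * of the log 200 (between the initial address and the next available address)
 * into 'addrs'. */
static size_t dedup_addresses(uint64_t *addrs, size_t n)
{
    if (n == 0)
        return 0;
    qsort(addrs, n, sizeof(uint64_t), cmp_u64);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (addrs[i] != addrs[out - 1])
            addrs[out++] = addrs[i];   /* keep only the first occurrence */
    return out;
}
```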
  • At this point, it is noted that the process of FIG. 4 has been described assuming that a primary VM is hosted by a single processor unit 110 a-110 d. It is emphasized that this is by way of non-limiting example only. It is for instance equally feasible that a VM is hosted by several processor units 110 a-110 d, for example several microprocessor cores, in which case several logs 200 (one for each core) may be maintained that track different modifications to the VM image in memory 140. In such a scenario, the optional deduplication step may for instance be performed over all logs 200 such that a memory address occurs only once in the combined logs 200 to reduce the amount of data that needs to be copied to the secondary VM during a differential checkpoint generation. The checkpoint generation may further require synchronization of other relevant states between the primary and secondary VMs, for example the state of the CPU, I/O involving disk(s) and network and so on.
  • The flowchart of FIG. 4 describes an example embodiment of a first operating mode of a processor unit 110 a-110 d, which may be referred to as a producer mode, in which the processor unit 110 a-110 d produces the relevant data required for the replication of the image of the VM in the memory 140 to a copy of this image in, for example, the memory of another computer system. A processor unit 110 a-110 d can also operate in a second operating mode, in which it does not host a VM but is instead responsible for replicating the image of a primary VM. This second operating mode may be referred to as a consumer mode, as a processor unit 110 a-110 d in this mode is adapted to consume the modified cache lines in the VM image produced by a processor unit 110 a-110 d executing the VM in its first operating mode or producer mode.
  • For instance, a further processor unit 110 a-110 d of the computer system 100 including the processor unit 110 a-110 d hosting the VM may be responsible for updating a replica of the VM image in a further location, for example, a memory of another computer system. In some implementations, the processor unit 110 a-110 d hosting the VM may switch between operating modes to assume responsibility for updating this replica. In yet another implementation, a processor unit of another computer system, for example the computer system on which the replica is stored, is responsible for updating this replica of the VM image.
  • The update of the VM image replica ensures that a processor unit 110 a-110 d of a computer system 100 storing the replica in its memory can take over execution of the VM upon a hardware failure in the computer system 100 hosting the primary VM, i.e. a failure that terminates the execution of the primary VM on that system.
  • In some implementations, the second operating mode is not a separate operating mode but forms part of the first operating mode, in which case the processor unit 110 a-110 d responsible for the execution of the primary VM also is responsible for updating the replica of the VM in the further memory location.
  • It should be understood that in a computer cluster comprising multiple computer systems 100, some processor units 110 a-110 d may be in producer mode (i.e. VM hosting mode) whilst other processor units 110 a-110 d are in consumer mode (i.e. in VM image replication mode). Even a single computer system in such a cluster may comprise processor units 110 a-110 d in producer mode as well as in consumer mode, as previously explained. In some implementations, the replication manager may control whether a processor unit 110 a-110 d is in producer mode or consumer mode, for example by setting a hardware flag for the processor unit 110 a-110 d such that it can be recognized in which mode a processor unit 110 a-110 d is operating.
  • FIG. 5 depicts a flowchart of the method steps performed during such a second operating mode of a processor unit 110 a-110 d. In the consumer mode, a processor unit 110 a-110 d, for example the replication manager of the processor unit 110 a-110 d, receives the relevant information from the replication manager of the processor unit 110 a-110 d in producer mode, such as the contents of the registers 212, 214, 216 and 218 that will allow the replication manager of the consumer processor unit 110 a-110 d to access the memory 140 of the computer system 100 including the producer processor unit 110 a-110 d. The replication manager of the producer processor unit 110 a-110 d may volunteer the relevant information or may provide the relevant information upon a request thereto by the replication manager of the consumer processor unit 110 a-110 d. In an implementation where the processor unit 110 a-110 d hosting the VM also acts as the processor unit responsible for updating the secondary VM image, the above step may be omitted.
  • Upon retrieving the relevant information, the consumer processor unit 110 a-110 d retrieves, in step 510, the memory addresses stored in the log 200 created by the replication manager of the producer processor unit 110 a-110 d hosting the primary VM and obtains the modified cache lines identified by these memory addresses in step 520. To this end, the consumer processor unit may send a data retrieval request over the bus architecture 130. Such requests are noticed by the cache controllers 122 of the computer system 100, for example by the snoop-intervention engines of the cache controllers 122, which will fetch the cache line 1214 from the cache 120 a-120 d if the memory address in the data retrieval request matches a memory address in one of the tags 1212 of the cache rows 1210 of the cache 120 a-120 d. The requesting processor unit 110 a-110 d will typically await the response from a cache controller 122 of a further processor unit 110 a-110 d for a defined period of time, after which the cache controller 122 of the requesting processor unit 110 a-110 d will fetch the cache line from the memory 140, as a non-response from the other cache controllers 122 means that the cache line 1214 no longer resides in a cache but has been cast out of the cache 120 a-120 d instead. The handling of such data retrieval requests in a computer system 100 comprising multiple processor units 110 a-110 d and caches 120 a-120 d may be accomplished using any suitable data retrieval protocol.
  • The consumer processor unit 110 a-110 d subsequently updates the copy of the VM image accordingly in step 530 by inserting the obtained modified cache line 1214 in the appropriate location of the VM image copy. This process is repeated until all addresses have been retrieved from the log 200 as checked in step 540, after which other state registers, if any, for example state registers of the CPU as previously explained, may be replicated as shown in step 550. At this point, the consumer processor unit 110 a-110 d may signal the producer processor unit 110 a-110 d hosting the primary VM that replication is complete, upon which the producer processor unit 110 a-110 d hosting the primary VM, for example its hypervisor, will terminate the suspension of the primary VM and reinitialize the log 200, resetting one or more of the registers 212, 214 and 216 in the cache management module 122.
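Conceptually, steps 510 to 540 therefore drain the unprocessed region of the log and copy the corresponding cache-line-sized blocks into the secondary image, as in the simplified sketch below. The cache/memory data retrieval protocol is abstracted into a plain memcpy from the primary image, the images are assumed to be indexed by guest-physical address, the 128-byte line size is an assumption, and the advance of the start register anticipates step 610 of FIG. 6 discussed below.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_SIZE 128u   /* illustrative assumption */

typedef struct { uint64_t base, next, start, size; } log_regs_t;  /* registers 212-218 */

/* Steps 510-540 (and step 610 of FIG. 6): walk the unprocessed log entries, copy
 * each modified cache line from the primary image into the secondary image, and
 * advance the start-of-unprocessed-entries register so the space can be reused
 * by the producer. */
static void consume_log(log_regs_t *regs, uint8_t *primary, uint8_t *secondary)
{
    while (regs->start != regs->next) {                        /* step 540: entries left? */
        uint64_t line_addr = *(const uint64_t *)(uintptr_t)regs->start;  /* step 510 */
        memcpy(secondary + line_addr,                          /* step 530: update copy   */
               primary + line_addr, CACHE_LINE_SIZE);          /* step 520: fetch line    */
        regs->start += sizeof(uint64_t);                       /* step 610: free the slot */
        if (regs->start == regs->base + regs->size)
            regs->start = regs->base;                          /* wrap around             */
    }
}
```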
  • It should be immediately apparent to the skilled person that various modifications may be possible to the method shown in FIG. 5. For instance, the consumer processor unit 110 a-110 d may have permission to deduplicate the addresses in the log 200 of the producer processor unit 110 a-110 d hosting the primary VM prior to retrieving the memory addresses from the log 200 in step 510.
  • In some implementations, a processor unit 110 a-110 d in the second operating mode, i.e. consumer mode, is adapted to speculatively process the log 200 of a processor unit 110 a-110 d in the first operating mode, i.e. producer mode. This implementation is for instance useful when the consumer processor unit does not trigger the cache controller 122 of the producer processor unit to write the modified cache line addresses to the log 200, for example in case the producer processor unit hosting the VM periodically triggers the update of the log 200. This allows the duration of the suspension of the primary VM to be further reduced as part of the log 200 will already have been processed by the consumer processor unit 110 a-110 d when the producer processor unit 110 a-110 d suspends the VM following the request to generate a checkpoint in step 420′″.
  • An example flowchart of this implementation is shown in FIG. 6. In the process of FIG. 6, several steps are identical to the method of FIG. 5, and these steps will therefore not be explained again for the sake of brevity. In steps 510, 520 and 530 of FIG. 6, the consumer processor unit 110 a-110 d retrieves a memory address from the log 200 of the processor unit 110 a-110 d hosting the primary VM, retrieves the data from the memory 140 in the computer system 100 of the producer processor unit 110 a-110 d and updates the secondary VM image as previously explained. In additional step 610, the consumer processor unit 110 a-110 d invokes the update of the initial address value of the log 200 as stored in register 216 associated with the producer processor unit 110 a-110 d hosting the primary VM. This may be achieved in any suitable way, for example by providing the replication manager of the consumer processor unit 110 a-110 d with write privileges to update this register or by the consumer processor unit 110 a-110 d instructing the replication manager of the producer processor unit 110 a-110 d to update this register value accordingly.
  • Step 610 ensures that the available space in the log 200 of the processor unit 110 a-110 d hosting the primary VM is kept up-to-date, as the addresses already retrieved by the consumer processor unit 110 a-110 d may now be overwritten; this is indicated by the change of the initial address stored in the register 216 associated with the producer processor unit 110 a-110 d hosting the primary VM to the first address in the log 200 not yet processed by the consumer processor unit 110 a-110 d. When the primary VM becomes suspended, as checked in step 620, and all addresses have been retrieved from the log 200, the method may proceed to step 550 as previously explained in the detailed description of FIG. 5.
  • In some implementations, once the primary VM has been suspended, step 610 may be omitted from the remainder of the process of FIG. 6: it is no longer necessary to update the initial address value of the log 200 stored in the register 216 associated with the producer processor unit 110 a-110 d hosting the primary VM, as no further addresses will be written to the log 200 and the log 200 will be re-initialized prior to the reactivation of the primary VM.
  • FIG. 7 schematically depicts a computer cluster 700 that comprises a plurality of computer systems 100, which are communicatively coupled to each other via a network 720. The network 720 may be any suitable data communication network, for example a wired or wireless local area network, a wireless or wired wide area network, the Internet and so on. The computer cluster 700 is typically adapted to host a plurality of virtual machines on the processor units 110 a-110 d of the various computer systems 100 to be utilized by the users of the computer cluster 700. The computer cluster 700 benefits from the VM replication principles described above in that multiple up-to-date or mirror images of a VM may be generated in the respective memories 140 of at least some of the various computer systems 100 such that rapid VM failover can be provided with little overhead.
  • The above description describes modifying the cache hardware so that, at regular intervals, the circular buffer 200 in memory contains a list of all memory locations that have been modified by a given processor core since the last checkpoint. This is achieved by modifying the cast-out engine and the snoop-intervention engine so that memory addresses leaving the cache between checkpoints are stored in the log 200; at a checkpoint, a cache flush is initiated to ensure that no modified data remains in the cache (thereby ensuring that dirty cache lines pass through the cast-out engine and are thus logged). If the circular buffer 200 becomes full, either a full re-sync of memory or an immediate failover to the secondary system must occur. This problem is addressed by ensuring that there is always sufficient space in the circular buffer 200 to accept any dirty data in the cache.
  • As shown in FIG. 8 a, the circular buffer 200 contains some unprocessed log entries, the producer core writes new entries to the log 200, and the registers indicate the locations of the start and end of the unprocessed log entries. When the log entries reach the end of the buffer, they wrap around to the beginning. As the consumer core processes entries, the “Unprocessed log entries start here” register is updated, as shown in FIG. 8 b. If the consumer core is unable to process the entries with sufficient speed, the producer core's new entries can collide with the unprocessed log entries, and this is the point at which a re-sync or failover must occur.
  • However, to avoid this occurring, a guard band is used: a threshold amount of the available space between the current location to which new log entries are written and the start of the unprocessed entries, as shown in FIG. 8 c. If the head of the log entries moves to within the guard band, an interrupt is triggered. The size of the guard band may be static or dynamic in nature. The guard band is large enough to contain all the data that might be emitted as part of a checkpoint. This means that when an interrupt is delivered on entry to the guard band, execution of the producer core can be halted and a cache flush initiated. At this point, all of the required log entries are in the circular buffer, and the producer core can be resumed once the consumer core has processed enough log entries to clear the backlog. This avoids the need for a full memory re-sync or failover in the event that the consumer core is unable to keep up with the producer core.
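Under these assumptions, the guard-band check amounts to comparing the free space remaining in the circular buffer with the guard-band size after each log write, as in the following sketch in which the interrupt is modelled as a callback; the names are illustrative.

```c
#include <stdint.h>

typedef struct { uint64_t base, next, start, size; } log_regs_t;  /* registers 212-218 */

/* Free space between the write pointer and the start of the unprocessed entries,
 * taking the wrap-around of the circular buffer into account. When next == start
 * the buffer is treated as empty; the guard band prevents it from ever becoming
 * completely full. */
static uint64_t log_free_space(const log_regs_t *r)
{
    return (r->start > r->next) ? (r->start - r->next)
                                : (r->size - (r->next - r->start));
}

/* Called after each log write: if the head of the log has moved to within the
 * guard band, raise the interrupt that halts the producer core and triggers the
 * cache flush described above. */
static void check_guard_band(const log_regs_t *r, uint64_t guard_band_bytes,
                             void (*raise_interrupt)(void))
{
    if (log_free_space(r) <= guard_band_bytes)
        raise_interrupt();
}
```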
  • In some implementations, the guard band can be sized statically based on the worst-case possibility that, at the point where the guard band is reached: all logged caches are full of dirty data; all instructions in the CPU pipeline that have been issued but have not yet completed are “store”-type instructions, each of which will push out a dirty cache line (and thus emit a log entry) and create a new dirty cache line; and, in the time it takes for the interrupt to be delivered from the producer's cache to the producer core, a certain number of new instructions will be issued, each of which is also a “store”-type operation that pushes out a dirty cache line (and thus emits a log entry) and creates a new dirty cache line. Thus, in an implementation with a write-through L1 cache and write-back L2 and L3 caches, the required guard band size is:

  • sizeof(L2)+sizeof(L3)+worstcase(PIPELINE)+worstcase(INTERRUPT)
  • All of these elements are computable based on the architecture of a given microprocessor. In a further implementation, the cache-size related elements can be computed based on the number of dirty cache lines currently in the cache, rather than the worst-case number. This means that the size of the guard band can vary dynamically based on the number of log entries that would be emitted during the cache flush operation at the checkpoint. This count is straightforward to maintain within the cache, which is already responsible both for tracking its own fullness and for ensuring that the guard band is not reached. The PIPELINE and INTERRUPT portions of the calculation remain constant.
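The static (worst-case) and dynamic (occupancy-based) sizing rules can be written down directly, as in the sketch below; the line counts and latency terms are per-architecture constants, the dirty-line count is assumed to be maintained by the cache hardware, and the 8-byte entry size is an assumption.

```c
#include <stdint.h>

#define LOG_ENTRY_SIZE 8u   /* one 64-bit address per log entry (assumption) */

/* Static worst case: every line of every logged (write-back) cache is dirty, every
 * in-flight pipeline instruction is a store that evicts a dirty line, and every
 * instruction issued while the interrupt is being delivered does the same:
 * sizeof(L2) + sizeof(L3) + worstcase(PIPELINE) + worstcase(INTERRUPT). */
static uint64_t guard_band_static(uint64_t l2_lines, uint64_t l3_lines,
                                  uint64_t pipeline_depth,
                                  uint64_t interrupt_latency_instrs)
{
    return (l2_lines + l3_lines + pipeline_depth + interrupt_latency_instrs)
           * LOG_ENTRY_SIZE;
}

/* Dynamic variant: replace the cache-size terms with the number of cache lines
 * that are currently dirty; the PIPELINE and INTERRUPT terms stay constant. */
static uint64_t guard_band_dynamic(uint64_t dirty_lines_now,
                                   uint64_t pipeline_depth,
                                   uint64_t interrupt_latency_instrs)
{
    return (dirty_lines_now + pipeline_depth + interrupt_latency_instrs)
           * LOG_ENTRY_SIZE;
}
```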
  • It should be understood that in the context of the present inventive subject matter, a computer system is to be interpreted as a device that includes a collection of processor elements that can be utilized in unison. This does not necessarily equate to a single physical entity; it is equally feasible that a computer system is distributed over several physical entities, for example different boxes, or that a single physical entity includes more than one computer system, for example several separate groups of processor units.
  • As will be appreciated by one skilled in the art, aspects of the present inventive subject matter may be embodied as a system, method or computer program product. Accordingly, aspects of the present inventive subject matter may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present inventive subject matter may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present inventive subject matter may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Aspects of the present inventive subject matter are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to implementations of the inventive subject matter. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present inventive subject matter. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • While particular implementations of the present inventive subject matter have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of the present inventive subject matter.

Claims (7)

What is claimed is:
1. A method comprising:
indicating, in a log, updates to memory of a virtual machine when the updates are evicted from a cache of the virtual machine;
determining a guard band for the log, wherein the guard band indicates a threshold amount of free space for the log;
determining that the guard band will be or has been encroached upon corresponding to indicating an update in the log;
updating a backup image of the virtual machine based, at least in part, on a set of one or more entries of the log, wherein the set of entries is sufficient to comply with the guard band; and
removing the set of entries from the log.
2. The method of claim 1, wherein said determining a guard band comprises:
determining a number of write-back cache lines in the cache;
determining a number of instructions in a pipeline for a processor unit that executes instructions issued by the virtual machine;
determining a number of additional instructions capable of being issued to the pipeline in the time taken to trigger an interrupt of the processor unit; and
defining the guard band based on a sum of the determined number of write-back cache lines, the determined number of instructions, and the determined number of additional instructions.
3. The method of claim 1, wherein said determining a guard band comprises:
determining a number of dirty cache lines in the cache;
determining a number of store instructions in a pipeline for a processor unit that executes instructions issued by the virtual machine;
determining a number of additional instructions capable of being issued to the pipeline in the time taken to trigger an interrupt of the processor unit; and
defining the guard band based on a sum of the determined number of dirty cache lines, the determined number of store instructions, and the determined number of additional instructions.
4. The method of claim 3 further comprising redefining the guard band in response to determining that another cache line has become dirty or that an additional store instruction has been issued to the pipeline.
5. The method of claim 1, wherein the cache of the virtual machine comprises a write-through cache, wherein said indicating in the log updates to memory of a virtual machine is in response to updates by the virtual machine to a cache line in the cache.
6. The method of claim 1 further comprising marking a cache line in the cache of the virtual machine for logging in response to modification of the cache line.
7. The method of claim 1, wherein each of the set of entries indicates a memory address of the virtual machine and data written to the memory address of the virtual machine, wherein updating the backup image based, at least in part, on the set of entries comprises indicating, for each memory address and data indicated in the set of entries, the data for updating a corresponding memory address of the backup image.
US14/727,245 2013-11-21 2015-06-01 Virtual machine backup Abandoned US20150378770A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/727,245 US20150378770A1 (en) 2013-11-21 2015-06-01 Virtual machine backup

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB1320537.2A GB2520503A (en) 2013-11-21 2013-11-21 Virtual machine backup
GB1320537.2 2013-11-21
US14/548,624 US9519502B2 (en) 2013-11-21 2014-11-20 Virtual machine backup
US14/727,245 US20150378770A1 (en) 2013-11-21 2015-06-01 Virtual machine backup

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/548,624 Continuation US9519502B2 (en) 2013-11-21 2014-11-20 Virtual machine backup

Publications (1)

Publication Number Publication Date
US20150378770A1 true US20150378770A1 (en) 2015-12-31

Family

ID=49883941

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/548,624 Expired - Fee Related US9519502B2 (en) 2013-11-21 2014-11-20 Virtual machine backup
US14/727,245 Abandoned US20150378770A1 (en) 2013-11-21 2015-06-01 Virtual machine backup

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/548,624 Expired - Fee Related US9519502B2 (en) 2013-11-21 2014-11-20 Virtual machine backup

Country Status (2)

Country Link
US (2) US9519502B2 (en)
GB (1) GB2520503A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9823842B2 (en) 2014-05-12 2017-11-21 The Research Foundation For The State University Of New York Gang migration of virtual machines using cluster-wide deduplication
CN107807839A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment for changing virtual machine memory data
WO2024056581A1 (en) * 2022-09-15 2024-03-21 International Business Machines Corporation Virtual machine failover with disaggregated shared memory

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11307854B2 (en) * 2018-02-07 2022-04-19 Intel Corporation Memory write log storage processors, methods, systems, and instructions
US10757215B2 (en) * 2018-09-13 2020-08-25 Pivotal Software, Inc. Allocation of computing resources for a computing system hosting multiple applications
CN109462651B (en) * 2018-11-19 2021-11-19 郑州云海信息技术有限公司 Method, device and system for cloud-up of mirror volume data and readable storage medium
US11782713B1 (en) * 2019-08-27 2023-10-10 Amazon Technologies, Inc. Security vulnerability mitigation using address space co-execution
CN111158858A (en) * 2019-12-26 2020-05-15 深信服科技股份有限公司 Cloning method and device of virtual machine and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5666514A (en) * 1994-07-01 1997-09-09 Board Of Trustees Of The Leland Stanford Junior University Cache memory containing extra status bits to indicate memory regions where logging of data should occur
US8407518B2 (en) * 2007-10-26 2013-03-26 Vmware, Inc. Using virtual machine cloning to create a backup virtual machine in a fault tolerant system
US8656388B2 (en) * 2010-09-30 2014-02-18 Avaya Inc. Method and apparatus for efficient memory replication for high availability (HA) protection of a virtual machine (VM)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9823842B2 (en) 2014-05-12 2017-11-21 The Research Foundation For The State University Of New York Gang migration of virtual machines using cluster-wide deduplication
US10156986B2 (en) 2014-05-12 2018-12-18 The Research Foundation For The State University Of New York Gang migration of virtual machines using cluster-wide deduplication
CN107807839A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment for changing virtual machine memory data
WO2024056581A1 (en) * 2022-09-15 2024-03-21 International Business Machines Corporation Virtual machine failover with disaggregated shared memory

Also Published As

Publication number Publication date
US9519502B2 (en) 2016-12-13
GB201320537D0 (en) 2014-01-01
GB2520503A (en) 2015-05-27
US20150143055A1 (en) 2015-05-21

Similar Documents

Publication Publication Date Title
US10649853B2 (en) Tracking modifications to a virtual machine image that occur during backup of the virtual machine
US9058195B2 (en) Virtual machines failover
US9519502B2 (en) Virtual machine backup
US9069701B2 (en) Virtual machine failover
US10339009B2 (en) System for flagging data modification during a virtual machine backup
US20150095585A1 (en) Consistent and efficient mirroring of nonvolatile memory state in virtualized environments
US9792208B2 (en) Techniques for logging addresses of high-availability data via a non-blocking channel
US11256533B1 (en) Transparent disk caching for virtual machines and applications
US20150095576A1 (en) Consistent and efficient mirroring of nonvolatile memory state in virtualized environments
US9430382B2 (en) Logging addresses of high-availability data
US9280465B2 (en) Techniques for moving checkpoint-based high-availability log and data directly from a producer cache to a consumer cache
US20190050228A1 (en) Atomic instructions for copy-xor of data
US20230409472A1 (en) Snapshotting Pending Memory Writes Using Non-Volatile Memory
CN115599594A (en) Error containment for enabling local checkpointing and recovery

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GUTHRIE, GUY L.;NAYAR, NARESH;NORTH, GERAINT;AND OTHERS;SIGNING DATES FROM 20141113 TO 20141124;REEL/FRAME:035757/0573

AS Assignment

Owner name: GLOBALFOUNDRIES U.S. 2 LLC, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:037409/0869

Effective date: 20151028

AS Assignment

Owner name: GLOBALFOUNDRIES INC., CAYMAN ISLANDS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GLOBALFOUNDRIES U.S. 2 LLC;GLOBALFOUNDRIES U.S. INC.;SIGNING DATES FROM 20151208 TO 20151214;REEL/FRAME:037542/0087

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: GLOBALFOUNDRIES U.S. INC., NEW YORK

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:WILMINGTON TRUST, NATIONAL ASSOCIATION;REEL/FRAME:056987/0001

Effective date: 20201117