WO2008121573A1 - Managing write caching - Google Patents

Managing write caching

Info

Publication number
WO2008121573A1
Authority
WO
WIPO (PCT)
Prior art keywords
write
storage processor
cache
data
storage
Application number
PCT/US2008/057631
Other languages
French (fr)
Inventor
David R. French
H. Austin Spang IV
Original Assignee
EMC Corporation
Application filed by EMC Corporation
Publication of WO2008121573A1


Classifications

    • G06F11/2089: Redundant storage control functionality
    • G06F11/2092: Techniques of failing over between control units
    • G06F11/1666: Error detection or correction of the data by redundancy in hardware where the redundant component is memory or memory area
    • G06F11/1441: Resetting or repowering (saving, restoring, recovering or retrying at system level)
    • G06F11/2056: Error detection or correction of the data by redundancy in hardware using active fault-masking, where persistent mass storage functionality or persistent mass storage control functionality is redundant, by mirroring


Abstract

Write caching is managed. Write-back caching operations are performed using a cache of a storage processor. After a failure of the storage processor, it is determined whether the cache includes a latest valid copy of cache data.

Description

MANAGING WRITE CACHING
BACKGROUND
Data storage systems are used within computer networks and systems to store large amounts of data, e.g., data that is used by multiple servers and client computers ("hosts"). Generally, one or more servers are connected to the storage system, e.g., to supply data to and from a computer network. The data may be transferred through the network to various users or clients. The data storage system generally comprises a controller ("storage processor") that interacts with one or more storage devices such as one or more magnetic disk drives or other forms of data storage. To facilitate uninterrupted operation of the server as it reads and writes data from and to the storage system, as well as executes application programs for use by users, the storage system comprises a write cache that allows data from the server to be temporarily stored in the write cache prior to being written to a storage device. As such, the server can send data to the storage system and quickly be provided an acknowledgement that the storage system has stored the data. The acknowledgement is sent even though the storage system has only stored the data in the write cache and is waiting for an appropriate, convenient time to store the data in a storage device. As is well known in the art, storing data to a write cache is much faster than storing data directly to a disk drive. Consequently, the write cache buffers a large amount of data in anticipation of subsequent storing of that data in a storage device.
One conventional data storage system includes two storage processors and an array of disk drives. Each storage processor includes, among other things, a local write cache. The local write caches mirror each other.
During operation, the storage processors perform read and write operations on behalf of one or more external hosts. Since the contents of the local write caches are mirrored, the storage processors initially attend to write operations in a write-back manner. That is, the write policy employed by the storage processors involves acknowledging host write operations once the write data is stored in both write caches. By the time the external hosts receive such acknowledgement, the storage processors may not have yet evicted the write data from the write caches to the array of disk drives.
If one of the storage processors fails during operation of the data storage system (e.g., a hardware failure, a software failure, a loss of power to one of the storage processors, etc.), the remaining storage processor vaults the contents of its local write cache to one or more magnetic disk drives, and then disables its local write cache. The remaining storage processor then destages the vaulted write cache contents (which are now stored on the magnetic disk drive) to the array of disk drives, i.e., the remaining storage processor empties the vaulted write cache contents by storing the vaulted write data to the array of disk drives.
It should be understood that the remaining storage processor is capable of performing host read and write operations while the remaining storage processor destages the vaulted write data contents and after such destaging is complete. For example, the remaining storage processor now carries out write operations in a write-through manner where the remaining storage processor stores new write data from an external host to the array of disk drives before acknowledging that the write operation is complete.
It should be further understood that the remaining storage processor vaults the contents of its write cache to the magnetic disk drive and disables its write cache so that a second failure will not result in loss of the cached write data. For example, suppose that the remaining storage processor subsequently encounters a software failure after vaulting the write cache to the magnetic disk drive. When the remaining storage processor recovers from the software failure (i.e., performs a soft reset), the remaining storage processor overwrites its local write cache. In particular, Basic Input/Output System (BIOS) firmware directs the remaining storage processor to clear and test its local write cache. Additionally, the remaining storage processor uses at least a portion of its local write cache for temporarily holding Power-On Self Test (POST) code for running a Power-On Self Test. Although the contents of the local write cache have been overwritten, no write data is lost since the previously-cached write data was immediately vaulted to the magnetic disk drive and since all subsequently received write data is processed in a write-through manner.
SUMMARY
Write caching is managed. Write-back caching operations are performed using a cache of a storage processor. After a failure of the storage processor, it is determined whether the cache includes a latest valid copy of cache data. One or more implementations of the invention may provide one or more of the following advantages.
Data corruption can be avoided when write cache data is persisted through a reset or reboot. Host data can be protected when one or more storage processors experience failures.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a block diagram of a data storage system which is configured to continue write caching after a storage processor failure.
Fig. 2 is a block diagram of a storage processor of the data storage system of Fig. 1.
Fig. 3 is a flow diagram of a procedure for use in the data storage system.
DETAILED DESCRIPTION
Conventional vaulting involves transitioning from a write-caching mode when two storage processors are available to a write-through mode when one of the storage processors fails but the other storage processor remains operational. For example, the remaining storage processor requires time to vault contents of its local write cache to a magnetic disk drive. (With respect to magnetic disk drives referenced herein, it is to be understood that other persistent storage such as solid state disks, flash drives, and/or optical disks may be used as well or instead.) During this time, the remaining storage processor and thus the data storage system as a whole is completely or largely unavailable to attend to further write or read operations from external hosts. Accordingly, a write or read operation submitted by an external host during this write cache vaulting event is likely to time out.
Additionally, the process of destaging the vaulted write cache contents from the magnetic disk drive to the array of disk drives may take a considerable amount of time (e.g., several hours). During this time, there is a significant time latency associated with processing a new write operation. In particular, if the remaining storage processor receives new write data from an external host computer, the remaining storage processor first accesses the vaulted write cache contents from the magnetic disk drive to see if the new write operation pertains to any vaulted write data from an earlier write operation in order to maintain data integrity. If such vaulted write data exists and if the vaulted write data must be preserved in the event that the new write is aborted, the remaining storage processor completes the earlier write operation, i.e., writes the vaulted write data to the array of disk drives. Then, the remaining storage processor processes the new write operation in a write-through manner.
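By way of illustration only, the following minimal Python sketch captures the conventional overlap check described above; the names (VaultIndex, flush_vaulted_range, write_to_array) are assumptions made for this sketch and are not part of the application.

```python
# Hypothetical sketch of the conventional behavior: before a new host write is
# processed in write-through mode, any vaulted (not yet destaged) data that
# overlaps the write is flushed to the array of disk drives.

class VaultIndex:
    """Tracks which logical block ranges still hold vaulted, un-destaged data."""

    def __init__(self):
        self.ranges = []  # list of (lun, start_lba, length) entries still in the vault

    def overlapping(self, lun, start_lba, length):
        end = start_lba + length
        return [r for r in self.ranges
                if r[0] == lun and r[1] < end and (r[1] + r[2]) > start_lba]


def handle_write_during_destage(vault, lun, start_lba, data,
                                flush_vaulted_range, write_to_array):
    """Complete overlapping earlier (vaulted) writes, then write through to the array."""
    for entry in vault.overlapping(lun, start_lba, len(data)):
        flush_vaulted_range(entry)          # finish the earlier write operation
        vault.ranges.remove(entry)
    write_to_array(lun, start_lba, data)    # write-through: data reaches the array first
    return "ACK"                            # only now is the host acknowledged
```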
Furthermore, after the remaining storage processor has finished destaging the vaulted write cache contents from the magnetic disk drive (or alternatively from a mini array of drives holding the vault) to the array of disk drives, the remaining storage processor continues to operate in write-through mode, thus passing a relatively large performance hit on to the external hosts. That is, the remaining storage processor stores write data onto the array of disk drives prior to acknowledging completion of write operations. Such operation results in significant latency compared to response times for processing write data in write-back mode.
Moreover, there may be instances where the failed storage processor encounters only a minor failure (e.g., a software anomaly resulting in a soft reset). In such a situation, the failed storage processor is able to quickly recover from the failure, e.g., a couple of minutes to reboot and perform self-tests. Nevertheless, to prevent the recovered storage processor from interfering with the vault destaging process, the recovered storage processor cannot rejoin the data storage system until the remaining storage processor has completed destaging the vaulted write cache contents to the array of disk drives.
Accordingly, the recovered storage processor must remain sidelined during the vault destaging process which may take several hours to complete.
In contrast to conventional vaulting which involves transitioning from a write-caching mode when two storage processors are available to a write-through mode when one of the storage processors fails but the other storage processor remains, an enhanced technique for responding to a storage processor failure involves continuing to perform write-back caching operations while the failed storage processor remains unavailable. Such a technique alleviates the need for the remaining storage processor to vault cached write data to a magnetic disk and then destage the vaulted write data in response to the failed storage processor. Furthermore, such a technique provides better response time on new host write operations than write-through caching which is performed following the failure in conventional vaulting. Aspects of the technique are described in U.S. patent application serial no. 11/529,124, filed September 28, 2006, entitled RESPONDING TO A STORAGE PROCESSOR FAILURE WITH CONTINUED WRITE CACHING, attorney docket no. EMC-06-329, which is assigned to the same assignee as the present invention, and which is hereby incorporated herein by reference.
In another aspect of the technique as described below, in the event that a storage processor that is performing non-mirrored write caching has a sudden catastrophic failure ("panics"), the storage processor has the ability to recover the contents of the write cache after rebooting from the panic unless the integrity of the write cache image has been compromised, and can determine where the most recent version of the cache image resides on boot up. For example, if it is determined that the other storage processor has the most recent version of the cache image and the other storage processor has not booted yet, the storage processor continues booting without the write cache and loads the write cache image later when the other storage processor has booted.
Fig. 1 is a block diagram of a data storage system 20 which is configured to continue write caching on behalf of a set of external host computers 22(1), 22(2), ... (collectively, external hosts 22) after a storage processor failure. The external hosts 22 connect to the data storage system 20 via a respective communications medium 24(1), 24(2), ... (collectively, communications media 24).
The data storage system 20 includes multiple storage processors 26(A), 26(B) (collectively, storage processors 26), a cache mirror bus 28 and a set of disk drives 30(1), ... 30(N) (collectively, disk drives 30). The storage processors 26 are configured to perform data storage operations (e.g., read operations, write operations, etc.) on behalf of the external hosts 22. The cache mirror bus 28 is configured to convey data between caches of the storage processors 26 thus enabling cache mirroring between the storage processors 26. The set of disk drives 30 enables the data storage system 20 to store and retrieve data on behalf of the external hosts 22 in a fault tolerant, non-volatile manner (e.g., using a RAID scheme). Each storage processor 26 is configured to perform write-back caching in response to write operations 32 from the external hosts 22 while both storage processors 26 are in operation. That is, each storage processor 26 acknowledges completion of a write operation 32 once the write data reaches its local write cache and, if possible, once the write data is mirrored in the local write cache of the other storage processor 26. Additionally, each storage processor 26 is configured to continue to perform such write-back caching after a failure of the other storage processor 26. Such operation enables the data storage system 20 to provide improved response times and quicker recovery in the event of a storage processor failure.
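A minimal, self-contained Python sketch of this write-back policy follows; it is illustrative only (the class names and the "ACK" return value are assumptions), not the storage processors' actual code.

```python
# Write-back sketch: acknowledge a host write once it is staged in the local
# write cache and, when the peer storage processor is reachable, mirrored to it
# (standing in for the cache mirror bus 28). No disk I/O happens before the ack.

class WriteCache:
    def __init__(self):
        self.pages = {}                       # (lun, lba) -> data, all considered dirty

    def put(self, lun, lba, data):
        self.pages[(lun, lba)] = data


class PeerLink:
    """Illustrative stand-in for the peer SP reached over the cache mirror bus."""

    def __init__(self, alive=True):
        self.alive = alive
        self.mirror_cache = WriteCache()

    def mirror(self, lun, lba, data):
        self.mirror_cache.put(lun, lba, data)


def handle_host_write(local_cache, peer, lun, lba, data):
    local_cache.put(lun, lba, data)           # stage in the local write cache
    if peer is not None and peer.alive:
        peer.mirror(lun, lba, data)           # mirror into the peer's write cache
    return "ACK"                              # acknowledged before any destage to disk


if __name__ == "__main__":
    cache, peer = WriteCache(), PeerLink(alive=True)
    print(handle_host_write(cache, peer, lun=0, lba=1024, data=b"payload"))
```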
For example, suppose that the storage processor 26(A) fails for a short period of time (e.g., due to an unanticipated soft reset). The storage processor 26(B) continues operating under a write-back write policy. Such continued write-back operation alleviates the need to vault the write cache of the storage processor 26(B) which would otherwise make the data storage system 20 unavailable for a period of time. Additionally, such continued write-back operation avoids the performance hit associated with subsequently destaging the vaulted write cache contents to magnetic disk as well as running the data storage system in a write-through mode. Moreover, since the storage processor 26(B) continues in write-back mode, the storage processor 26(A) is capable of easily becoming active again (i.e., rejoining in performance of host-based read and write operations) rather than having to wait until vault destaging is complete which could take several hours. Further details will now be provided with reference to Fig. 2.
Fig. 2 is a block diagram of each storage processor 26 of the data storage system 20. Each storage processor 26 includes a communications interface 40, a controller 42 and a memory subsystem 44. The communications interface 40 includes a host interface 46, a cache mirroring interface 48, and a disk interface 50. The memory subsystem 44 includes a control circuit 52, a local write cache 54 and additional memory 58. The additional memory 58 includes operating system storage, firmware for storing BIOS and POST code, optional flash memory, etc.
The communications interface 40 is configured to provide connectivity from the storage processor 26 to various other components. In particular, the host interface 46 is configured to connect the storage processor 26 to one or more external hosts 22 through the connection media 24 (also see Fig. 1). The cache mirroring interface 48 is configured to connect the storage processor 26 (e.g., the storage processor 26(A)) to another storage processor 26 (e.g., the storage processor 26(B)) to enable cache mirroring through the cache mirroring bus 28. The disk interface 50 is configured to connect the storage processor 26 to the set of disk drives 30.
The controller 42 is configured to carry out data storage operations on behalf of one or more of the external hosts 22 through the communications interface 40 (e.g., see the write operations 32 in Fig. 1). In some arrangements, the controller 42 is implemented as a set of processors running an operating system which is capable of being stored in a designated area on one or more of the disk drives 30. In other arrangements, the controller 42 is implemented as logic circuitry (e.g., Application Specific Integrated Circuitry, Field Programmable Gate Arrays, etc.), microprocessors or processor chip sets, analog circuitry, various combinations thereof, and so on.
The memory subsystem 44 is configured to provide memory services to the controller 42. In particular, the control circuit 52 of the memory subsystem 44 is configured to provide persistent write caching using the local write cache 54, i.e., enable the storage processor 26 to continue write caching even after a storage processor failure. The control circuit 52 is further capable of performing other tasks using the additional memory 58 (e.g., operating a read cache, operating as an instruction cache, optionally vaulting contents of the write cache 54 into non-volatile flash memory or disk drive memory in response to a failure of the controller 42, etc.).
In general, the technique has two parts: the write cache handling of the presence or removal of the other ("peer") storage processor, and the persistence of the single storage processor write cache image through a reboot and subsequent recovery of that image. The first part helps keep the performance at an acceptable level when the peer storage processor is not present. The second part, as described below, helps keep the host data safe in case of a software panic of the storage processor that is single-board write caching.
With reference to Fig. 3, when the write cache is enabled and the peer storage processor panics or is removed, it is determined that the peer storage processor is becoming unavailable (step 3010). Once this determination is made, the write cache can transition to single-board write caching (step 3020). A persistent write cache session ID or other persistent tag is updated to indicate that this storage processor has the most recent copy of the write cache image (step 3030). Throughout this process, the write cache stays enabled and accepts host requests (step 3040). Once the write cache has transitioned to single-board write caching, the write cache can stay in that mode until the peer storage processor has returned (step 3050). When the peer storage processor returns, the current image of the write cache is transferred to the peer storage processor (step 3060), and mirrored write caching is re-enabled when the peer storage processor is ready (step 3070). If the peer storage processor comes up while the storage processor that was in single-board write caching is rebooting (step 3080), the returning peer storage processor detects that the other storage processor should have the most recent copy of the write cache data image (step 3090).
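As a rough, hedged illustration of the procedure of Fig. 3 (steps 3010 through 3070), the Python sketch below models the transition into and out of single-board write caching; the persist_tag callback and the session counter are assumptions standing in for the persistent session ID storage (write cache memory and vault drives) described in the text.

```python
# Sketch of the Fig. 3 flow: peer becomes unavailable -> single-board caching with
# an updated persistent session tag; peer returns -> image transfer, then mirrored
# caching is re-enabled. The cache stays enabled and keeps accepting host I/O.

import itertools

MIRRORED, SINGLE_BOARD = "mirrored", "single-board"


class WriteCacheManager:
    _session_counter = itertools.count(1)     # illustrative monotonically increasing tag

    def __init__(self, persist_tag):
        self.mode = MIRRORED
        self.enabled = True
        self.session_id = next(self._session_counter)
        self._persist_tag = persist_tag       # e.g. writes the tag to the vault drives

    def on_peer_unavailable(self):            # step 3010: peer panicked or was removed
        self.mode = SINGLE_BOARD              # step 3020: single-board write caching
        self.session_id = next(self._session_counter)
        self._persist_tag(self.session_id)    # step 3030: this SP now has the latest image
        assert self.enabled                   # step 3040: still accepting host requests

    def on_peer_returned(self, send_image_to_peer, peer_is_ready):
        send_image_to_peer()                  # step 3060: transfer the current cache image
        if peer_is_ready():                   # step 3070: re-enable mirrored write caching
            self.mode = MIRRORED


if __name__ == "__main__":
    mgr = WriteCacheManager(persist_tag=lambda tag: print("persisting session tag", tag))
    mgr.on_peer_unavailable()
    mgr.on_peer_returned(send_image_to_peer=lambda: None, peer_is_ready=lambda: True)
    print(mgr.mode)                           # back to "mirrored"
```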
When the write cache is in single-board write caching mode, the write cache still handles all environmental events that were handled otherwise. This includes power issues (power supplies and standby power supply), fan faults, and vault drive faults unless they have been overridden by the user. After the environmental issue has been corrected, the write cache re-enables unless the user has commanded the write cache to be disabled.
When a storage processor is in single-board write caching mode and that storage processor panics, the host application data will be lost absent additional functionality. In order to help minimize the risk to host application data, the contents of the write cache are retained in memory through a reboot or, optionally, saved to a permanent storage device for retrieval after the storage processor reboots. When the storage processor reboots, it can either retrieve the write cache data from the permanent storage or rebuild the write cache from the image in memory. If the peer storage processor comes up while the storage processor that was in single-board write caching is rebooting, the returning peer storage processor detects that the other storage processor should have the most recent copy of the data, and waits for it to reboot.
If the write cache enters the disabled state due to being disabled by the user or by a system fault and the storage processor is in single-board write caching mode, the write cache performs a vault dump if there are any write cache pages not yet written to disk ("dirty write cache pages") still in the write cache. This is done to help ensure that the write cache data is safe if the storage processor is power cycled or removed or rebooted. This is done at least in part to cover service situations in which the surviving storage processor must be rebooted in order to recover the other storage processor.
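The following short Python sketch illustrates the dump-on-disable rule just described; the dirty-page bookkeeping and the dump_to_vault callback are assumptions, not the product's actual interfaces.

```python
# If the cache is disabled while in single-board mode and dirty pages remain,
# dump the image to the vault drives so a power cycle, removal, or reboot of the
# surviving storage processor cannot lose un-destaged host data.

class CacheState:
    def __init__(self):
        self.enabled = True
        self.dirty = {}                        # (lun, lba) -> data not yet written to disk


def disable_write_cache(cache, single_board, dump_to_vault):
    if single_board and cache.dirty:
        dump_to_vault(dict(cache.dirty))       # vault dump of the dirty write cache pages
    cache.enabled = False


if __name__ == "__main__":
    state = CacheState()
    state.dirty[(0, 42)] = b"un-destaged host data"
    disable_write_cache(state, single_board=True,
                        dump_to_vault=lambda pages: print("vaulting", len(pages), "pages"))
```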
A key aspect of the technique is determining upon rebooting which one, if any, of the copies of the write cache is the latest valid copy, since both storage processors may have cache data available that has survived reboot. If the wrong copy is selected, data corruption may result. For example, the data in a too-old copy of the cache may not accurately reflect a host write request ("I/O") for which a host has already been sent an acknowledgement, or the data in a too-new copy of the cache may reflect a write I/O that is not yet complete according to the host.
In at least some cases, determining the latest valid copy depends on an analysis of a previous failure of one or both storage processors, which may include a software failure or a hardware failure or both. For example, if a storage processor has a software failure and reboots, or has a hardware failure and is not running at all, it is clear that the other storage processor has the latest valid copy.
The session ID tag is used to tag a copy of cache data in connection with a significant event so that later its contents can be validated and/or distinguished from another copy of cache data. For example, when a storage processor recognizes that the other storage processor is experiencing a failure, the storage processor updates its own session ID tag. If the storage processor reboots shortly thereafter, it can read its own session ID tag and the other storage processor's session ID tag, if available, and determine which copy of the cache data is the latest valid copy.
In another example, if a storage processor panics (causing the other storage processor to update its session ID tag) and is in the midst of rebooting when the other storage processor has a sudden hardware failure after having accepted I/O, the storage processor can finish rebooting and determine from the session ID tags that the other storage processor's copy of cache data is the latest valid copy.
If there is no session ID tag available, the storage processor determines that neither cache is a valid last copy. When a tag is updated, the tag is written to vault drives as well, so that after a storage processor has booted it can read the tag from the drives to help determine whether either cache is a valid last copy. In normal operating conditions, both storage processors have the same tag, and cache data is mirrored between the storage processors. In at least one implementation the tag is updated (e.g., incremented) and written to disk drives on any significant event, e.g., when the other storage processor has failed or needs to reboot, when the cache size changes, when the cache page size changes, and/or when the cache is disabled and re-enabled. When the tag is to be updated, I/O is not accepted on board until the new tag is in place.
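A small Python sketch of this tag comparison is given below; treating the tag as a monotonically increasing integer and the three possible sources (local cache memory, peer SP, vault drives) as simple arguments is an assumption made for illustration.

```python
# Boot-time decision: compare the session tags visible from this SP's cache
# memory, the peer SP, and the vault drives; the newest tag identifies the latest
# valid copy. With no tag available, neither cache is treated as a valid last copy.

def latest_valid_copy(local_tag, peer_tag, vault_tag):
    """Return 'local', 'peer', 'vault', or None; each argument is an int tag or None."""
    candidates = {"local": local_tag, "peer": peer_tag, "vault": vault_tag}
    tagged = {name: tag for name, tag in candidates.items() if tag is not None}
    if not tagged:
        return None                        # no session ID tag: no valid last copy
    return max(tagged, key=tagged.get)     # highest (most recently updated) tag wins


assert latest_valid_copy(local_tag=7, peer_tag=6, vault_tag=6) == "local"
assert latest_valid_copy(local_tag=None, peer_tag=None, vault_tag=None) is None
```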
Advantageously, the technique provides improved storage system performance over conventional systems when the system is operating in a degraded mode (e.g., one of the storage processors is inoperative due to a non-disruptive upgrade, or is rebooting, in a panic, or has been removed), by keeping the write cache enabled. In order to reduce the risk of losing write cache data in the event of a storage processor software panic while single storage processor ("single-board") write caching is enabled, the write cache can load a write cache image from either the vault on disk or the other storage processor, or can detect that the write cache memory has been persisted through a reboot. Now described are several common panic/failure scenarios and how they are handled in accordance with the technique.
A single storage processor ("SP") panic or reboot situation involves one of the storage processors getting rebooted or panicking. When one of the storage processors panics or is rebooted, the surviving storage processor detects that the other storage processor has stopped responding and transitions from mirrored write caching to single-board write caching. The surviving SP must also get a new session ID which is persisted outside of the vault space on the magnetic disks. The new session ID is also saved in the write cache memory so that it can be persisted in case the surviving SP should panic. When the peer SP (i.e., other SP) has rebooted and is coming up, it determines whether it or the surviving SP or the vault on disk has the most recent write cache image by comparing the session IDs from each location. If (as expected in this scenario) the surviving SP has the most recent copy, the surviving SP sends the write cache image to the peer SP over a cache mirroring connection (e.g., interface 48). Once this is done, both SPs can transition back to mirrored write caching. A single SP failure case is very similar to the single SP panic case except that the rebooted SP determines that it does not have a valid write cache image and therefore the surviving SP or the vault has the valid write cache image.
A staggered dual SP panic case occurs when one SP panics, and before the first SP reboots the surviving SP panics. Until the surviving SP panics, the behavior is the same as in the single SP panic case. When the surviving SP panics, the valid write cache image is still in that SP's memory. The first SP to boot finds that neither its memory image nor the vault contains the valid write cache image. In this case, that SP keeps the write cache disabled. The second SP to boot (the original surviving SP) finds that it has the valid write cache image persisted in its memory. The second SP communicates to the first SP that it has the valid write cache image and sends it to the first SP. Now the write cache can transition back to mirrored write caching.
A dual simultaneous SP panic can occur when both SPs panic simultaneously or when the second SP panics before detecting that the first SP had already panicked. In this case, both SPs have a valid copy of the write cache image and therefore the same session ID. A distributed lock helps ensure that only one SP boots the data storage system operating system at a time, so the SP that comes up first is declared, by definition, to have the master copy. When the second SP is allowed to boot, the first SP sends over its copy of the write cache image for consistency. Rolling or alternating panics are also handled. Alternating panics are a repeated form of the single SP panic with the surviving SP alternating between the two SPs. Rolling panics are slightly different. In a rolling panic situation, if the SP that panics does not boot up enough to load the write cache image from the surviving SP, this becomes a single SP panic. If the SP that panics boots up enough to load the write cache image from the surviving SP and then panics, this becomes a repeated single SP panic. In all such cases, this type of panic does not cause the loss of data.
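To tie the scenarios above together, here is a hedged Python sketch of how a rebooting SP might act on that determination; the action callbacks and return strings are illustrative assumptions only.

```python
# source is where the latest valid copy was found ('local', 'vault', 'peer', or None),
# e.g. the result of the latest_valid_copy() sketch above.

def on_boot(source, load_from_memory, load_from_vault, peer_is_up):
    if source == "local":
        load_from_memory()        # staggered/dual panic: this SP's persisted image is newest
        return "enable write cache"
    if source == "vault":
        load_from_vault()         # the image was dumped to the vault drives before reboot
        return "enable write cache"
    if source == "peer":
        if peer_is_up():
            return "request write cache image from peer over the mirror link"
        return "boot without the write cache; load the image once the peer is up"
    return "keep the write cache disabled"   # no valid last copy anywhere


if __name__ == "__main__":
    print(on_boot("peer", lambda: None, lambda: None, peer_is_up=lambda: False))
```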
When an AC power failure or over temperature or multi-fan fault is detected, the write cache is notified and the write cache dumps the current write cache image to the vault drives. The SP with the current write cache image performs the dump even if both SPs are functional.
Non-SP hardware faults include power supply faults, SPS faults, single fan faults and single or multiple vault drive faults. In accordance with the technique, the write cache ignores these faults when the write cache availability enhancement is enabled.
While this invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

CLAIMS
What is claimed is:
1. A method for use in managing write caching, the method comprising: performing write-back caching operations using a cache of a storage processor; and after a failure of the storage processor, determining whether the cache includes a latest valid copy of cache data.
2. The method of claim 1, further comprising: persisting the cache data through the failure of the storage processor.
3. The method of claim 1, further comprising: determining whether another cache of another storage processor includes a latest valid copy of cache data.
4. The method of claim 1, further comprising: using a session ID in the determination of whether the cache includes a latest valid copy of cache data.
5. A system for use in managing write caching, the system comprising: first logic performing write-back caching operations using a cache of a storage processor; and second logic determining, after a failure of the storage processor, whether the cache includes a latest valid copy of cache data.
6. The system of claim 5, further comprising: third logic persisting the cache data through the failure of the storage processor.
7. The system of claim 5, further comprising: third logic determining whether another cache of another storage processor includes a latest valid copy of cache data.
8. The system of claim 5, further comprising: third logic using a session ID in the determination of whether the cache includes a latest valid copy of cache data.
PCT/US2008/057631 2007-03-29 2008-03-20 Managing write caching WO2008121573A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US72970507A 2007-03-29 2007-03-29
US11/729,705 2007-03-29

Publications (1)

Publication Number Publication Date
WO2008121573A1 true WO2008121573A1 (en) 2008-10-09

Family

ID=39529712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/057631 WO2008121573A1 (en) 2007-03-29 2008-03-20 Managing write caching

Country Status (1)

Country Link
WO (1) WO2008121573A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1162538A2 (en) * 1998-02-13 2001-12-12 Oracle Corporation Transferring a resource from a first cache to a second cache
US6539495B1 (en) * 1999-02-22 2003-03-25 International Business Machines Corporation Method, system and program products for providing user-managed duplexing of coupling facility cache structures
US20040117580A1 (en) * 2002-12-13 2004-06-17 Wu Chia Y. System and method for efficiently and reliably performing write cache mirroring
US20050005188A1 (en) * 2003-06-20 2005-01-06 International Business Machines Corporation Preserving cache data against cluster reboot

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150082081A1 (en) * 2013-09-16 2015-03-19 International Business Machines Corporation Write cache protection in a purpose built backup appliance
US9507671B2 (en) * 2013-09-16 2016-11-29 Globalfoundries Inc. Write cache protection in a purpose built backup appliance
CN112162940A (en) * 2020-09-11 2021-01-01 北京浪潮数据技术有限公司 Method, device and system for reducing cache fault domain and storage system

Similar Documents

Publication Publication Date Title
US7849350B2 (en) Responding to a storage processor failure with continued write caching
US9081691B1 (en) Techniques for caching data using a volatile memory cache and solid state drive
US7975169B2 (en) Memory preserved cache to prevent data loss
US7793061B1 (en) Techniques for using flash-based memory as a write cache and a vault
US7380055B2 (en) Apparatus and method in a cached raid controller utilizing a solid state backup device for improving data availability time
EP1705574A2 (en) Non-volatile backup for data cache
US6978398B2 (en) Method and system for proactively reducing the outage time of a computer system
US7370248B2 (en) In-service raid mirror reconfiguring
US7051174B2 (en) Method, system, and program for restoring data in cache
US6513097B1 (en) Method and system for maintaining information about modified data in cache in a storage system for use during a system failure
US7761734B2 (en) Automated firmware restoration to a peer programmable hardware device
US7895465B2 (en) Memory preserved cache failsafe reboot mechanism
US7650467B2 (en) Coordination of multiprocessor operations with shared resources
US9507671B2 (en) Write cache protection in a purpose built backup appliance
US20150012699A1 (en) System and method of versioning cache for a clustering topology
US5623625A (en) Computer network server backup with posted write cache disk controllers
US20080162915A1 (en) Self-healing computing system
JPH10105467A (en) Method and device for keeping consistency of cache in raid controller provided with redundant cache
US20040181639A1 (en) Method, system, and program for establishing and using a point-in-time copy relationship
US20030005354A1 (en) System and method for servicing requests to a storage array
US11221927B2 (en) Method for the implementation of a high performance, high resiliency and high availability dual controller storage system
US20040153738A1 (en) Redundancy management method for BIOS, data processing apparatus and storage system for using same
US10001826B2 (en) Power management mechanism for data storage environment
WO2008121573A1 (en) Managing write caching
US20150019822A1 (en) System for Maintaining Dirty Cache Coherency Across Reboot of a Node

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08744120

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08744120

Country of ref document: EP

Kind code of ref document: A1