WO2008121573A1 - Managing write caching - Google Patents

Managing write caching

Info

Publication number
WO2008121573A1
Authority
WO
WIPO (PCT)
Prior art keywords
write
storage processor
cache
data
storage
Application number
PCT/US2008/057631
Other languages
French (fr)
Inventor
David R. French
H. Austin Spang IV
Original Assignee
EMC Corporation
Application filed by EMC Corporation
Publication of WO2008121573A1


Classifications

    • G06F11/2089: Redundant storage control functionality
    • G06F11/2092: Techniques of failing over between control units
    • G06F11/1666: Error detection or correction of the data by redundancy in hardware where the redundant component is memory or memory area
    • G06F11/1441: Resetting or repowering (saving, restoring, recovering or retrying at system level)
    • G06F11/2056: Error detection or correction of the data by redundancy in hardware using active fault-masking, where persistent mass storage functionality or persistent mass storage control functionality is redundant, by mirroring


Abstract

Write caching is managed. Write-back caching operations are performed using a cache of a storage processor. After a failure of the storage processor, it is determined whether the cache includes a latest valid copy of cache data.

Description

MANAGING WRITE CACHING
BACKGROUND
Data storage systems are used within computer networks and systems to store large amounts of data, e.g., data that is used by multiple servers and client computers ("hosts"). Generally, one or more servers are connected to the storage system, e.g., to supply data to and from a computer network. The data may be transferred through the network to various users or clients. The data storage system generally comprises a controller ("storage processor") that interacts with one or more storage devices such as one or more magnetic disk drives or other forms of data storage. To facilitate uninterrupted operation of the server as it reads and writes data from and to the storage system, as well as executes application programs for use by users, the storage system comprises a write cache that allows data from the server to be temporarily stored in the write cache prior to being written to a storage device. As such, the server can send data to the storage system and quickly be provided an acknowledgement that the storage system has stored the data. The acknowledgement is sent even though the storage system has only stored the data in the write cache and is waiting for an appropriate, convenient time to store the data in a storage device. As is well known in the art, storing data to a write cache is much faster than storing data directly to a disk drive. Consequently, the write cache buffers a large amount of data in anticipation of subsequent storing of that data in a storage device.
One conventional data storage system includes two storage processors and an array of disk drives. Each storage processor includes, among other things, a local write cache. The local write caches mirror each other.
During operation, the storage processors perform read and write operations on behalf of one or more external hosts. Since the contents of the local write caches are mirrored, the storage processors initially attend to write operations in a write-back manner. That is, the write policy employed by the storage processors involves acknowledging host write operations once the write data is stored in both write caches. By the time the external hosts receive such acknowledgement, the storage processors may not have yet evicted the write data from the write caches to the array of disk drives.
If one of the storage processors fails during operation of the data storage system (e.g., a hardware failure, a software failure, a loss of power to one of the storage processors, etc.), the remaining storage processor vaults the contents of its local write cache to one or more magnetic disk drives, and then disables its local write cache. The remaining storage processor then destages the vaulted write cache contents (which are now stored on the magnetic disk drive) to the array of disk drives, i.e., the remaining storage processor empties the vaulted write cache contents by storing the vaulted write data to the array of disk drives.
It should be understood that the remaining storage processor is capable of performing host read and write operations while the remaining storage processor destages the vaulted write data contents and after such destaging is complete. For example, the remaining storage processor now carries out write operations in a write-through manner where the remaining storage processor stores new write data from an external host to the array of disk drives before acknowledging that the write operation is complete.
It should be further understood that the remaining storage processor vaults the contents of its write cache to the magnetic disk drive and disables its write cache so that a second failure will not result in loss of the cached write data. For example, suppose that the remaining storage processor subsequently encounters a software failure after vaulting the write cache to the magnetic disk drive. When the remaining storage processor recovers from the software failure (i.e., performs a soft reset), the remaining storage processor overwrites its local write cache. In particular, Basic Input/Output System (BIOS) firmware directs the remaining storage processor to clear and test its local write cache. Additionally, the remaining storage processor uses at least a portion of its local write cache for temporarily holding Power-On Self Test (POST) code for running a Power-On Self Test. Although the contents of the local write cache have been overwritten, no write data is lost since the previously-cached write data was immediately vaulted to the magnetic disk drive and since all subsequently received write data is processed in a write-through manner.
SUMMARY
Write caching is managed. Write-back caching operations are performed using a cache of a storage processor. After a failure of the storage processor, it is determined whether the cache includes a latest valid copy of cache data. One or more implementations of the invention may provide one or more of the following advantages.
Data corruption can be avoided when write cache data is persisted through a reset or reboot. Host data can be protected when one or more storage processors experience failures.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a block diagram of a data storage system which is configured to continue write caching after a storage processor failure.
Fig. 2 is a block diagram of a storage processor of the data storage system of Fig. 1.
Fig. 3 is a flow diagram of a procedure for use in the data storage system.
DETAILED DESCRIPTION
Conventional vaulting involves transitioning from a write-caching mode when two storage processors are available to a write-through mode when one of the storage processors fails but the other storage processor remains operational. For example, the remaining storage processor requires time to vault contents of its local write cache to a magnetic disk drive. (With respect to magnetic disk drives referenced herein, it is to be understood that other persistent storage such as solid state disks, flash drives, and/or optical disks may be used as well or instead.) During this time, the remaining storage processor and thus the data storage system as a whole is completely or largely unavailable to attend to further write or read operations from external hosts. Accordingly, a write or read operation submitted by an external host during this write cache vaulting event is likely to time out.
Additionally, the process of destaging the vaulted write cache contents from the magnetic disk drive to the array of disk drives may take a considerable amount of time (e.g., several hours). During this time, there is a significant time latency associated with processing a new write operation. In particular, if the remaining storage processor receives new write data from an external host computer, the remaining storage processor first accesses the vaulted write cache contents from the magnetic disk drive to see if the new write operation pertains to any vaulted write data from an earlier write operation in order to maintain data integrity. If such vaulted write data exists and if the vaulted write data must be preserved in the event that the new write is aborted, the remaining storage processor completes the earlier write operation, i.e., writes the vaulted write data to the array of disk drives. Then, the remaining storage processor processes the new write operation in a write-through manner.
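By way of illustration only, the following minimal Python sketch captures the conventional overlap check described above; the names (VaultIndex, flush_vaulted_range, write_to_array) are assumptions made for this sketch and are not part of the application.

```python
# Hypothetical sketch of the conventional behavior: before a new host write is
# processed in write-through mode, any vaulted (not yet destaged) data that
# overlaps the write is flushed to the array of disk drives.

class VaultIndex:
    """Tracks which logical block ranges still hold vaulted, un-destaged data."""

    def __init__(self):
        self.ranges = []  # list of (lun, start_lba, length) entries still in the vault

    def overlapping(self, lun, start_lba, length):
        end = start_lba + length
        return [r for r in self.ranges
                if r[0] == lun and r[1] < end and (r[1] + r[2]) > start_lba]


def handle_write_during_destage(vault, lun, start_lba, data,
                                flush_vaulted_range, write_to_array):
    """Complete overlapping earlier (vaulted) writes, then write through to the array."""
    for entry in vault.overlapping(lun, start_lba, len(data)):
        flush_vaulted_range(entry)          # finish the earlier write operation
        vault.ranges.remove(entry)
    write_to_array(lun, start_lba, data)    # write-through: data reaches the array first
    return "ACK"                            # only now is the host acknowledged
```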
Furthermore, after the remaining storage processor has finished destaging the vaulted write cache contents from the magnetic disk drive (or alternatively from a mini array of drives holding the vault) to the array of disk drives, the remaining storage processor continues to operate in write-through mode, thus passing a relatively large performance hit on to the external hosts. That is, the remaining storage processor stores write data onto the array of disk drives prior to acknowledging completion of write operations. Such operation results in significant latency compared to response times for processing write data in write-back mode.
Moreover, there may be instances where the failed storage processor encounters only a minor failure (e.g., a software anomaly resulting in a soft reset). In such a situation, the failed storage processor is able to quickly recover from the failure, e.g., a couple of minutes to reboot and perform self-tests. Nevertheless, to prevent the recovered storage processor from interfering with the vault destaging process, the recovered storage processor cannot rejoin the data storage system until the remaining storage processor has completed destaging the vaulted write cache contents to the array of disk drives.
Accordingly, the recovered storage processor must remain sidelined during the vault destaging process which may take several hours to complete.
In contrast to conventional vaulting which involves transitioning from a write-caching mode when two storage processors are available to a write-through mode when one of the storage processors fails but the other storage processor remains, an enhanced technique for responding to a storage processor failure involves continuing to perform write-back caching operations while the failed storage processor remains unavailable. Such a technique alleviates the need for the remaining storage processor to vault cached write data to a magnetic disk and then destage the vaulted write data in response to the failed storage processor. Furthermore, such a technique provides better response time on new host write operations than write-through caching which is performed following the failure in conventional vaulting. Aspects of the technique are described in U.S. patent application serial no. 11/529,124, filed September 28, 2006, entitled RESPONDING TO A STORAGE PROCESSOR FAILURE WITH CONTINUED WRITE CACHING, attorney docket no. EMC-06-329, which is assigned to the same assignee as the present invention, and which is hereby incorporated herein by reference.
In another aspect of the technique as described below, in the event that a storage processor that is performing non-mirrored write caching has a sudden catastrophic failure ("panics"), the storage processor has the ability to recover the contents of the write cache after rebooting from the panic unless the integrity of the write cache image has been compromised, and can determine where the most recent version of the cache image resides on boot up. For example, if it is determined that the other storage processor has the most recent version of the cache image and the other storage processor has not booted yet, the storage processor continues booting without the write cache and loads the write cache image later when the other storage processor has booted.
Fig. 1 is a block diagram of a data storage system 20 which is configured to continue write caching on behalf of a set of external host computers 22(1), 22(2), ... (collectively, external hosts 22) after a storage processor failure. The external hosts 22 connect to the data storage system 20 via a respective communications medium 24(1), 24(2), ... (collectively, communications media 24).
The data storage system 20 includes multiple storage processors 26(A), 26(B) (collectively, storage processors 26), a cache mirror bus 28 and a set of disk drives 30(1), ... 30(N) (collectively, disk drives 30). The storage processors 26 are configured to perform data storage operations (e.g., read operations, write operations, etc.) on behalf of the external hosts 22. The cache mirror bus 28 is configured to convey data between caches of the storage processors 26 thus enabling cache mirroring between the storage processors 26. The set of disk drives 30 enables the data storage system 20 to store and retrieve data on behalf of the external hosts 22 in a fault tolerant, non-volatile manner (e.g., using a RAID scheme). Each storage processor 26 is configured to perform write-back caching in response to write operations 32 from the external hosts 22 while both storage processors 26 are in operation. That is, each storage processor 26 acknowledges completion of a write operation 32 once the write data reaches its local write cache and, if possible, once the write data is mirrored in the local write cache of the other storage processor 26. Additionally, each storage processor 26 is configured to continue to perform such write-back caching after a failure of the other storage processor 26. Such operation enables the data storage system 20 to provide improved response times and quicker recovery in the event of a storage processor failure.
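A minimal, self-contained Python sketch of this write-back policy follows; it is illustrative only (the class names and the "ACK" return value are assumptions), not the storage processors' actual code.

```python
# Write-back sketch: acknowledge a host write once it is staged in the local
# write cache and, when the peer storage processor is reachable, mirrored to it
# (standing in for the cache mirror bus 28). No disk I/O happens before the ack.

class WriteCache:
    def __init__(self):
        self.pages = {}                       # (lun, lba) -> data, all considered dirty

    def put(self, lun, lba, data):
        self.pages[(lun, lba)] = data


class PeerLink:
    """Illustrative stand-in for the peer SP reached over the cache mirror bus."""

    def __init__(self, alive=True):
        self.alive = alive
        self.mirror_cache = WriteCache()

    def mirror(self, lun, lba, data):
        self.mirror_cache.put(lun, lba, data)


def handle_host_write(local_cache, peer, lun, lba, data):
    local_cache.put(lun, lba, data)           # stage in the local write cache
    if peer is not None and peer.alive:
        peer.mirror(lun, lba, data)           # mirror into the peer's write cache
    return "ACK"                              # acknowledged before any destage to disk


if __name__ == "__main__":
    cache, peer = WriteCache(), PeerLink(alive=True)
    print(handle_host_write(cache, peer, lun=0, lba=1024, data=b"payload"))
```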
For example, suppose that the storage processor 26(A) fails for a short period of time (e.g., due to an unanticipated soft reset). The storage processor 26(B) continues operating under a write-back write policy. Such continued write-back operation alleviates the need to vault the write cache of the storage processor 26(B) which would otherwise make the data storage system 20 unavailable for a period of time. Additionally, such continued write-back operation avoids the performance hit associated with subsequently destaging the vaulted write cache contents to magnetic disk as well as running the data storage system in a write-through mode. Moreover, since the storage processor 26(B) continues in write-back mode, the storage processor 26(A) is capable of easily becoming active again (i.e., rejoining in performance of host-based read and write operations) rather than having to wait until vault destaging is complete which could take several hours. Further details will now be provided with reference to Fig. 2.
Fig. 2 is a block diagram of each storage processor 26 of the data storage system 20. Each storage processor 26 includes a communications interface 40, a controller 42 and a memory subsystem 44. The communications interface 40 includes a host interface 46, a cache mirroring interface 48, and a disk interface 50. The memory subsystem 44 includes a control circuit 52, a local write cache 54 and additional memory 58. The additional memory 58 includes operating system storage, firmware for storing BIOS and POST code, optional flash memory, etc.
The communications interface 40 is configured to provide connectivity from the storage processor 26 to various other components. In particular, the host interface 46 is configured to connect the storage processor 26 to one or more external hosts 22 through the connection media 24 (also see Fig. 1). The cache mirroring interface 48 is configured to connect the storage processor 26 (e.g., the storage processor 26(A)) to another storage processor 26 (e.g., the storage processor 26(B)) to enable cache mirroring through the cache mirroring bus 28. The disk interface 50 is configured to connect the storage processor 26 to the set of disk drives 30.
The controller 42 is configured to carry out data storage operations on behalf of one or more of the external hosts 22 through the communications interface 40 (e.g., see the write operations 32 in Fig. 1). In some arrangements, the controller 42 is implemented as a set of processors running an operating system which is capable of being stored in a designated area on one or more of the disk drives 30. In other arrangements, the controller 42 is implemented as logic circuitry (e.g., Application Specific Integrated Circuitry, Field Programmable Gate Arrays, etc.), microprocessors or processor chip sets, analog circuitry, various combinations thereof, and so on.
The memory subsystem 44 is configured to provide memory services to the controller 42. In particular, the control circuit 52 of the memory subsystem 44 is configured to provide persistent write caching using the local write cache 54, i.e., enable the storage processor 26 to continue write caching even after a storage processor failure. The control circuit 52 is further capable of performing other tasks using the additional memory 58 (e.g., operating a read cache, operating as an instruction cache, optionally vaulting contents of the write cache 54 into non-volatile flash memory or disk drive memory in response to a failure of the controller 42, etc.).
In general, the technique has two parts: the write cache handling of the presence or removal of the other ("peer") storage processor, and the persistence of the single storage processor write cache image through a reboot and subsequent recovery of that image. The first part helps keep the performance at an acceptable level when the peer storage processor is not present. The second part, as described below, helps keep the host data safe in case of a software panic of the storage processor that is single-board write caching.
With reference to Fig. 3, when the write cache is enabled and the peer storage processor panics or is removed, it is determined that the peer storage processor is becoming unavailable (step 3010). Once this determination is made, the write cache can transition to single-board write caching (step 3020). A persistent write cache session ID or other persistent tag is updated to indicate that this storage processor has the most recent copy of the write cache image (step 3030). Throughout this process, the write cache stays enabled and accepts host requests (step 3040). Once the write cache has transitioned to single-board write caching, the write cache can stay in that mode until the peer storage processor has returned (step 3050). When the peer storage processor returns, the current image of the write cache is transferred to the peer storage processor (step 3060), and mirrored write caching is re-enabled when the peer storage processor is ready (step 3070). If the peer storage processor comes up while the storage processor that was in single-board write caching is rebooting (step 3080), the returning peer storage processor detects that the other storage processor should have the most recent copy of the write cache data image (step 3090).
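As a rough, hedged illustration of the procedure of Fig. 3 (steps 3010 through 3070), the Python sketch below models the transition into and out of single-board write caching; the persist_tag callback and the session counter are assumptions standing in for the persistent session ID storage (write cache memory and vault drives) described in the text.

```python
# Sketch of the Fig. 3 flow: peer becomes unavailable -> single-board caching with
# an updated persistent session tag; peer returns -> image transfer, then mirrored
# caching is re-enabled. The cache stays enabled and keeps accepting host I/O.

import itertools

MIRRORED, SINGLE_BOARD = "mirrored", "single-board"


class WriteCacheManager:
    _session_counter = itertools.count(1)     # illustrative monotonically increasing tag

    def __init__(self, persist_tag):
        self.mode = MIRRORED
        self.enabled = True
        self.session_id = next(self._session_counter)
        self._persist_tag = persist_tag       # e.g. writes the tag to the vault drives

    def on_peer_unavailable(self):            # step 3010: peer panicked or was removed
        self.mode = SINGLE_BOARD              # step 3020: single-board write caching
        self.session_id = next(self._session_counter)
        self._persist_tag(self.session_id)    # step 3030: this SP now has the latest image
        assert self.enabled                   # step 3040: still accepting host requests

    def on_peer_returned(self, send_image_to_peer, peer_is_ready):
        send_image_to_peer()                  # step 3060: transfer the current cache image
        if peer_is_ready():                   # step 3070: re-enable mirrored write caching
            self.mode = MIRRORED


if __name__ == "__main__":
    mgr = WriteCacheManager(persist_tag=lambda tag: print("persisting session tag", tag))
    mgr.on_peer_unavailable()
    mgr.on_peer_returned(send_image_to_peer=lambda: None, peer_is_ready=lambda: True)
    print(mgr.mode)                           # back to "mirrored"
```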
When the write cache is in single-board write caching mode, the write cache still handles all environmental events that were handled otherwise. This includes power issues (power supplies and standby power supply), fan faults, and vault drive faults unless they have been overridden by the user. After the environmental issue has been corrected, the write cache re-enables unless the user has commanded the write cache to be disabled.
When a storage processor is in single-board write caching mode and that storage processor panics, the host application data will be lost absent additional functionality. In order to help minimize the risk to host application data, the contents of the write cache are retained in memory through a reboot or, optionally, saved to a permanent storage device for retrieval after the storage processor reboots. When the storage processor reboots, it can either retrieve the write cache data from the permanent storage or rebuild the write cache from the image in memory. If the peer storage processor comes up while the storage processor that was in single-board write caching is rebooting, the returning peer storage processor detects that the other storage processor should have the most recent copy of the data, and waits for it to reboot.
If the write cache enters the disabled state due to being disabled by the user or by a system fault and the storage processor is in single-board write caching mode, the write cache performs a vault dump if there are any write cache pages not yet written to disk ("dirty write cache pages") still in the write cache. This is done to help ensure that the write cache data is safe if the storage processor is power cycled or removed or rebooted. This is done at least in part to cover service situations in which the surviving storage processor must be rebooted in order to recover the other storage processor.
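The following short Python sketch illustrates the dump-on-disable rule just described; the dirty-page bookkeeping and the dump_to_vault callback are assumptions, not the product's actual interfaces.

```python
# If the cache is disabled while in single-board mode and dirty pages remain,
# dump the image to the vault drives so a power cycle, removal, or reboot of the
# surviving storage processor cannot lose un-destaged host data.

class CacheState:
    def __init__(self):
        self.enabled = True
        self.dirty = {}                        # (lun, lba) -> data not yet written to disk


def disable_write_cache(cache, single_board, dump_to_vault):
    if single_board and cache.dirty:
        dump_to_vault(dict(cache.dirty))       # vault dump of the dirty write cache pages
    cache.enabled = False


if __name__ == "__main__":
    state = CacheState()
    state.dirty[(0, 42)] = b"un-destaged host data"
    disable_write_cache(state, single_board=True,
                        dump_to_vault=lambda pages: print("vaulting", len(pages), "pages"))
```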
A key aspect of the technique is determining upon rebooting which one, if any, of the copies of the write cache is the latest valid copy, since both storage processors may have cache data available that has survived reboot. If the wrong copy is selected, data corruption may result. For example, the data in a too-old copy of the cache may not accurately reflect a host write request ("I/O") for which a host has already been sent an acknowledgement, or the data in a too-new copy of the cache may reflect a write I/O that is not yet complete according to the host.
In at least some cases, determining the latest valid copy depends on an analysis of a previous failure of one or both storage processors, which may include a software failure or a hardware failure or both. For example, if a storage processor has a software failure and reboots, or has a hardware failure and is not running at all, it is clear that the other storage processor has the latest valid copy.
The session ID tag is used to tag a copy of cache data in connection with a significant event so that later its contents can be validated and/or distinguished from another copy of cache data. For example, when a storage processor recognizes that the other storage processor is experiencing a failure, the storage processor updates its own session ID tag. If the storage processor reboots shortly thereafter, it can read its own session ID tag and the other storage processor's session ID tag, if available, and determine which copy of the cache data is the latest valid copy.
In another example, if a storage processor panics (causing the other storage processor to update its session ID tag) and is in the midst of rebooting when the other storage processor has a sudden hardware failure after having accepted I/O, the storage processor can finish rebooting and determine from the session ID tags that the other storage processor's copy of cache data is the latest valid copy.
If there is no session ID tag available, the storage processor determines that neither cache is a valid last copy. When a tag is updated, the tag is written to vault drives as well, so that after a storage processor has booted it can read the tag from the drives to help determine whether either cache is a valid last copy. In normal operating conditions, both storage processors have the same tag, and cache data is mirrored between the storage processors. In at least one implementation the tag is updated (e.g., incremented) and written to disk drives on any significant event, e.g., when the other storage processor has failed or needs to reboot, when the cache size changes, when the cache page size changes, and/or when the cache is disabled and re-enabled. When the tag is to be updated, I/O is not accepted on board until the new tag is in place.
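A small Python sketch of this tag comparison is given below; treating the tag as a monotonically increasing integer and the three possible sources (local cache memory, peer SP, vault drives) as simple arguments is an assumption made for illustration.

```python
# Boot-time decision: compare the session tags visible from this SP's cache
# memory, the peer SP, and the vault drives; the newest tag identifies the latest
# valid copy. With no tag available, neither cache is treated as a valid last copy.

def latest_valid_copy(local_tag, peer_tag, vault_tag):
    """Return 'local', 'peer', 'vault', or None; each argument is an int tag or None."""
    candidates = {"local": local_tag, "peer": peer_tag, "vault": vault_tag}
    tagged = {name: tag for name, tag in candidates.items() if tag is not None}
    if not tagged:
        return None                        # no session ID tag: no valid last copy
    return max(tagged, key=tagged.get)     # highest (most recently updated) tag wins


assert latest_valid_copy(local_tag=7, peer_tag=6, vault_tag=6) == "local"
assert latest_valid_copy(local_tag=None, peer_tag=None, vault_tag=None) is None
```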
Advantageously, the technique provides improved storage system performance over conventional systems when the system is operating in a degraded mode (e.g., one of the storage processors is inoperative due to a non-disruptive upgrade, or is rebooting, in a panic, or has been removed), by keeping the write cache enabled. In order to reduce the risk of losing write cache data in the event of a storage processor software panic while single storage processor ("single-board") write caching is enabled, the write cache can load a write cache image from either the vault on disk or the other storage processor, or can detect that the write cache memory has been persisted through a reboot. Now described are several common panic/failure scenarios and how they are handled in accordance with the technique.
A single storage processor ("SP") panic or reboot situation involves one of the storage processors getting rebooted or panicking. When one of the storage processors panics or is rebooted, the surviving storage processor detects that the other storage processor has stopped responding and transitions from mirrored write caching to single-board write caching. The surviving SP must also get a new session ID which is persisted outside of the vault space on the magnetic disks. The new session ID is also saved in the write cache memory so that it can be persisted in case the surviving SP should panic. When the peer SP (i.e., other SP) has rebooted and is coming up, it determines whether it or the surviving SP or the vault on disk has the most recent write cache image by comparing the session IDs from each location. If (as expected in this scenario) the surviving SP has the most recent copy, the surviving SP sends the write cache image to the peer SP over a cache mirroring connection (e.g., interface 48). Once this is done, both SPs can transition back to mirrored write caching. A single SP failure case is very similar to the single SP panic case except that the rebooted SP determines that it does not have a valid write cache image and therefore the surviving SP or the vault has the valid write cache image.
A staggered dual SP panic case occurs when one SP panics, and before the first SP reboots the surviving SP panics. Until the surviving SP panics, the behavior is the same as in the single SP panic case. When the surviving SP panics, the valid write cache image is still in that SP's memory. The first SP to boot finds that neither its memory image nor the vault contains the valid write cache image. In this case, that SP keeps the write cache disabled. The second SP to boot (the original surviving SP) finds that it has the valid write cache image persisted in its memory. The second SP communicates to the first SP that it has the valid write cache image and sends it to the first SP. Now the write cache can transition back to mirrored write caching.
A dual simultaneous SP panic can occur when both SPs panic simultaneously or when the second SP panics before detecting that the first SP had already panicked. In this case, both SPs have a valid copy of the write cache image and therefore the same session ID. A distributed lock helps ensure that only one SP boots the data storage system operating system at a time, so the SP that comes up first is declared, by definition, to have the master copy. When the second SP is allowed to boot, the first SP sends over its copy of the write cache image for consistency. Rolling or alternating panics are also handled. Alternating panics are a repeated form of the single SP panic with the surviving SP alternating between the two SPs. Rolling panics are slightly different. In a rolling panic situation, if the SP that panics does not boot up enough to load the write cache image from the surviving SP, this becomes a single SP panic. If the SP that panics boots up enough to load the write cache image from the surviving SP and then panics, this becomes a repeated single SP panic. In all such cases, this type of panic does not cause the loss of data.
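To tie the scenarios above together, here is a hedged Python sketch of how a rebooting SP might act on that determination; the action callbacks and return strings are illustrative assumptions only.

```python
# source is where the latest valid copy was found ('local', 'vault', 'peer', or None),
# e.g. the result of the latest_valid_copy() sketch above.

def on_boot(source, load_from_memory, load_from_vault, peer_is_up):
    if source == "local":
        load_from_memory()        # staggered/dual panic: this SP's persisted image is newest
        return "enable write cache"
    if source == "vault":
        load_from_vault()         # the image was dumped to the vault drives before reboot
        return "enable write cache"
    if source == "peer":
        if peer_is_up():
            return "request write cache image from peer over the mirror link"
        return "boot without the write cache; load the image once the peer is up"
    return "keep the write cache disabled"   # no valid last copy anywhere


if __name__ == "__main__":
    print(on_boot("peer", lambda: None, lambda: None, peer_is_up=lambda: False))
```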
When an AC power failure or over temperature or multi-fan fault is detected, the write cache is notified and the write cache dumps the current write cache image to the vault drives. The SP with the current write cache image performs the dump even if both SPs are functional.
Non-SP hardware faults include power supply faults, SPS faults, single fan faults and single or multiple vault drive faults. In accordance with the technique, the write cache ignores these faults when the write cache availability enhancement is enabled.
While this invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

CLAIMS
What is claimed is:
1. A method for use in managing write caching, the method comprising: performing write-back caching operations using a cache of a storage processor; and after a failure of the storage processor, determining whether the cache includes a latest valid copy of cache data.
2. The method of claim 1, further comprising: persisting the cache data through the failure of the storage processor.
3. The method of claim 1, further comprising: determining whether another cache of another storage processor includes a latest valid copy of cache data.
4. The method of claim 1, further comprising: using a session ID in the determination of whether the cache includes a latest valid copy of cache data.
5. A system for use in managing write caching, the system comprising: first logic performing write-back caching operations using a cache of a storage processor; and second logic determining, after a failure of the storage processor, whether the cache includes a latest valid copy of cache data.
6. The system of claim 5, further comprising: third logic persisting the cache data through the failure of the storage processor.
7. The system of claim 5, further comprising: third logic determining whether another cache of another storage processor includes a latest valid copy of cache data.
8. The system of claim 5, further comprising: third logic using a session ID in the determination of whether the cache includes a latest valid copy of cache data.
PCT/US2008/057631 2007-03-29 2008-03-20 Managing write caching WO2008121573A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US72970507A 2007-03-29 2007-03-29
US11/729,705 2007-03-29

Publications (1)

Publication Number Publication Date
WO2008121573A1 true WO2008121573A1 (en) 2008-10-09

Family

ID=39529712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/057631 WO2008121573A1 (en) 2007-03-29 2008-03-20 Managing write caching

Country Status (1)

Country Link
WO (1) WO2008121573A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1162538A2 (en) * 1998-02-13 2001-12-12 Oracle Corporation Transferring a resource from a first cache to a second cache
US6539495B1 (en) * 1999-02-22 2003-03-25 International Business Machines Corporation Method, system and program products for providing user-managed duplexing of coupling facility cache structures
US20040117580A1 (en) * 2002-12-13 2004-06-17 Wu Chia Y. System and method for efficiently and reliably performing write cache mirroring
US20050005188A1 (en) * 2003-06-20 2005-01-06 International Business Machines Corporation Preserving cache data against cluster reboot

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150082081A1 (en) * 2013-09-16 2015-03-19 International Business Machines Corporation Write cache protection in a purpose built backup appliance
US9507671B2 (en) * 2013-09-16 2016-11-29 Globalfoundries Inc. Write cache protection in a purpose built backup appliance
CN112162940A (en) * 2020-09-11 2021-01-01 北京浪潮数据技术有限公司 Method, device and system for reducing cache fault domain and storage system

Similar Documents

Publication Publication Date Title
US7849350B2 (en) Responding to a storage processor failure with continued write caching
US9081691B1 (en) Techniques for caching data using a volatile memory cache and solid state drive
US7975169B2 (en) Memory preserved cache to prevent data loss
US7793061B1 (en) Techniques for using flash-based memory as a write cache and a vault
US7380055B2 (en) Apparatus and method in a cached raid controller utilizing a solid state backup device for improving data availability time
EP1705574A2 (en) Non-volatile backup for data cache
US6978398B2 (en) Method and system for proactively reducing the outage time of a computer system
US7370248B2 (en) In-service raid mirror reconfiguring
US7051174B2 (en) Method, system, and program for restoring data in cache
US6513097B1 (en) Method and system for maintaining information about modified data in cache in a storage system for use during a system failure
US7761734B2 (en) Automated firmware restoration to a peer programmable hardware device
US7895465B2 (en) Memory preserved cache failsafe reboot mechanism
US7650467B2 (en) Coordination of multiprocessor operations with shared resources
US9507671B2 (en) Write cache protection in a purpose built backup appliance
US20150012699A1 (en) System and method of versioning cache for a clustering topology
US5623625A (en) Computer network server backup with posted write cache disk controllers
US20080162915A1 (en) Self-healing computing system
JPH10105467A (en) Method and device for keeping consistency of cache in raid controller provided with redundant cache
US20040181639A1 (en) Method, system, and program for establishing and using a point-in-time copy relationship
US20030005354A1 (en) System and method for servicing requests to a storage array
US11221927B2 (en) Method for the implementation of a high performance, high resiliency and high availability dual controller storage system
US20040153738A1 (en) Redundancy management method for BIOS, data processing apparatus and storage system for using same
US10001826B2 (en) Power management mechanism for data storage environment
WO2008121573A1 (en) Managing write caching
US20150019822A1 (en) System for Maintaining Dirty Cache Coherency Across Reboot of a Node

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08744120

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08744120

Country of ref document: EP

Kind code of ref document: A1