US20240020208A1 - Multiple drive failure data recovery - Google Patents
Multiple drive failure data recovery
- Publication number
- US20240020208A1 (application US 17/862,694; US202217862694A)
- Authority
- US
- United States
- Prior art keywords
- storage
- workload
- requests
- array
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1666—Error detection or correction of the data by redundancy in hardware where the redundant component is memory or memory area
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
- G06F11/2094—Redundant storage or storage space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/10—Program control for peripheral devices
- G06F13/102—Program control for peripheral devices where the programme performs an interfacing function, e.g. device driver
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/85—Active fault masking without idle spares
Definitions
- a storage array is a data storage system for block-based storage, file-based storage, or object storage. Rather than store data on a server, storage arrays can include multiple storage devices (e.g., drives) to store vast amounts of data. In addition, storage arrays can include a central management system that manages the data and delivers one or more distributed storage services for an organization. For example, a financial institution can use storage arrays to collect and store financial transactions from local banks (e.g., bank account deposits/withdrawals). Occasionally, a storage array can experience certain events (e.g., power loss, hardware failure, etc.), resulting in data loss.
- a method includes receiving an input/output (IO) workload by a storage array. Additionally, the method includes relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure.
- the method can also include detecting two or more storage device failures while the storage array receives the IO workload.
- the method can also include determining whether a drain event has activated in response to the storage device failures.
- the method can also include identifying each storage drive related to a drain of the RAID group's healthy drives. Each identified storage drive can replace the current storage drives assigned to the RAID group.
- the method can also include reallocating the cached write pending requests to at least one of the storage drive replacements.
- the method can also include identifying the two or more storage device failures belonging to a specific redundant array of independent disks (RAID) group of a plurality of RAID groups.
- the method can also include identifying at least one of the IO requests targeting the two or more failed storage devices.
- the method can also include determining whether at least one IO request is a write pending request cached in one or more memory cache slots. Further, the method can include anticipating receiving additional IO requests from the IO workload targeting the two or more failed storage devices.
- the method can also include identifying one or more cache slots corresponding to the write pending request being partially filled.
- the method can also include writing data to each empty data block of the partially filled cache slots.
- a system is configured to receive an input/output (IO) workload by a storage array. Additionally, the system is configured to relocate the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure.
- the system can also be configured to detect two or more storage device failures while the storage array receives the IO workload.
- the system can also be configured to determine whether a drain event has activated in response to a storage device failure.
- the system can also be configured to identify each storage drive related to a drain of the RAID group's healthy drives. Each identified storage drive can replace the current storage drives assigned to the RAID group.
- the system can also be configured to reallocate the cached write pending requests to at least one of the storage drive replacements.
- the system can also be configured to identify the two or more storage device failures belonging to a specific redundant array of independent disks (RAID) group of a plurality of RAID groups.
- the system can also be configured to identify at least one of the IO requests targeting the two or more failed storage devices.
- the system can also be configured to determine whether each IO request is a write pending request cached in one or more memory cache slots. Further, the system can anticipate receiving additional IO requests from the IO workload targeting the two or more failed storage devices.
- the system can also be configured to identify one or more cache slots corresponding to the write pending request being partially filled.
- the system can also be configured to write data to each empty data block of the partially filled cache slots.
- FIG. 1 illustrates a distributed network environment that includes a storage array in accordance with embodiments of the present disclosure.
- FIG. 2 is a cross-sectional view of a storage device in accordance with embodiments of the present disclosure.
- FIG. 3 A is a communication block diagram in accordance with embodiments of the present disclosure.
- FIG. 3 B is a block diagram of cache memory in accordance with embodiments of the present disclosure.
- FIG. 4 is a flow diagram of a method for mitigating data loss in accordance with embodiments of the present disclosure.
- FIG. 5 is a flow diagram of a method for preserving write pending (WP) data due to drive(s) failure in accordance with example embodiments of the present disclosure.
- a storage array can include multiple storage devices that can store vast amounts of data.
- a storage array can include a management system that manages its memory, storage devices, processing resources (e.g., processors), data, and the like to deliver hosts (e.g., client machines) remote/distributed storage.
- the management system can logically group one or more storage drives or portion(s) thereof to establish a virtual storage device.
- the management system can form a RAID (redundant array of independent disks) group using a set of the array's storage devices.
- the management system can logically segment the RAID group's corresponding storage devices to establish the virtual storage device. For instance, the management system can logically segment the storage devices to enable data striping. Specifically, data striping includes segmenting logically sequential data, such as a file, so consecutive segments are physically stored on different RAID group storage devices. Additionally, the management system can store parity information corresponding to each segment on a single storage device (parity drive) or across the different RAID group storage devices.
- a RAID group can include data member devices (D) and parity member devices (P).
- the D member devices can store data corresponding to input/output (IO) write requests from one or more hosts.
- the P member devices can store the parity information such as “exclusive-ORs” (XORs) of the data stored on the D member devices.
- the management system can use the parity information to recover data if a D member device fails.
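- the XOR parity relationship is what makes this recovery possible: because the P member holds the byte-wise XOR of the D members, any single missing D member can be rebuilt by XOR-ing the surviving members with the parity. The following is a minimal illustrative sketch, not the array's actual implementation; member contents are made up:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data (D) members and one parity (P) member of a RAID group.
d_members = [b"block-A1", b"block-B2", b"block-C3"]
parity = xor_blocks(d_members)            # P = D0 ^ D1 ^ D2

# Simulate losing D1 and rebuild it from the survivors plus parity.
rebuilt = xor_blocks([d_members[0], d_members[2], parity])
assert rebuilt == d_members[1]
```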
- a storage array can issue storage confirmation responses to a host (e.g., a computing device) from which it receives an IO data write request.
- the storage array's performance can be a function of the time the host takes to receive the confirmation response. Because writing data to a storage device can be slow, the storage array's response time can be greater than the host's required performance.
- RAID techniques can include policies that improve a storage array's performance (e.g., response times). Specifically, the policies can instruct a storage array to return storage confirmation responses after caching IO write data (e.g., on volatile memory) but before writing it to one or more physical storage devices. However, such techniques can cause the storage array to issue false confirmation responses.
- a host can issue an IO write request contemporaneous to a drive failure.
- the storage array can determine that the IO request's target storage drive corresponds to the failed drive after it has already cached the IO and sent a confirmation response. Consequently, the storage array may be unable to destage the cached data to a storage drive, causing it to lose the data.
- Embodiments of the present disclosure relate to mitigating such data loss as described in greater detail herein.
- a distributed network environment 100 can include a storage array 102 , a remote system 140 , and hosts 134 .
- the storage array 102 can include components 104 that perform one or more distributed file storage services.
- the storage array 102 can include one or more internal communication channels 112 like Fibre channels, busses, and communication modules that communicatively couple the components 104 .
- the storage array 102 , components 104 , and remote system 140 can include a variety of proprietary or commercially available single or multi-processor systems (e.g., parallel processor systems).
- the single or multi-processor systems can include central processing units (CPUs), graphical processing units (GPUs), and the like.
- the storage array 102 , remote system 140 , and hosts 134 can virtualize one or more of their respective physical computing resources (e.g., processors (not shown), memory 114 , and storage devices 128 ).
- the storage array 102 and, e.g., one or more hosts 134 can establish a network 132 .
- the storage array 102 and a remote system 140 can establish a remote network (RN 138 ).
- the network 132 or the RN 138 can have a network architecture that enables networked devices to send/receive electronic communications using a communications protocol.
- the network architecture can define a storage area network (SAN), local area network (LAN), wide area network (WAN) (e.g., the Internet), an Explicit Congestion Notification (ECN)-enabled Ethernet network, and the like.
- the communications protocol can include a Remote Direct Memory Access (RDMA) protocol, TCP, IP, TCP/IP, SCSI, Fibre Channel, RDMA over Converged Ethernet (RoCE), the Internet Small Computer Systems Interface (iSCSI) protocol, an NVMe-over-fabrics protocol (e.g., NVMe-over-RoCEv2 and NVMe-over-TCP), and the like.
- the storage array 102 can connect to the network 132 or RN 138 using one or more network interfaces.
- the network interface can include a wired/wireless connection interface, bus, data link, and the like.
- a host adapter (HA) 106 e.g., a Fibre Channel Adapter (FA) and the like, can connect the storage array 102 to the network 132 (e.g., SAN).
- a remote adapter (RA) 130 can also connect the storage array 102 to the RN 138 .
- the network 132 and RN 138 can include communication mediums and nodes that link the networked devices.
- communication mediums can include cables, telephone lines, radio waves, satellites, infrared light beams, etc.
- the communication nodes can include switching equipment, phone lines, repeaters, multiplexers, and satellites.
- the network 132 or RN 138 can include a network bridge that enables cross-network communications between, e.g., the network 132 and RN 138 .
- hosts 134 connected to the network 132 can include client machines 136 a - 136 b , running one or more applications.
- the applications can require one or more of the storage array's services.
- each application can send one or more input/output (IO) messages (e.g., a read/write request or other storage service-related request) to the storage array 102 over the network 132 .
- IO messages can include metadata defining performance requirements according to a service level agreement (SLA) between hosts 134 and the storage array provider.
- the storage array 102 can include a memory 114 such as volatile or nonvolatile memory.
- volatile and nonvolatile memory can include random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), and the like.
- each memory type can have distinct performance characteristics (e.g., speed corresponding to reading/writing data).
- the types of memory can include register, shared, constant, user-defined, and the like.
- the memory 114 can include global memory (GM 116) that can cache IO messages and their respective data payloads.
- the memory 114 can include local memory (LM 118 ) that stores instructions that the storage array's processor(s) can execute to perform one or more storage-related services.
- the storage array 102 can deliver its distributed storage services using storage devices 128 .
- the storage devices 126 can include multiple thin-data devices (TDATs) such as persistent storage devices 128 a - 128 c .
- each TDAT can have distinct performance capabilities (e.g., read/write speeds) like hard disk drives (HDDs) and solid-state drives (SSDs).
- the storage array 102 can include an Enginuity Data Services processor (EDS) 108 that performs one or more memory and storage self-optimizing operations (e.g., one or more machine learning techniques). Specifically, the operations can implement techniques that deliver performance, resource availability, data integrity services, and the like based on the SLA and the performance characteristics (e.g., read/write times) of the array's memory 114 and storage devices 126 .
- the EDS 108 can deliver hosts 134 (e.g., client machines 136 a - 136 b ) remote/distributed storage services by virtualizing the storage array's memory/storage resources (memory 114 and storage devices 126 , respectively).
- the storage array 102 can also include a controller 110 (e.g., management system controller) that can reside externally from or within the storage array 102 and one or more of its components 104.
- the controller 110 can communicate with the storage array 102 using any known communication connections.
- the communications connections can include a serial port, parallel port, network interface card (e.g., Ethernet), etc.
- the controller 110 can include logic/circuitry that performs one or more storage-related services.
- the controller 110 can have an architecture designed to manage the storage array's computing, storage, and memory resources as described in greater detail herein.
- the storage array 102 can include an EDS 108 that virtualizes the array's storage devices 126 .
- the EDS 108 can provide a host, e.g., client machine 136 a , with a virtual storage device (e.g., thin-device (TDEV)) that logically represents one or more of the storage array's memory/storage resources or physical slices/portions thereof.
- the EDS 108 can provide each TDEV with a unique identifier (ID) like a target ID (TID).
- EDS 108 can map each TID to its corresponding TDEV using a logical unit number (LUN) (e.g., a pointer to the TDEV).
- the storage devices 126 can include an HDD 202 with stacks of cylinders 204 .
- each cylinder 204 can include one or more tracks 206 .
- Each track 206 can include continuous sets of physical address spaces representing each of its sectors 208 (e.g., slices or portions thereof).
- the EDS 108 can provide each slice/portion with a corresponding logical block address (LBA).
- the EDS 108 can group sets of continuous LBAs to establish a virtual storage device (e.g., TDEV).
- each TDEV can include LBAs corresponding to one or more of the storage devices 128 or portions thereof.
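- as a concrete illustration of this addressing scheme, every physical sector can be assigned an LBA, and a TDEV can then be defined as a contiguous run of those LBAs. The sketch below assumes illustrative drive geometry and field names; it is not the array's internal layout:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalSlice:
    drive: str
    cylinder: int
    track: int
    sector: int

def build_lba_map(drive, cylinders, tracks_per_cylinder, sectors_per_track):
    """Assign a continuous LBA to every sector of a drive."""
    lba_map, lba = {}, 0
    for cyl in range(cylinders):
        for trk in range(tracks_per_cylinder):
            for sec in range(sectors_per_track):
                lba_map[lba] = PhysicalSlice(drive, cyl, trk, sec)
                lba += 1
    return lba_map

lba_map = build_lba_map("HDD-202", cylinders=2, tracks_per_cylinder=4, sectors_per_track=16)
# A TDEV is then simply a group of continuous LBAs drawn from one or more drives.
tdev = {lba: lba_map[lba] for lba in range(0, 64)}
```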
- the storage devices 126 can have distinct performance capabilities.
- an HDD architecture is known by skilled artisans to be slower than an SSD's architecture.
- the array's memory 114 can include different memory types, each with distinct performance characteristics described herein.
- the EDS 108 can establish a storage or memory hierarchy based on the SLA and the performance characteristics of the array's memory/storage resources.
- the SLA can include one or more Service Level Objectives (SLOs) specifying performance metric ranges (e.g., response times and uptimes) corresponding to the hosts' performance requirements.
- the SLO can specify service level (SL) tiers corresponding to each performance metric range and categories of data importance (e.g., critical, high, medium, low).
- the SLA can map critical data types to an SL tier requiring the fastest response time.
- the storage array 102 can allocate the array's memory/storage resources based on an IO workload's anticipated volume of IO messages associated with each SL tier and the memory hierarchy.
- the EDS 108 can establish the hierarchy to include one or more tiers (e.g., subsets of the array's storage and memory) with similar performance capabilities (e.g., response times and uptimes).
- the EDS 108 can establish fast memory and storage tiers to service host-identified critical and valuable data (e.g., Platinum, Diamond, and Gold SLs).
- slow memory and storage tiers can service host-identified non-critical and less valuable data (e.g., Silver and Bronze SLs).
- the EDS 108 can define “fast” and “slow” performance metrics based on relative performance measurements of the array's memory 114 and storage devices 126 .
- the fast tiers can include memory 114 and storage devices 126 with relative performance capabilities exceeding a first threshold.
- slower tiers can include memory 114 and storage devices 128 , with relative performance capabilities falling below a second threshold.
- the first and second thresholds can correspond to the same threshold.
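- one way to picture the tiering decision is a threshold test over a relative performance score per resource; the scores and thresholds below are assumptions used only to illustrate the first/second-threshold split described above:

```python
def build_tiers(resources, fast_threshold, slow_threshold):
    """Split memory/storage resources into fast and slow tiers by relative score.

    `resources` maps a resource name to a relative performance score
    (higher means faster, e.g., the inverse of its measured response time).
    """
    fast = {name for name, score in resources.items() if score >= fast_threshold}
    slow = {name for name, score in resources.items() if score < slow_threshold}
    return fast, slow

resources = {"DRAM-cache": 100.0, "SSD-128a": 40.0, "HDD-128b": 5.0, "HDD-128c": 4.0}
# The first and second thresholds can correspond to the same value, as noted above.
fast_tier, slow_tier = build_tiers(resources, fast_threshold=30.0, slow_threshold=30.0)
# Diamond/Platinum/Gold SLs would be serviced from fast_tier; Silver/Bronze from slow_tier.
```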
- the EDS 108 can establish logical tracks (e.g., track identifiers (TIDs)) by creating LBA groups that include LBAs corresponding to any of the storage devices 126 .
- the EDS 108 can establish a virtual storage device (e.g., a logical unit number (LUN)) by creating TID groups.
- the EDS 108 can generate a searchable data structure, mapping logical storage representations to their corresponding physical address spaces.
- the HA 106 can present the hosts 134 with the logical memory and storage representations based on host or application performance requirements.
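- the searchable structure can be as simple as a pair of lookup tables: TID groups define a LUN, and each TID resolves to a physical track. A hedged sketch follows; the key shapes and names are illustrative assumptions, not the array's data structure:

```python
# LUN -> TID group, and TID -> physical track coordinates.
lun_table = {"LUN-01": ["TID-0001", "TID-0002"]}
tid_table = {
    "TID-0001": {"drive": "310a", "cylinder": 0, "track": 0},
    "TID-0002": {"drive": "310b", "cylinder": 0, "track": 0},
}

def resolve(lun, tid):
    """Map a host-visible (LUN, TID) pair to its physical address space."""
    if tid not in lun_table.get(lun, []):
        raise KeyError(f"{tid} is not part of {lun}")
    return tid_table[tid]

print(resolve("LUN-01", "TID-0002"))
```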
- the storage array 102 can include a controller 110 that includes logic/circuitry configured to perform one or more memory and storage management techniques.
- the controller 110 can establish one or more virtual storage devices 304 as described above.
- the virtual storage devices 304 can include thin devices such as TDEV 306 .
- the controller 110 can virtualize one or more storage devices 126 , or portions thereof, to establish, e.g., the TDEV 306 .
- the controller 110 can establish a redundant array of independent disks (RAID) storage group (RG) such as RG 308 using one or more of the storage devices 126 .
- the controller 110 can establish RG 308 using storage volumes 310 a - 310 d selected from the storage devices 126 . Accordingly, the controller 110 can establish the TDEV 306 using the RG 308 .
- the RG 308 can include data members (D) and parity members (P).
- the D-members can store data
- the P-members can store parity information (e.g., XORs of the data).
- the controller 110 and the RG members (e.g., physical storage volumes 310 a - 310 d ) can access the parity information to discover information corresponding to each member's stored data. Accordingly, the parity information allows the controller 110 to distribute data across all the RG members 310 a - 310 d and recover data if one or more D-members 310 a - 310 d fail.
- the storage array 102 can include IO workloads having IO write requests targeting the TDEV 306 .
- the controller 110 can cache such IO write requests in a cache 330 corresponding to GM 116 .
- the cache 330 can include cache slots 318 corresponding to portions of the GM 116 .
- each cache slot 332 can correspond to an RG member's track (e.g., track 206 ).
- the controller 110 can only destage a cache slot 332 once it is filled.
- the controller 110 can assign the RG 308 with memory resources. For example, the controller 110 can analyze metadata from an IO workload's corresponding IO requests.
- the metadata can include information like IO size, IO type, a TID/LUN, and performance requirements, amongst other related information.
- the controller 110 can generate workload models to form predictions corresponding to IOs targeting TDEV 306 .
- the controller 110 can map cache slots 316 to corresponding RG member slices (e.g., sector 208 of FIG. 2 ). Accordingly, the controller 110 can obtain a TID, LBA, or LUN from an IO request's metadata and cache the IO request and its payload in one or more of the cache slots 316 corresponding to the TID, LBA, or LUN.
- hosts 134 use the storage array 102 as a remote/distributed persistent data storage solution.
- writing data to one or more of the physical storage devices 126 can span a duration that fails to satisfy an SLA.
- the controller 110 can issue the HA 106 instructions to send the hosts 134 a storage confirmation when an IO and its payload are cached but before they are destaged to persistent physical storage (e.g., the RG's corresponding physical storage volumes 310 a - 310 d ).
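- this policy is essentially write-back caching: the IO payload is placed in the cache slot for its target track, the host is acknowledged immediately, and the slot is destaged later, only once it is filled. The sketch below illustrates that ordering and the window it opens; the slot size and names are illustrative assumptions:

```python
from collections import defaultdict

SLOT_BLOCKS = 16   # one cache slot mirrors the sectors of one RG-member track

class WriteBackCache:
    def __init__(self):
        # track id -> {sector index: payload}
        self.slots = defaultdict(dict)

    def write(self, track_id, sector, payload):
        self.slots[track_id][sector] = payload
        return "ACK"                      # confirmation sent before any destage

    def destage_ready(self, track_id):
        # A slot is only destaged once all of its blocks are filled.
        return len(self.slots[track_id]) == SLOT_BLOCKS

cache = WriteBackCache()
print(cache.write("RG308:310a:track-206", sector=8, payload=b"deposit-record"))
print(cache.destage_ready("RG308:310a:track-206"))   # False: slot still partial
# If member 310a goes not ready now, this write-pending slot has nowhere to
# destage, which is the data-loss window the disclosure addresses.
```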
- a RAID group like RG 308 can experience multiple storage volume failures causing the controller 110 to flag each failed volume as not ready (NR). For example, such a failure can result in the RG 308 having NR members 310 a , 310 d , and healthy members 310 b - 310 c .
- the controller 110 can perform one or more operations to recover storage services for the TDEV 306 . Specifically, the controller can provide the TDEV 306 with a new RG using one or more of the array's available storage devices 126 . Further, the controller 110 can perform drain techniques 326 to migrate data from the healthy members 310 b - 310 c to one or more of the available storage devices 126 allocated to the new RG.
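- at its simplest, the drain walks the healthy members of the degraded RAID group and copies their tracks onto spare drives that will form the new group. A minimal sketch under assumed data structures (member contents are plain dicts here):

```python
def drain(raid_group, replacements):
    """Copy data off the healthy members of a degraded RAID group.

    `raid_group` maps member name -> {"state": "ready" | "NR", "tracks": {...}}.
    `replacements` lists spare drives that will make up the new RAID group.
    """
    new_group = {name: {"state": "ready", "tracks": {}} for name in replacements}
    spares = iter(replacements)
    for member, info in raid_group.items():
        if info["state"] != "ready":          # skip not-ready (failed) members
            continue
        new_group[next(spares)]["tracks"].update(info["tracks"])
    return new_group

rg_308 = {
    "310a": {"state": "NR", "tracks": {}},
    "310b": {"state": "ready", "tracks": {"t0": b"..."}},
    "310c": {"state": "ready", "tracks": {"t1": b"..."}},
    "310d": {"state": "NR", "tracks": {}},
}
new_rg = drain(rg_308, replacements=["spare-1", "spare-2", "spare-3", "spare-4"])
```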
- the failure experienced by the RG 308 can occur before corresponding cached IO write requests and data payloads can be destaged to persistent physical storage.
- the storage array 102 can receive and cache additional IO requests targeting the RG 308 and its NR members 310 a , 310 d while the controller performs the drain 326 .
- one or more of the cache slots 316 can correspond to a partially filled write-pending track 206 of one of the NR members 310 a , 310 d .
- the controller 110 can assign cache slot 332 to cache data corresponding to a track (e.g., track 206 of FIG. 2 ) from NR member 310 a .
- the cache slot 332 can include filled cache blocks 314 , and empty cache blocks 312 corresponding to the track from NR member 310 a . Consequently, current naïve approaches discard the write pending data from the cache slot 332 , resulting in data loss. In contrast, the controller 110 can perform cache recovery operations 328 to prevent such data loss, as described in greater detail herein.
- the controller 110 can include logic/circuitry designed to perform the cache recovery operations 328 .
- the cache recovery operations 328 can include techniques that prevent data from partially filled write-pending cached track of a RAID group's NR member from becoming lost.
- the storage array 102 can include one or more daemons 334 that can monitor the array's components 104 .
- the daemons 334 can establish a link 338 to the array's components 104 to monitor the storage devices 126 and cache 330 .
- the daemons 334 can record events corresponding to the storage devices 128 and cache 330 in one or more activity logs. Additionally, the daemons 334 can record each component's global ready state from each component's device header.
- the controller 110 can obtain the activity logs to identify each RG 308 and its corresponding member states. Accordingly, the controller 110 can periodically or randomly perform a read of the activity logs to identify device states. In response to identifying the NR members 310 a , 310 d of RG 308 , the controller 110 can perform one or more operations to recover storage services for the TDEV 306 . For example, the controller 110 can perform a drain 326 of the healthy members 310 b - 310 c as described above. Additionally, the controller 110 can identify any cache slots 318 , including filled cached data blocks corresponding to one or more RG members 308 .
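- the monitoring loop can be pictured as daemons appending device-state events to an activity log that the controller scans for members that have gone not ready. The event shape and log medium below are assumptions for illustration:

```python
activity_log = []   # daemons would append events read from each device header

def daemon_record(device, state):
    """A monitoring daemon notes a device's current global ready state."""
    activity_log.append({"device": device, "state": state})

def find_not_ready(rg_members):
    """Controller-side scan: take the latest state per device, keep NR members."""
    latest = {}
    for event in activity_log:
        latest[event["device"]] = event["state"]
    return [m for m in rg_members if latest.get(m) == "NR"]

daemon_record("310a", "NR")
daemon_record("310b", "ready")
daemon_record("310d", "NR")
print(find_not_ready(["310a", "310b", "310c", "310d"]))   # ['310a', '310d']
```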
- the controller 110 can identify cache slots, like cache slot 332 , corresponding to a partially filled WP (write-pending) track 318 of, e.g., NR member 310 a .
- the cache slot 332 can include empty cache blocks 312 , and NR-related filled cache blocks 314 .
- the empty cache blocks 312 can correspond to sectors 0-7, and the NR-related filled cache blocks 314 can correspond to sectors 8-9, A-F of the partial WP cached tracks 318 .
- the empty cache blocks 312 can correspond to respective sets of continuous LBAs 320 of the TDEV 306 .
- each set of contiguous LBAs 320 can correspond to a sector of the NR member 310 a .
- each LBA can correspond to data 322 or a portion thereof stored by the TDEV 306 .
- the LBAs 320 can include metadata 324 with information corresponding to the data 322 .
- an LBA's metadata 324 can define its related physical address space corresponding to the NR member 310 a.
- the controller 110 can generate fake data using a data generator (not shown). For instance, the controller 110 can provide the data generator with a total size corresponding to the empty cache blocks 312 so it can generate the fake data with a size corresponding to the empty cache blocks 312 .
- the generator can provide the controller 110 with a string of zeros to fill the empty cache blocks 312 .
- the controller 110 can flag the cache slot 332 as a filled WP cached track. Accordingly, the controller 110 can further destage the now filled WP cached tracks to the new RG established for the TDEV 306 . Once each partially filled cache slot corresponding to RG 308 is filled and destaged, the controller 110 can flag the TDEV's new RG as ready (e.g., healthy).
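- putting the recovery step together: for each partially filled write-pending slot tied to a not-ready member, fill the empty blocks with generated filler (zeros here), flag the slot as a filled WP track, and destage it to the replacement RAID group. A sketch under the same assumed slot layout as the earlier example:

```python
SLOT_BLOCKS = 16
BLOCK_SIZE = 512   # bytes per sector; illustrative only

def recover_partial_slot(slot):
    """Zero-fill the empty blocks of a partially filled write-pending cache slot."""
    for sector in range(SLOT_BLOCKS):
        if sector not in slot:                 # empty cache block
            slot[sector] = bytes(BLOCK_SIZE)   # generated filler data (zeros)
    return slot                                # now flaggable as a filled WP track

def destage(slot, new_rg_member):
    """Write the filled slot out to a member of the replacement RAID group."""
    new_rg_member["tracks"]["recovered"] = [slot[s] for s in range(SLOT_BLOCKS)]

# Cache slot 332: sectors 8-15 hold write-pending data, sectors 0-7 never arrived.
slot_332 = {s: b"\x01" * BLOCK_SIZE for s in range(8, 16)}
recover_partial_slot(slot_332)
replacement_member = {"tracks": {}}
destage(slot_332, replacement_member)
assert len(replacement_member["tracks"]["recovered"]) == SLOT_BLOCKS
```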
- one or more of the array's components 104 can execute a method 400 that includes acts to mitigate data loss resulting from storage device failures.
- the method 400 at 402 , can include receiving an input/output (IO) workload by a storage array.
- method 400 can also include relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure.
- each act (e.g., step or routine) of the method 400 can include any combination of techniques described herein.
- one or more of the array's components 104 can execute a method 500 that includes acts to preserve data in response to a drive failure.
- the method 500 at 502 , can include receiving an input/output (IO) workload by a storage array.
- method 500 can include relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure.
- method 500 at 506 , can include determining whether a drain event has activated in response to the two or more storage device failures.
- method 500 can include identifying at least one of the IO requests targeting the two or more failed storage devices.
- each act (e.g., step or routine) of the method 500 can include any combination of techniques described herein.
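- taken together, these acts form a small control flow: detect that two or more members of one RAID group have failed, confirm a drain event is active, and pick out the cached write-pending requests that target the failed members so they can be recovered as described above. A high-level sketch; the function boundaries and input shapes are assumptions:

```python
def method_500(io_workload, raid_groups, drain_active):
    """Sketch of the FIG. 5 flow: preserve WP data after multi-drive failures."""
    # Detect RAID groups with two or more not-ready members.
    affected = {
        rg: [m for m, state in members.items() if state == "NR"]
        for rg, members in raid_groups.items()
    }
    affected = {rg: nr for rg, nr in affected.items() if len(nr) >= 2}
    if not affected or not drain_active:      # drain event must be active
        return []
    # Identify cached write-pending IO requests targeting the failed members.
    failed_members = {m for nr in affected.values() for m in nr}
    return [io for io in io_workload
            if io["write_pending"] and io["target_member"] in failed_members]

workload = [{"target_member": "310a", "write_pending": True},
            {"target_member": "310b", "write_pending": True}]
groups = {"RG308": {"310a": "NR", "310b": "ready", "310c": "ready", "310d": "NR"}}
print(method_500(workload, groups, drain_active=True))
```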
- the implementation can be a computer program product. Additionally, the implementation can include a machine-readable storage device for execution by or to control the operation of a data processing apparatus.
- the data processing apparatus can, for example, be a programmable processor, a computer, or multiple computers.
- a computer program can be in any programming language, including compiled or interpreted languages.
- the computer program can have any deployed form, including a stand-alone program, subroutine, element, or other units suitable for a computing environment.
- One or more computers can execute a deployed computer program.
- One or more programmable processors can perform the method steps by executing a computer program to perform the concepts described herein by operating on input data and generating output.
- An apparatus can also perform the method steps.
- the apparatus can be special purpose logic circuitry.
- the circuitry can, for example, be an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).
- Subroutines and software agents can refer to portions of the computer program, the processor, the special circuitry, software, or hardware that implements that functionality.
- processors suitable for executing a computer program include, by way of example, both general and special purpose microprocessors and any one or more processors of any digital computer.
- a processor can receive instructions and data from a read-only memory, a random-access memory, or both.
- a computer's essential elements are a processor for executing instructions and one or more memory devices for storing instructions and data.
- a computer can receive data from or transfer data to one or more mass storage device(s) for storing data (e.g., magnetic or magneto-optical disks, solid-state drives (SSDs), or optical disks).
- Data transmission and instructions can also occur over a communications network.
- Information carriers that embody computer program instructions and data include all nonvolatile memory forms, including semiconductor memory devices.
- the information carriers can, for example, be EPROM, EEPROM, flash memory devices, magnetic disks, internal hard disks, removable disks, magneto-optical disks, CD-ROM, or DVD-ROM disks.
- the processor and the memory can be supplemented by or incorporated into special purpose logic circuitry.
- a computer having a display device and input peripherals that enable user interaction, such as a keyboard, mouse, or any other input/output peripheral, can implement the above-described techniques.
- the display device can, for example, be a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor.
- the user can provide input to the computer (e.g., interact with a user interface element).
- other kinds of devices can provide for interaction with a user.
- feedback provided to the user can, for example, be in any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback).
- Input from the user can, for example, be in any form, including acoustic, speech, or tactile input.
- a distributed computing system with a back-end component can also implement the above-described techniques.
- the back-end component can, for example, be a data server, a middleware component, or an application server.
- a distributed computing system with a front-end component can implement the above-described techniques.
- the front-end component can, for example, be a client computer having a graphical user interface, a Web browser through which a user can interact with an example implementation, or other graphical user interfaces for a transmitting device.
- the system's components can interconnect using any form or medium of digital data communication (e.g., a communication network). Examples of communication network(s) include a local area network (LAN), a wide area network (WAN), the Internet, wired network(s), or wireless network(s).
- the system can include a client(s) and server(s).
- the client and server (e.g., a remote server) can interact through a communication network.
- a client and server relationship can arise by computer programs running on the respective computers and having a client-server relationship.
- the system can include a storage array(s) that delivers distributed storage services to the client(s) or server(s).
- Packet-based network(s) can include, for example, the Internet, a carrier internet protocol (IP) network (e.g., local area network (LAN), wide area network (WAN), campus area network (CAN), metropolitan area network (MAN), home area network (HAN)), a private IP network, an IP private branch exchange (IPBX), a wireless network (e.g., radio access network (RAN), 802.11 network(s), 802.16 network(s), general packet radio service (GPRS) network, HiperLAN), or other packet-based networks.
- Circuit-based network(s) can include, for example, a public switched telephone network (PSTN), a private branch exchange (PBX), a wireless network, or other circuit-based networks.
- PSTN public switched telephone network
- PBX private branch exchange
- wireless network(s) can include RAN, Bluetooth, code-division multiple access (CDMA) networks, and the like.
- the transmitting device can include, for example, a computer, a computer with a browser device, a telephone, an IP phone, a mobile device (e.g., cellular phone, personal digital assistant (P.D.A.) device, laptop computer, electronic mail device), or other communication devices.
- the browser device includes, for example, a computer (e.g., desktop computer, laptop computer) with a world wide web browser (e.g., Microsoft Internet Explorer® and Mozilla®).
- the mobile computing device includes, for example, a Blackberry®.
- “Comprise,” “include,” or plural forms of each are open-ended, include the listed parts, and contain additional unlisted elements. Unless explicitly disclaimed, the term ‘or’ is open-ended and includes one or more of the listed parts, items, elements, and combinations thereof.
Abstract
One or more aspects of the present disclosure relate to mitigating data loss resulting from storage device failures. In embodiments, an input/output (IO) workload can be received by a storage array. Further, the IO workload's corresponding IO requests stored in the storage array's cache can be relocated in response to a storage device failure.
Description
- A storage array is a data storage system for block-based storage, file-based storage, or object storage. Rather than store data on a server, storage arrays can include multiple storage devices (e.g., drives) to store vast amounts of data. In addition, storage arrays can include a central management system that manages the data and delivers one or more distributed storage services for an organization. For example, a financial institution can use storage arrays to collect and store financial transactions from local banks (e.g., bank account deposits/withdrawals). Occasionally, a storage array can experience certain events (e.g., power loss, hardware failure, etc.), resulting in data loss.
- In one aspect, a method includes receiving an input/output (IO) workload by a storage array. Additionally, the method includes relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure.
- In embodiments, the method can also include detecting two or more storage device failures while the storage array receives the IO workload.
- In embodiments, the method can also include determining whether a drain event has activated in response to the storage device failures.
- In embodiments, the method can also include identifying each storage drive related to a drain of the RAID group's healthy drives. Each identified storage drive can replace the current storage drives assigned to the RAID group.
- In embodiments, the method can also include reallocating the cached write pending requests to at least one of the storage drive replacements.
- In embodiments, the method can also include identifying the two or more storage device failures belonging to a specific redundant array of independent disks (RAID) group of a plurality of RAID groups.
- In embodiments, the method can also include identifying at least one of the IO requests targeting the two or more failed storage devices.
- In embodiments, the method can also include determining whether at least one IO request is a write pending request cached in one or more memory cache slots. Further, the method can include anticipating receiving additional IO requests from the IO workload targeting the two or more failed storage devices.
- In embodiments, the method can also include identifying one or more cache slots corresponding to the write pending request being partially filled.
- In embodiments, the method can also include writing data to each empty data block of the partially filled cache slots.
- In one aspect, a system is configured to receive an input/output (IO) workload by a storage array. Additionally, the system is configured to relocate the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure.
- The system can also be configured to detect two or more storage device failures while the storage array receives the IO workload.
- The system can also be configured to determine whether a drain event has activated in response to a storage device failure.
- The system can also be configured to identify each storage drive related to a drain of the RAID group's healthy drives. Each identified storage drive can replace the current storage drives assigned to the RAID group.
- The system can also be configured to reallocate the cached write pending requests to at least one of the storage drive replacements.
- The system can also be configured to identify the two or more storage device failures belonging to a specific redundant array of independent disks (RAID) group of a plurality of RAID groups.
- The system can also be configured to identify at least one of the IO requests targeting the two or more failed storage devices.
- The system can also be configured to determine whether each IO request is a write pending request cached in one or more memory cache slots. Further, the system can anticipate receiving additional IO requests from the IO workload targeting the two or more failed storage devices.
- The system can also be configured to identify one or more cache slots corresponding to the write pending request being partially filled.
- The system can also be configured to write data to each empty data block of the partially filled cache slots.
- Other technical features can be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
-
FIG. 1 illustrates a distributed network environment that includes a storage array in accordance with embodiments of the present disclosure. -
FIG. 2 is a cross-sectional view of a storage device in accordance with embodiments of the present disclosure. -
FIG. 3A is a communication block diagram in accordance with embodiments of the present disclosure. -
FIG. 3B is a block diagram of cache memory in accordance with embodiments of the present disclosure. -
FIG. 4 is a flow diagram of a method for mitigating data loss in accordance with embodiments of the present disclosure. -
FIG. 5 is a flow diagram of a method for preserving write pending (WP) data due to drive(s) failure in accordance with example embodiments of the present disclosure.
- Organizations often use storage arrays to store data. For example, a financial institution can use storage arrays to store banking account information, deposits, withdrawals, loan information, and other related data. Specifically, a storage array can include multiple storage devices that can store vast amounts of data. Additionally, a storage array can include a management system that manages its memory, storage devices, processing resources (e.g., processors), data, and the like to deliver hosts (e.g., client machines) remote/distributed storage. For example, the management system can logically group one or more storage drives or portion(s) thereof to establish a virtual storage device. Specifically, the management system can form a RAID (redundant array of independent disks) group using a set of the array's storage devices.
- The management system can logically segment the RAID group's corresponding storage devices to establish the virtual storage device. For instance, the management system can logically segment the storage devices to enable data striping. Specifically, data striping includes segmenting logically sequential data, such as a file, so consecutive segments are physically stored on different RAID group storage devices. Additionally, the management system can store parity information corresponding to each segment on a single storage device (parity drive) or across the different RAID group storage devices. Thus, a RAID group can include data member devices (D) and parity member devices (P). The D member devices can store data corresponding to input/output (IO) write requests from one or more hosts. The P member devices can store the parity information such as “exclusive-ORs” (XORs) of the data stored on the D member devices. Thus, the management system can use the parity information to recover data if a D member device fails.
- Further, a storage array can issue storage confirmation responses to a host (e.g., a computing device) from which it receives an IO data write request. Accordingly, the storage array's performance can be a function of the time the host takes to receive the confirmation response. Because writing data to a storage device can be slow, the storage array's response time can be greater than the host's required performance. Thus, RAID techniques can include policies that improve a storage array's performance (e.g., response times). Specifically, the policies can instruct a storage array to return storage confirmation responses after caching IO write data (e.g., on volatile memory) but before writing it to one or more physical storage devices. However, such techniques can cause the storage array to issue false confirmation responses. For example, a host can issue an IO write request contemporaneous to a drive failure. In response to receiving the IO request, the storage array can determine that the IO request's target storage drive corresponds to the failed drive after it has already cached the IO and sent a confirmation response. Consequently, the storage array may be unable to destage the cached data to a storage drive, causing it to lose the data. Embodiments of the present disclosure relate to mitigating such data loss as described in greater detail herein.
- Regarding
FIG. 1 , a distributednetwork environment 100 can include astorage array 102, aremote system 140, and hosts 134. In embodiments, thestorage array 102 can includecomponents 104 that perform one or more distributed file storage services. In addition, thestorage array 102 can include one or moreinternal communication channels 112 like Fibre channels, busses, and communication modules that communicatively couple thecomponents 104. - In embodiments, the
storage array 102,components 104, andremote system 140 can include a variety of proprietary or commercially available single or multi-processor systems (e.g., parallel processor systems). The single or multi-processor systems can include central processing units (CPUs), graphical processing units (GPUs), and the like. Additionally, thestorage array 102,remote system 140, and hosts 134 can virtualize one or more of their respective physical computing resources (e.g., processors (not shown),memory 114, and storage devices 128). - In embodiments, the
storage array 102 and, e.g., one or more hosts 134 (e.g., networked devices) can establish anetwork 132. Similarly, thestorage array 102 and aremote system 140 can establish a remote network (RN 138). Further, thenetwork 132 or theRN 138 can have a network architecture that enables networked devices to send/receive electronic communications using a communications protocol. For example, the network architecture can define a storage area network (SAN), local area network (LAN), wide area network (WAN) (e.g., the Internet), and Explicit Congestion Notification (ECN), Enabled Ethernet network, and the like. Additionally, the communications protocol can include a Remote Direct Memory Access (RDMA), TCP, IP, TCP/IP protocol, SCSI, Fibre Channel, Remote Direct Memory Access (RDMA) over Converged Ethernet (ROCE) protocol, Internet Small Computer Systems Interface (iSCSI) protocol, NVMe-over-fabrics protocol (e.g., NVMe-over-ROCEv2 and NVMe-over-TCP), and the like. - Further, the
storage array 102 can connect to thenetwork 132 orRN 138 using one or more network interfaces. The network interface can include a wired/wireless connection interface, bus, data link, and the like. For example, a host adapter (HA) 106, e.g., a Fibre Channel Adapter (FA) and the like, can connect thestorage array 102 to the network 132 (e.g., SAN). Likewise, a remote adapter (RA) 130 can also connect thestorage array 102 to theRN 138. Further, thenetwork 132 andRN 138 can include communication mediums and nodes that link the networked devices. For example, communication mediums can include cables, telephone lines, radio waves, satellites, infrared light beams, etc. Additionally, the communication nodes can include switching equipment, phone lines, repeaters, multiplexers, and satellites. Further, thenetwork 132 orRN 138 can include a network bridge that enables cross-network communications between, e.g., thenetwork 132 andRN 138. - In embodiments, hosts 134 connected to the
network 132 can include client machines 136 a-136 b, running one or more applications. The applications can require one or more of the storage array's services. Accordingly, each application can send one or more input/output (IO) messages (e.g., a read/write request or other storage service-related request) to thestorage array 102 over thenetwork 132. Further, the IO messages can include metadata defining performance requirements according to a service level agreement (SLA) betweenhosts 134 and the storage array provider. - In embodiments, the
storage array 102 can include amemory 114 such as volatile or nonvolatile memory. Further, volatile and nonvolatile memory can include random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), and the like. Moreover, each memory type can have distinct performance characteristics (e.g., speed corresponding to reading/writing data). For instance, the types of memory can include register, shared, constant, user-defined, and the like. Furthermore, in embodiments, thememory 114 can include global memory (GM 118) that can cache IO messages and their respective data payloads. Additionally, thememory 114 can include local memory (LM 118) that stores instructions that the storage array's processor(s) can execute to perform one or more storage-related services. In addition, thestorage array 102 can deliver its distributed storage services using storage devices 128. For example, thestorage devices 126 can include multiple thin-data devices (TDATs) such as persistent storage devices 128 a-128 c. Further, each TDAT can have distinct performance capabilities (e.g., read/write speeds) like hard disk drives (HDDs) and solid-state drives (SSDs). - In embodiments, the
storage array 102 can include an Enginuity Data Services processor (EDS) 108 that performs one or more memory and storage self-optimizing operations (e.g., one or more machine learning techniques). Specifically, the operations can implement techniques that deliver performance, resource availability, data integrity services, and the like based on the SLA and the performance characteristics (e.g., read/write times) of the array'smemory 114 andstorage devices 126. For example, theEDS 108 can deliver hosts 134 (e.g., client machines 136 a-136 b) remote/distributed storage services by virtualizing the storage array's memory/storage resources (memory 114 andstorage devices 126, respectively). - In embodiments, the
storage array 102 can also include a controller 110 (e.g., management system controller) that can reside externally from or within thestorage array 102 and one or more of itscomponent 104. When external from thestorage array 102, thecontroller 110 can communicate with thestorage array 102 using any known communication connections. The communications connections can include a serial port, parallel port, network interface card (e.g., Ethernet), etc. Further, thecontroller 110 can include logic/circuitry that performs one or more storage-related services. For example, thecontroller 110 can have an architecture designed to manage the storage array's computing, storage, and memory resources as described in greater detail herein. - Regarding
FIG. 2 , thestorage array 102 can include anEDS 108 that virtualizes the array'sstorage devices 126. In embodiments, theEDS 108 can provide a host, e.g., client machine 138 a, with a virtual storage device (e.g., thin-device (TDEV)) that logically represents one or more of the storage array's memory/storage resources or physical slices/portions thereof. Further, theEDS 108 can provide each TDEV with a unique identifier (ID) like a target ID (TID). Additionally,EDS 108 can map each TID to its corresponding TDEV using a logical unit number (LUN) (e.g., a pointer to the TDEV). - For example, the
storage devices 126 can include anHDD 202 with stacks ofcylinders 204. Like a vinyl record's grooves, eachcylinder 204 can include one ormore tracks 206. Eachtrack 206 can include continuous sets of physical address spaces representing each of its sectors 208 (e.g., slices or portions thereof). TheEDS 108 can provide each slice/portion with a corresponding logical block address (LBA). Additionally, theEDS 108 can group sets of continuous LBAs to establish a virtual storage device (e.g., TDEV). Thus, each TDEV can include LBAs corresponding to one or more of the storage devices 128 or portions thereof. - As stated herein, the
storage devices 126 can have distinct performance capabilities. For example, an HDD architecture is known by skilled artisans to be slower than an SSD's architecture. Likewise, the array'smemory 114 can include different memory types, each with distinct performance characteristics described herein. In embodiments, theEDS 108 can establish a storage or memory hierarchy based on the SLA and the performance characteristics of the array's memory/storage resources. For example, the SLA can include one or more Service Level Objectives (SLOs) specifying performance metric ranges (e.g., response times and uptimes) corresponding to the hosts' performance requirements. - Further, the SLO can specify service level (SL) tiers corresponding to each performance metric range and categories of data importance (e.g., critical, high, medium, low). For example, the SLA can map critical data types to an SL tier requiring the fastest response time. Thus, the
storage array 102 can allocate the array's memory/storage resources based on an IO workload's anticipated volume of IO messages associated with each SL tier and the memory hierarchy. - For example, the
EDS 108 can establish the hierarchy to include one or more tiers (e.g., subsets of the array's storage and memory) with similar performance capabilities (e.g., response times and uptimes). Thus, the EDS 108 can establish fast memory and storage tiers to service host-identified critical and valuable data (e.g., Platinum, Diamond, and Gold SLs). In contrast, slow memory and storage tiers can service host-identified non-critical and less valuable data (e.g., Silver and Bronze SLs). Additionally, the EDS 108 can define “fast” and “slow” performance metrics based on relative performance measurements of the array's memory 114 and storage devices 126. Thus, the fast tiers can include memory 114 and storage devices 126 with relative performance capabilities exceeding a first threshold. In contrast, slower tiers can include memory 114 and storage devices 126 with relative performance capabilities falling below a second threshold. In embodiments, the first and second thresholds can correspond to the same threshold.
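- The tier selection described above can be pictured with a small, hedged Python sketch; the resource names, performance scores, and threshold value below are illustrative assumptions, not values from the patent.

```python
# Hypothetical relative performance scores for the array's memory/storage resources.
resources = {"DRAM-cache": 100.0, "SSD-pool": 60.0, "HDD-pool": 10.0}
FAST_THRESHOLD = 50.0        # first threshold; the second threshold could be the same value

fast_tier = {name for name, perf in resources.items() if perf >= FAST_THRESHOLD}
slow_tier = {name for name, perf in resources.items() if perf < FAST_THRESHOLD}

# Map host-identified service levels to the tiers, mirroring the SL examples above.
sl_to_tier = {
    "Diamond": fast_tier, "Platinum": fast_tier, "Gold": fast_tier,
    "Silver": slow_tier, "Bronze": slow_tier,
}
print(sorted(sl_to_tier["Diamond"]), sorted(sl_to_tier["Bronze"]))
```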
- In embodiments, the EDS 108 can establish logical tracks (e.g., track identifiers (TIDs)) by creating LBA groups that include LBAs corresponding to any of the storage devices 126. For example, the EDS 108 can establish a virtual storage device (e.g., a logical unit number (LUN)) by creating TID groups. Further, the EDS 108 can generate a searchable data structure that maps logical storage representations to their corresponding physical address spaces. Further, the HA 106 can present the hosts 134 with the logical memory and storage representations based on host or application performance requirements.
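- A minimal sketch of the searchable logical-to-physical structure mentioned above might be a simple lookup table; the (device, cylinder, track, sector) tuple layout and the names used here are assumptions for illustration only.

```python
# Maps a (LUN, LBA) logical address to a (device, cylinder, track, sector) physical address.
logical_to_physical = {
    ("LUN-7", 0): ("HDD-202", 0, 0, 0),
    ("LUN-7", 1): ("HDD-202", 0, 0, 1),
}

def resolve(lun: str, lba: int):
    """Return the physical address backing a logical block, or None if unmapped."""
    return logical_to_physical.get((lun, lba))

print(resolve("LUN-7", 1))   # ('HDD-202', 0, 0, 1)
```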
- Regarding FIG. 3A, the storage array 102 can include a controller 110 that includes logic/circuitry configured to perform one or more memory and storage management techniques. For example, the controller 110 can establish one or more virtual storage devices 304 as described above. The virtual storage devices 304 can include thin devices such as TDEV 306. Specifically, the controller 110 can virtualize one or more storage devices 126, or portions thereof, to establish, e.g., the TDEV 306. In embodiments, the controller 110 can establish a redundant array of independent disks (RAID) storage group (RG) such as RG 308 using one or more of the storage devices 126. For example, the controller 110 can establish RG 308 using storage volumes 310 a-310 d selected from the storage devices 126. Accordingly, the controller 110 can establish the TDEV 306 using the RG 308. - In embodiments, the
RG 308 can include data members (D) and parity members (P). The D-members can store data, while the P-members can store parity information (e.g., XORs of the data). The controller 110 and the RG members (e.g., physical storage volumes 310 a-310 d) can access the parity information to discover information corresponding to each member's stored data. Accordingly, the parity information allows the controller 110 to distribute data across all the RG members 310 a-310 d and recover data if one or more D-members 310 a-310 d fail.
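- The XOR parity relationship described above can be sketched in a few lines of Python; the stripe contents and member layout are illustrative, and real RAID implementations add striping, rotation, and metadata not shown here.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, as a parity member would."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d1, d2, d3 = b"\x11\x22", b"\x33\x44", b"\x55\x66"   # data (D) members of one stripe
parity = xor_blocks(d1, d2, d3)                      # parity (P) member

# If one D-member is lost, XOR of the survivors and the parity rebuilds it.
recovered_d2 = xor_blocks(d1, d3, parity)
assert recovered_d2 == d2
```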
- In embodiments, the storage array 102 can receive IO workloads having IO write requests targeting the TDEV 306. The controller 110 can cache such IO write requests in a cache 330 corresponding to the GM 116. Specifically, the cache 330 can include cache slots 318 corresponding to portions of the GM 116. Additionally, each cache slot 332 can correspond to an RG member's track (e.g., track 206). Thus, in some examples, the controller 110 can only destage a cache slot 332 once it is filled. - In embodiments, the
controller 110 can assign memory resources to the RG 308. For example, the controller 110 can analyze metadata from an IO workload's corresponding IO requests. The metadata can include information like IO size, IO type, a TID/LUN, and performance requirements, amongst other related information. The controller 110 can generate workload models to form predictions corresponding to IOs targeting the TDEV 306. Thus, the controller 110 can map cache slots 316 to corresponding RG member slices (e.g., sector 208 of FIG. 2). Accordingly, the controller 110 can obtain a TID, LBA, or LUN from an IO request's metadata and cache the IO request and its payload in one or more of the cache slots 316 corresponding to the TID, LBA, or LUN.
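- The per-track caching and destage-when-full behavior described in the last two paragraphs can be illustrated with the hedged Python sketch below; the sixteen-sector track size echoes the 0-F sector numbering used later, and all class and field names are assumptions.

```python
SECTORS_PER_TRACK = 16

class CacheSlot:
    """One cache slot holding write-pending data for a single RG member track."""
    def __init__(self, track_id):
        self.track_id = track_id
        self.blocks = [None] * SECTORS_PER_TRACK   # None marks an empty cache block

    def cache_write(self, sector: int, payload: bytes):
        self.blocks[sector] = payload              # keep the payload write-pending

    def is_full(self) -> bool:
        return all(block is not None for block in self.blocks)

slot = CacheSlot(track_id=("RG-308", "member-310a", "track-206"))
slot.cache_write(8, b"payload")
# Only a completely filled slot would be destaged to the physical track.
print("destage" if slot.is_full() else "partially filled write-pending track")
```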
- Generally, hosts 134 use the storage array 102 as a remote/distributed persistent data storage solution. However, writing data to one or more of the physical storage devices 126 can span a duration that fails to satisfy an SLA. Thus, the controller 110 can issue the HA 106 instructions to send the hosts 134 a storage confirmation when an IO and its payload are cached but before they are destaged to persistent physical storage (e.g., the RG's corresponding physical storage volumes 310 a-310 d). - Occasionally, a RAID group like
RG 308 can experience multiple storage volume failures, causing the controller 110 to flag each failed volume as not ready (NR). For example, such a failure can result in the RG 308 having NR members 310 a and 310 d and healthy members 310 b-310 c. Thus, the controller 110 can perform one or more operations to recover storage services for the TDEV 306. Specifically, the controller 110 can provide the TDEV 306 with a new RG using one or more of the array's available storage devices 126. Further, the controller 110 can perform drain techniques 326 to migrate data from the healthy members 310 b-310 c to one or more of the available storage devices 126 allocated to the new RG.
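- A hedged sketch of the drain idea follows: data on the RAID group's healthy members is copied to spare devices allocated to a replacement group, while the failed members' contents must be recovered separately. Member names, spare names, and payloads are all illustrative assumptions.

```python
failed = {"310a", "310d"}                                        # NR members
members = {"310a": None, "310b": b"B-data", "310c": b"C-data", "310d": None}
spares = ["spare-1", "spare-2", "spare-3", "spare-4"]            # devices for the new RG

new_rg = {}
for old_member, payload in members.items():
    target = spares.pop(0)
    # Healthy members are drained (copied); failed members are left for recovery.
    new_rg[target] = payload if old_member not in failed else None

print({dev: (data if data else "pending recovery") for dev, data in new_rg.items()})
```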
- However, the failure experienced by the RG 308 can occur before corresponding cached IO write requests and data payloads can be destaged to persistent physical storage. Additionally, the storage array 102 can receive and cache additional IO requests targeting the RG 308 and its NR members 310 a and 310 d during the drain 326. Thus, one or more of the cache slots 316 can correspond to a partially filled write-pending track 206 of one of the NR members 310 a and 310 d. For example, the controller 110 can assign cache slot 332 to cache data corresponding to a track (e.g., track 206 of FIG. 2) from NR member 310 a. The cache slot 332 can include filled cache blocks 314 and empty cache blocks 312 corresponding to the track from NR member 310 a. Consequently, current naïve approaches discard the write pending data from the cache slot 332, resulting in data loss. In contrast, the controller 110 can perform cache recovery operations 328 to prevent such data loss, as described in greater detail herein. - Regarding
FIG. 3B, the controller 110 can include logic/circuitry designed to perform the cache recovery operations 328. For instance, the cache recovery operations 328 can include techniques that prevent data from a partially filled write-pending cached track of a RAID group's NR member from being lost. - In embodiments, the
storage array 102 can include one or more daemons 334 that can monitor the array's components 104. For example, the daemons 334 can establish a link 338 to the array's components 104 to monitor the storage devices 126 and cache 330. Further, the daemons 334 can record events corresponding to the storage devices 126 and cache 330 in one or more activity logs. Additionally, the daemons 334 can record each component's global ready state from each component's device header. - In embodiments, the
controller 110 can obtain the activity logs to identify each RG 308 and its corresponding member states. Accordingly, the controller 110 can periodically or randomly perform a read of the activity logs to identify device states. In response to identifying the NR members 310 a and 310 d of the RG 308, the controller 110 can perform one or more operations to recover storage services for the TDEV 306. For example, the controller 110 can perform a drain 326 of the healthy members 310 b-310 c as described above. Additionally, the controller 110 can identify any cache slots 318 that include filled cached data blocks corresponding to one or more members of the RG 308.
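- One way to picture the log-driven check above is the small Python sketch below: daemons record per-member ready states, and the controller scans those records for RAID groups with two or more not-ready members. The log format and field names are assumptions for illustration.

```python
activity_log = [
    {"rg": "RG-308", "member": "310a", "state": "NR"},
    {"rg": "RG-308", "member": "310b", "state": "READY"},
    {"rg": "RG-308", "member": "310c", "state": "READY"},
    {"rg": "RG-308", "member": "310d", "state": "NR"},
]

def not_ready_members(log, rg_id):
    """Return the members of one RAID group that the log marks as not ready."""
    return [entry["member"] for entry in log if entry["rg"] == rg_id and entry["state"] == "NR"]

nr = not_ready_members(activity_log, "RG-308")
if len(nr) >= 2:
    print("multi-drive failure on RG-308; start drain and cache recovery for:", nr)
```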
- In embodiments, the controller 110 can identify cache slots, like cache slot 332, corresponding to a partially filled WP (write-pending) track 318 of, e.g., NR member 310 a. For example, the cache slot 332 can include empty cache blocks 312 and NR-related filled cache blocks 314. The empty cache blocks 312 can correspond to sectors 0-7, and the NR-related filled cache blocks 314 can correspond to sectors 8-9 and A-F of the partial WP cached tracks 318. Furthermore, the empty cache blocks 312 can correspond to respective sets of contiguous LBAs 320 of the TDEV 306. Thus, each set of contiguous LBAs 320 can correspond to a sector of the NR member 310 a. Further, each LBA can correspond to data 322 or a portion thereof stored by the TDEV 306. Additionally, the LBAs 320 can include metadata 324 with information corresponding to the data 322. For example, an LBA's metadata 324 can define its related physical address space corresponding to the NR member 310 a. - In embodiments, the
controller 110 can generate fake data using a data generator (not shown). For instance, the controller 110 can provide the data generator with a total size corresponding to the empty cache blocks 312 so that the generator can produce fake data of a matching size. For example, the generator can provide the controller 110 with a string of zeros to fill the empty cache blocks 312. In response to filling the empty cache blocks 312, the controller 110 can flag the cache slot 332 as a filled WP cached track. Accordingly, the controller 110 can further destage the now-filled WP cached tracks to the new RG established for the TDEV 306. Once each partially filled cache slot corresponding to the RG 308 is filled and destaged, the controller 110 can flag the TDEV's new RG as ready (e.g., healthy).
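- The fill-and-destage step just described can be sketched as follows; the block size, the zero filler, and the helper names are assumptions, and a real implementation would destage through the array's cache management rather than return a plain list.

```python
BLOCK_SIZE = 512

def recover_partial_slot(blocks):
    """Pad every empty (None) block with zeros so the slot can be treated as a full WP track."""
    filled = [block if block is not None else bytes(BLOCK_SIZE) for block in blocks]
    return filled, True                     # (full track image, ready-to-destage flag)

# Sectors 0-7 empty, sectors 8-F already filled with write-pending data.
slot_blocks = [None] * 8 + [b"\xab" * BLOCK_SIZE] * 8
track_image, ready_to_destage = recover_partial_slot(slot_blocks)
assert ready_to_destage and all(len(block) == BLOCK_SIZE for block in track_image)
# A destage routine would then write track_image to the replacement RG's member track.
```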
- The following text includes details of one or more methods or techniques disclosed herein. Each method is depicted and described as one or more acts for context and without limitation. Each act can occur in various orders or concurrently with other acts, whether or not those acts are presented or described herein. Furthermore, each act can be optional and, thus, not required to implement each method described herein.
- Regarding
FIG. 4, one or more of the array's components 104 can execute a method 400 that includes acts to mitigate data loss resulting from storage device failures. In embodiments, the method 400, at 402, can include receiving an input/output (IO) workload by a storage array. At 404, the method 400 can also include relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure. Additionally, each act (e.g., step or routine) of the method 400 can include any combination of techniques described herein. - Regarding
FIG. 5, one or more of the array's components 104 can execute a method 500 that includes acts to preserve data in response to a drive failure. In embodiments, the method 500, at 502, can include receiving an input/output (IO) workload by a storage array. At 504, the method 500 can include relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure. Further, the method 500, at 506, can include determining whether a drain event has activated in response to two or more storage device failures. Additionally, at 508, the method 500 can include identifying at least one of the IO requests targeting the two or more failed storage devices. Additionally, each act (e.g., step or routine) of the method 500 can include any combination of techniques described herein.
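- As a non-authoritative sketch of how the acts of the method 500 relate, the Python below strings the numbered steps together; the data shapes and names are assumptions, and each step stands in for the corresponding techniques described earlier.

```python
def method_500(cached_requests, failed_devices):
    # 502/504: the IO workload has been received and its IO requests cached.
    if len(failed_devices) >= 2:                               # two or more device failures
        drain_active = True                                    # 506: drain event activated
        hits = [io for io in cached_requests if io["target"] in failed_devices]  # 508
        return drain_active, hits
    return False, []

cached = [{"target": "310a", "payload": b"x"}, {"target": "310b", "payload": b"y"}]
print(method_500(cached, failed_devices={"310a", "310d"}))
```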
- Using the teachings disclosed herein, a skilled artisan can implement the above-described systems and methods in digital electronic circuitry, computer hardware, firmware, or software. The implementation can be a computer program product. Additionally, the implementation can include a machine-readable storage device, for execution by, or to control the operation of, a data processing apparatus. The implementation can, for example, be a programmable processor, a computer, or multiple computers.
- A computer program can be in any programming language, including compiled or interpreted languages. The computer program can have any deployed form, including a stand-alone program, subroutine, element, or other units suitable for a computing environment. One or more computers can execute a deployed computer program.
- One or more programmable processors can perform the method steps by executing a computer program that implements the concepts described herein by operating on input data and generating output. An apparatus can also perform the method steps. The apparatus can be special purpose logic circuitry, for example, an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). Subroutines and software agents can refer to portions of the computer program, the processor, the special circuitry, software, or hardware that implements that functionality.
- Processors suitable for executing a computer program include, by way of example, both general and special purpose microprocessors and any one or more processors of any digital computer. A processor can receive instructions and data from a read-only memory, a random-access memory, or both. Thus, for example, a computer's essential elements are a processor for executing instructions and one or more memory devices for storing instructions and data. Additionally, a computer can receive data from or transfer data to one or more mass storage device(s) for storing data (e.g., magnetic disks, magneto-optical disks, solid-state drives (SSDs), or optical disks).
- Data transmission and instructions can also occur over a communications network. Information carriers that embody computer program instructions and data include all nonvolatile memory forms, including semiconductor memory devices. The information carriers can, for example, be EPROM, EEPROM, flash memory devices, magnetic disks, internal hard disks, removable disks, magneto-optical disks, CD-ROM, or DVD-ROM disks. In addition, the processor and the memory can be supplemented by or incorporated into special purpose logic circuitry.
- A computer having a display device and input/output peripherals (e.g., a keyboard or mouse) that enable user interaction can implement the above-described techniques. The display device can, for example, be a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor. The user can provide input to the computer (e.g., interact with a user interface element). In addition, other kinds of devices can provide for interaction with a user. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback). Input from the user can, for example, be in any form, including acoustic, speech, or tactile input.
- A distributed computing system with a back-end component can also implement the above-described techniques. The back-end component can, for example, be a data server, a middleware component, or an application server. Further, a distributed computing system with a front-end component can implement the above-described techniques. The front-end component can, for example, be a client computer having a graphical user interface, a Web browser through which a user can interact with an example implementation, or other graphical user interfaces for a transmitting device. Finally, the system's components can interconnect using any form or medium of digital data communication (e.g., a communication network). Examples of communication network(s) include a local area network (LAN), a wide area network (WAN), the Internet, wired network(s), or wireless network(s).
- The system can include a client(s) and server(s). The client and server (e.g., a remote server) can interact through a communication network. For example, a client and server relationship can arise from computer programs running on the respective computers and having a client-server relationship with each other. Further, the system can include a storage array(s) that delivers distributed storage services to the client(s) or server(s).
- Packet-based network(s) can include, for example, the Internet, a carrier internet protocol (IP) network (e.g., local area network (LAN), wide area network (WAN), campus area network (CAN), metropolitan area network (MAN), home area network (HAN)), a private IP network, an IP private branch exchange (IPBX), a wireless network (e.g., radio access network (RAN), 802.11 network(s), 802.18 network(s), general packet radio service (GPRS) network, HiperLAN), or other packet-based networks. Circuit-based network(s) can include, for example, a public switched telephone network (PSTN), a private branch exchange (PBX), a wireless network, or other circuit-based networks. Finally, wireless network(s) can include RAN, Bluetooth, code-division multiple access (CDMA) network, time division multiple access (TDMA) network, and global system for mobile communications (GSM) network.
- The transmitting device can include, for example, a computer, a computer with a browser device, a telephone, an IP phone, a mobile device (e.g., cellular phone, personal digital assistant (P.D.A.) device, laptop computer, electronic mail device), or other communication devices. The browser device includes, for example, a computer (e.g., desktop computer, laptop computer) with a world wide web browser (e.g., Microsoft Internet Explorer® and Mozilla®). The mobile computing device includes, for example, a Blackberry®.
- Comprise, include, and plural forms of each are open-ended, include the listed parts, and can contain additional unlisted elements. Unless explicitly disclaimed, the term 'or' is open-ended and includes one or more of the listed parts, items, elements, and combinations thereof.
Claims (20)
1. A method comprising:
receiving an input/output (IO) workload by a storage array; and
relocating the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure, wherein relocating the IO workload's corresponding IO requests includes determining a drain event has activated in response to two or more storage device failures.
2. The method of claim 1 , further comprising:
detecting two or more storage device failures while the storage array receives the IO workload.
3. The method of claim 2 , further comprising:
identifying each of the two or more storage device failures as belonging to a specific redundant array of independent disks (RAID) group of a plurality of RAID groups.
4. (canceled)
5. The method of claim 1 , further comprising:
identifying at least one of the IO requests targeting the two or more storage devices.
6. The method of claim 5 , further comprising:
determining that the at least one IO request is a write pending request cached in one or more memory cache slots; and
anticipating receiving additional IO requests from the IO workload targeting the two or more failed storage devices.
7. The method of claim 6 , further comprising:
identifying one or more of the cache slots corresponding to the write pending request being partially filled.
8. The method of claim 7 , further comprising:
writing data to each empty data block of the partially filled cache slots.
9. The method of claim 3 , further comprising:
identifying each storage drive related to a drain of the RAID group's healthy drives, where each identified storage drive replaces the storage drives currently assigned to the RAID group.
10. The method of claim 1 , further comprising:
reallocating cached write pending requests to at least one of the storage drive replacements.
11. A system configured to:
receive an input/output (IO) workload by a storage array; and
relocate the IO workload's corresponding IO requests stored in the storage array's cache in response to a storage device failure, wherein relocating the IO workload's corresponding IO requests includes determining a drain event has activated in response to two or more storage device failures.
12. The system of claim 11 , further configured to:
detect two or more storage device failures while the storage array receives the IO workload.
13. The system of claim 12 , further configured to:
identify each of the two or more storage device failures as belonging to a specific redundant array of independent disks (RAID) group of a plurality of RAID groups.
14. (canceled)
15. The system of claim 11 , further configured to:
identify at least one of the IO requests targeting the two or more storage devices.
16. The system of claim 15 , further configured to:
determine that the at least one IO request is a write pending request cached in one or more memory cache slots; and
anticipate receiving additional IO requests from the IO workload targeting the two or more failed storage devices.
17. The system of claim 16 , further configured to:
identify one or more of the cache slots corresponding to the write pending request being partially filled.
18. The system of claim 17 , further configured to:
write data to each empty data block of the partially filled cache slots.
19. The system of claim 13 , further configured to:
identify each storage drive related to a drain of the RAID group's healthy drives, where each identified storage drive replaces the storage drives currently assigned to the RAID group.
20. The system of claim 11 , further configured to:
reallocate cached write pending requests to at least one of the storage drive replacements.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/862,694 US11853174B1 (en) | 2022-07-12 | 2022-07-12 | Multiple drive failure data recovery |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/862,694 US11853174B1 (en) | 2022-07-12 | 2022-07-12 | Multiple drive failure data recovery |
Publications (2)
Publication Number | Publication Date |
---|---|
US11853174B1 US11853174B1 (en) | 2023-12-26 |
US20240020208A1 true US20240020208A1 (en) | 2024-01-18 |
Family
ID=89384052
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/862,694 Active US11853174B1 (en) | 2022-07-12 | 2022-07-12 | Multiple drive failure data recovery |
Country Status (1)
Country | Link |
---|---|
US (1) | US11853174B1 (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5708771A (en) * | 1995-11-21 | 1998-01-13 | Emc Corporation | Fault tolerant controller system and method |
US20020152362A1 (en) * | 2001-04-17 | 2002-10-17 | Cochran Robert A. | Unified data sets distributed over multiple I/O-device arrays |
US20060026347A1 (en) * | 2004-07-29 | 2006-02-02 | Ching-Hai Hung | Method for improving data reading performance and storage system for performing the same |
US20060277328A1 (en) * | 2005-06-06 | 2006-12-07 | Dell Products L.P. | System and method for updating the firmware of a device in a storage network |
US20140365726A1 (en) * | 2011-07-12 | 2014-12-11 | Violin Memory, Inc. | Memory system management |
US20190332325A1 (en) * | 2018-04-28 | 2019-10-31 | EMC IP Holding Company LLC | Method, device and computer readable medium of i/o management |
JP2020502606A (en) * | 2016-11-16 | 2020-01-23 | サンディスク テクノロジーズ エルエルシー | Store operation queue |
US20210004160A1 (en) * | 2019-07-02 | 2021-01-07 | International Business Machines Corporation | Prefetching data blocks from a primary storage to a secondary storage system while data is being synchronized between the primary storage and secondary storage |
US20210019083A1 (en) * | 2019-07-17 | 2021-01-21 | International Business Machines Corporation | Application storage segmentation reallocation |
US11016824B1 (en) * | 2017-06-12 | 2021-05-25 | Pure Storage, Inc. | Event identification with out-of-order reporting in a cloud-based environment |
Also Published As
Publication number | Publication date |
---|---|
US11853174B1 (en) | 2023-12-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
AS | Assignment |
Owner name: DELL PRODUCTS L.P., TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:FARRELL, EAMONN;REEL/FRAME:060576/0678 Effective date: 20220706 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |