WO2016032955A2 - NVRAM-enabled storage systems - Google Patents

NVRAM-enabled storage systems

Info

Publication number
WO2016032955A2
WO2016032955A2 PCT/US2015/046534 US2015046534W WO2016032955A2
Authority
WO
WIPO (PCT)
Prior art keywords
nvram
storage
data
persistent
raid
Prior art date
Application number
PCT/US2015/046534
Other languages
English (en)
Other versions
WO2016032955A3 (fr)
Inventor
Bruce Eric MANN
Matthew Edward Cross
Arthur James BEAVERSON
Bang Chang
Original Assignee
Cacheio Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cacheio Llc
Publication of WO2016032955A2
Publication of WO2016032955A3

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C8/00 Arrangements for selecting an address in a digital store
    • G11C8/06 Address interface arrangements, e.g. address buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1441 Resetting or repowering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present disclosure relates generally to storage systems, and, more specifically, to non-volatile random access memory (NVRAM) enabled storage systems.
  • NVRAM: non-volatile random access memory.
  • SSDs: solid-state devices.
  • RAID: Redundant Array of Independent Disks.
  • The RAID system may take the form of a hardware RAID card, RAID on a Chip, software RAID, Erasure Coding, or JBOD (Just a Bunch of Disks).
  • Transactional applications typically issue read and write requests (I/O requests) that have small transfer sizes and are not in sequential block address order (collectively referred to as "random" I/O requests).
  • SSDs typically service random read requests many times faster than traditional hard disk drives (HDDs).
  • SSD write amplification also reduces write performance and SSD endurance.
  • An SSD is comprised of a plurality of flash pages. An entire flash page must be "erased" before it can be rewritten. There is a write cycle limit to how many times a flash page can be erased and rewritten.
  • When a transactional application writes to an SSD in a RAID system, its write request size will likely be much smaller than the SSD's flash page size, resulting in partially written flash pages. Consequently, the SSD has to perform garbage collection by moving user data from one partially written flash page to another until an entire flash page contains no more user data and can be erased.
  • Garbage collection turns each application write into multiple SSD writes, also known as write amplification. Given the write cycle limit on each flash page, write amplification significantly reduces SSD endurance and write performance. When an application or a storage system writes to an SSD in multiple small transfer sizes in sequential block address order (sequential writes), the SSD typically can fill entire flash pages with fewer partially written pages, reducing the write amplification during garbage collection.
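  • As a rough illustration of the effect described above (the formula below and its helper names are an assumed way to quantify it, not part of the disclosure), the write amplification factor can be estimated as total flash writes divided by application writes:

```python
def write_amplification_factor(app_bytes: int, gc_bytes_rewritten: int) -> float:
    """Estimate SSD write amplification: total flash writes / application writes.

    app_bytes          -- bytes written by the application
    gc_bytes_rewritten -- still-valid bytes the SSD relocated during garbage collection
    A factor of 1.0 means each application write costs one flash write; larger
    values consume the flash write-cycle budget proportionally faster.
    """
    if app_bytes == 0:
        return 0.0
    return (app_bytes + gc_bytes_rewritten) / app_bytes

# Example: if each 4 KiB random write forces the SSD to relocate 12 KiB of
# still-valid data during garbage collection, endurance is consumed four
# times faster than the application write rate alone would suggest.
assert write_amplification_factor(4096, 12288) == 4.0
```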
  • An SSD typically comprises a persistent flash medium for storing data and a volatile memory to hold data temporarily before the data is committed to the persistent flash medium.
  • In the event of a power failure, the data stored in the volatile memory will be lost.
  • Some SSDs are equipped with a capacitor or battery, which provides enough power for flushing the data stored in the volatile memory to the persistent flash medium.
  • However, the additional capacitor or battery can significantly increase the cost of the SSDs.
  • A storage system is configured to prevent data loss in the event of a power failure.
  • The storage system comprises a processor, one or more storage devices, and a non-volatile random access memory (NVRAM) device.
  • Each of the one or more storage devices comprises a persistent medium.
  • The NVRAM device is configured to store one or more data blocks to be sent to a storage device for persistent storage.
  • The processor is configured to check whether a data block stored on the NVRAM is also stored on the storage device's persistent medium before deleting the data block from the NVRAM.
  • A storage system is configured to reduce read-modify-write operations and write amplification.
  • The storage system comprises a processor, a RAID system with one or more storage devices, an NVRAM device, and a memory.
  • The NVRAM device stores one or more data blocks that are to be sent to the RAID system for persistent storage.
  • The memory stores metadata that maps every data block's logical block address (LBA) to its physical block address (PBA).
  • The processor is configured to handle random write requests from an application. When handling random write requests, the processor first stores a data block in the NVRAM, then accumulates one or more data blocks in the NVRAM into a full RAID stripe.
  • The full RAID stripe is written to the RAID system to reduce read-modify-write operations.
  • The metadata is updated to map the LBAs of the one or more data blocks to the PBAs of the one or more data blocks.
  • The one or more data blocks are deleted from the NVRAM after the metadata has been updated.
  • Figure 1 illustrates a block diagram of a storage system with NVRAM devices.
  • Figure 2 illustrates a block diagram of a storage device with volatile memory and persistent medium.
  • Figure 3 illustrates the deferred write process at the NVRAM.
  • Figure 4 illustrates the deferred write process at the storage device.
  • Figure 5 illustrates a flow diagram of the check block persistent process.
  • Figure 6 illustrates a flow diagram of the make block persistent process.
  • Figure 7 illustrates a block diagram of a storage system with a RAID system.
  • Figure 8 illustrates a block diagram of RAID system data layout.
  • Figure 9 illustrates a flow diagram of NVRAM enabled writes to a RAID system.
  • Figure 10 illustrates a flow diagram of NVRAM enabled metadata updates.
  • A storage system has at least one NVRAM device to accomplish (1) preventing data loss in the event of a power failure; (2) reducing read-modify-write operations; and (3) reducing solid-state device write amplification.
  • Fig. 1 illustrates one embodiment of a storage system 100 that includes a processor 110 and one or more storage devices 120.
  • Examples of storage devices include solid-state devices (SSDs), hard disk drives (HDDs), and a combination of SSDs and HDDs (Hybrid).
  • The storage system 100 provides persistent storage to one or more user applications 140.
  • The storage device 120 may be accessible by multiple storage systems 100 as a shared storage device.
  • The application 140 and the storage system 100 may be running on the same physical system.
  • The application 140 may access the storage system through a storage network such as FibreChannel, Ethernet, InfiniBand, and PCIe.
  • The processor 110 interfaces between the application 140 and the storage device 120.
  • The processor 110 controls and manages the storage device 120.
  • The processor 110 may provide a set of commands for the application 140 to read from and write to the storage device 120.
  • The processor 110 can provide redundancy, performance, and data services that often can't be achieved by the storage device 120.
  • The storage system 100 includes one or more non-volatile random-access memory (NVRAM) devices 130.
  • Examples of NVRAM include battery-backed DRAM, NVDIMM, PCIe NVRAM cards, and solid-state devices.
  • Upon receiving a write request from the application 140, the processor 110 stores the write data in the NVRAM 130 and acknowledges to the application 140 that the write request is successful before the data is actually committed to the storage device 120. This process is known as a deferred write.
  • Fig. 2 illustrates one embodiment of a storage device 120, such as a solid-state device (SSD), that comprises a persistent medium 172 for storing data and a volatile memory 174 for buffering data temporarily before the data is committed to the persistent medium 172.
  • In the event of a power failure, the data stored in the volatile memory 174 will be lost.
  • Some SSDs are equipped with a capacitor or battery, which provides enough power to write all the data in the volatile memory 174 to the persistent medium 172.
  • However, the capacitor or battery can significantly increase the cost of the SSD.
  • The present disclosure provides methods for preventing data loss during a power failure without the additional capacitor or battery in the SSD.
  • Fig. 3 illustrates the deferred write process at the NVRAM 130:
  • Step 1: The processor 110 receives a write request from the application 140;
  • Step 2: The processor 110 commits the write data to the NVRAM 130;
  • Step 3: The processor 110 acknowledges to the application 140 that the write is successful;
  • Step 4: At a later time, the processor 110 writes the data in the NVRAM 130 to the storage device 120 (deferred write);
  • Step 5: At a later time, the processor 110 deletes the data from the NVRAM 130 so the NVRAM space can be reused.
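  • A minimal sketch of this deferred write flow follows (the class, its method names, and the dict-based NVRAM stand-in are illustrative assumptions, not the disclosed implementation):

```python
class DeferredWriter:
    """Sketch of steps 1-5: acknowledge writes from NVRAM, flush to the storage device later."""

    def __init__(self, storage_device):
        self.nvram = {}                 # stands in for NVRAM 130 (battery-backed DRAM, NVDIMM, ...)
        self.storage = storage_device   # stands in for storage device 120 (assumed write() method)

    def handle_write(self, lba, data):
        self.nvram[lba] = data          # Step 2: commit the write data to NVRAM
        return "ACK"                    # Step 3: acknowledge before the data reaches the storage device

    def flush(self, lba):               # run at a later time
        self.storage.write(lba, self.nvram[lba])  # Step 4: deferred write to the storage device
        del self.nvram[lba]             # Step 5: reclaim NVRAM space (refined in Figs. 5 and 6)
```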
  • Fig. 4 illustrates the deferred write process at the storage device 120:
  • Step 4': The storage device 120 receives a write request from the processor;
  • Step 6: The storage device 120 stores the write data in its volatile memory 174;
  • Step 7: The storage device 120 acknowledges to the processor 110 that the write is successful;
  • Step 8: At a later time, the storage device 120 writes the data in the volatile memory 174 to its persistent medium 172.
  • If a power failure takes place after step 5 but before step 8, the write data will be lost. In order to prevent data loss, the present disclosure replaces step 5 with the following steps, as illustrated in Fig. 5:
  • Step 510: The processor 110 selects a data block in the NVRAM 130 to be deleted so its NVRAM space can be reused;
  • Step 520: The processor 110 checks if the data block is on the storage device's persistent medium 172 by issuing a "check block persistent" request to the storage device 120;
  • Step 530: If the storage device responds "yes", the processor 110 deletes the data block from the NVRAM 130;
  • Step 540: If the storage device responds "no", the processor 110 issues a "flush block" request to the storage device 120. Upon receiving the request, the storage device 120 writes the data block from its volatile memory 174 to its persistent medium 172 and acknowledges completion to the processor 110;
  • Step 550: Upon receiving the acknowledgement, the processor 110 deletes the data block from the NVRAM 130.
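  • Steps 510-550 can be sketched as follows, assuming a storage device object exposing hypothetical check_block_persistent() and flush_block() commands (the names and calling convention are assumptions, not the device's actual interface):

```python
def reclaim_nvram_block(nvram: dict, storage_device, lba) -> None:
    """Delete a block from NVRAM only once it is known to be on the persistent medium."""
    # Step 510: `lba` has been selected for deletion so its NVRAM space can be reused.
    if storage_device.check_block_persistent(lba):   # Step 520: ask the device
        del nvram[lba]                                # Step 530: already persistent -> safe to delete
    else:
        storage_device.flush_block(lba)               # Step 540: volatile memory -> persistent medium
        del nvram[lba]                                # Step 550: delete after the flush is acknowledged
```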
  • Fig. 6 illustrates another embodiment of the present disclosure by replacing step 5 with the following steps:
  • Step 610: The processor 110 selects a data block in the NVRAM 130 to be deleted so its space can be reused;
  • Step 620: The processor 110 issues a "make block persistent" request to the storage device 120. Upon receiving the request, the storage device 120 checks if the data block is on its persistent medium. If not, the storage device 120 writes the data block from its volatile memory 174 to its persistent medium 172. The storage device then acknowledges completion to the processor 110;
  • Step 630: The processor 110 receives the completion acknowledgement for the "make block persistent" request;
  • Step 640: The processor 110 deletes the data block from the NVRAM 130.
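  • The device-side half of steps 620-630 could look roughly like the sketch below (a toy in-memory model; the write() and make_block_persistent() names are assumptions for illustration only):

```python
class StorageDeviceSketch:
    """Toy model of storage device 120 with a volatile write cache and a persistent medium."""

    def __init__(self):
        self.volatile_memory = {}    # volatile memory 174
        self.persistent_medium = {}  # persistent medium 172

    def write(self, lba, data):
        self.volatile_memory[lba] = data          # Step 6 of Fig. 4
        return "ACK"                              # Step 7 of Fig. 4

    def make_block_persistent(self, lba):
        # Step 620: the device itself checks whether the block already reached the persistent
        # medium, so the processor needs only this single request instead of a check plus a flush.
        if lba not in self.persistent_medium:
            self.persistent_medium[lba] = self.volatile_memory.pop(lba)
        return "DONE"                             # Steps 630-640: processor may now delete the NVRAM copy
```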
  • In some embodiments, the NVRAM device 130 is configured to be much larger than the aggregate size of the volatile memory in the storage devices 120. This allows the processor 110 to delay deleting data from the NVRAM as long as possible, so the data is more likely to have been flushed to the persistent medium 172 before a check block persistent or make block persistent request.
  • In other embodiments, the storage device 120 always writes data from its volatile memory 174 to its persistent medium 172 first in, first out (FIFO). In these embodiments the processor may calculate whether a data block is on the persistent medium without issuing the check block persistent or make block persistent request.
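  • For the FIFO variant, one purely illustrative way the processor could infer persistence without any request is shown below; it assumes the device acknowledges writes in order and that its volatile memory holds at most a known number of blocks (both are assumptions, not requirements stated by the disclosure):

```python
class FifoPersistenceTracker:
    """Sketch: with strict FIFO destaging and a bounded volatile cache, any block that has been
    followed by at least `capacity` newer acknowledged writes can no longer be in the cache."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks  # upper bound on blocks held in the device's volatile memory
        self.sequence = 0                # monotonically increasing count of acknowledged writes
        self.sent_at = {}                # lba -> sequence number when the block was written to the device

    def record_acknowledged_write(self, lba) -> None:
        self.sequence += 1
        self.sent_at[lba] = self.sequence

    def is_certainly_persistent(self, lba) -> bool:
        # Enough newer writes have passed through the FIFO to have pushed this block out to the
        # persistent medium; otherwise conservatively assume it may still be in volatile memory.
        return lba in self.sent_at and self.sequence - self.sent_at[lba] >= self.capacity
```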
  • Fig. 7 illustrates one embodiment of a storage system 100 that includes a RAID (Redundant Array of Independent Disks) system 150 between the processor 110 and the storage devices 120.
  • The RAID system may be implemented as software RAID, a hardware RAID card, RAID on a chip, Erasure Coding, or JBOD (Just a Bunch of Disks).
  • The RAID system may be configured in write-through (WT) or write-back (WB) mode.
  • The RAID system 150 virtualizes multiple storage devices 120 into logical units.
  • The RAID system 150 may be implemented to distribute data blocks across the storage devices 120 (i.e., striping) and generate checksums (i.e., parity) for data redundancy and recovery.
  • The RAID system introduces read-modify-write operations that reduce write performance.
  • In some embodiments the storage devices 120 are SSDs. Small transactional writes to an SSD cause write amplification, which reduces SSD endurance and write performance.
  • The processor 110 maintains metadata 160 that maps every data block's LBA (Logical Block Address) to its PBA (Physical Block Address).
  • The LBA is the virtual block address assigned and accessed by the application 140, whereas the PBA represents the block's physical location in the RAID system 150.
  • Alternatively, the metadata 160 may map every data block's LBA to its content ID (e.g., content fingerprint) and every content ID to its PBA.
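  • A toy sketch of the two mapping variants (the dict layout and the SHA-256 fingerprint are illustrative assumptions; the disclosure does not prescribe either):

```python
import hashlib

class DirectMap:
    """Metadata 160, variant 1: LBA -> PBA."""
    def __init__(self):
        self.lba_to_pba = {}

    def record(self, lba, pba):
        self.lba_to_pba[lba] = pba

class ContentAddressedMap:
    """Metadata 160, variant 2: LBA -> content ID, content ID -> PBA."""
    def __init__(self):
        self.lba_to_cid = {}
        self.cid_to_pba = {}

    def record(self, lba, pba, data: bytes):
        cid = hashlib.sha256(data).hexdigest()  # content fingerprint (one possible choice)
        self.lba_to_cid[lba] = cid
        self.cid_to_pba[cid] = pba

    def lookup(self, lba):
        return self.cid_to_pba[self.lba_to_cid[lba]]
```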
  • Fig. 8 illustrates one embodiment wherein the storage devices 120 in the RAID system 150 are managed as one or more Beads 186. Each Bead comprises one or more contiguous RAID stripes 184. Each RAID stripe comprises one chunk 182 from each storage device.
  • The processor 110 is configured to fill one or more Beads (current Beads) before writing to new Beads.
  • Fig. 9 illustrates the write data flow:
  • Step 710: Upon receiving an application write request, the processor 110 commits the write data to the NVRAM 130 and acknowledges completion to the application 140. The processor 110 accumulates one or more data blocks in the NVRAM into one full RAID stripe 184;
  • Step 720: The processor 110 checks if the current Bead is filled;
  • Step 730: If not, the processor 110 writes the full RAID stripe in one or more transfers after the existing RAID stripes in the current Bead;
  • Step 740: If so, the processor 110 writes the RAID stripe in one or more transfers at the beginning of a new Bead;
  • Step 750: The processor 110 updates the metadata 160 to map the LBA of each data block in the RAID stripe to its PBA;
  • Step 760: At a later time, the processor 110 deletes the data blocks from the NVRAM so their NVRAM space can be reused.
  • The above write data flow, sketched in code below, ensures that the RAID system 150 receives mostly full stripe write (FSW) requests, which cause fewer or no read-modify-write operations. It also ensures that most data is written to each SSD in contiguous chunks (sequential writes), which reduces the SSD's write amplification.
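  • A compact sketch of steps 710-760 (Bead and stripe sizes, method names such as write_stripe() and open_new_bead(), and the dict-based NVRAM are all illustrative assumptions):

```python
class StripeAccumulator:
    """Sketch of Fig. 9: buffer data blocks in NVRAM and emit only full RAID stripes into Beads."""

    def __init__(self, raid, metadata, blocks_per_stripe: int, stripes_per_bead: int):
        self.raid = raid                        # RAID system 150 (assumed write_stripe/open_new_bead)
        self.metadata = metadata                # metadata 160: LBA -> PBA
        self.blocks_per_stripe = blocks_per_stripe
        self.stripes_per_bead = stripes_per_bead
        self.nvram = {}                         # NVRAM 130: lba -> data
        self.pending = []                       # LBAs accumulated toward the next full stripe
        self.stripes_in_current_bead = 0

    def write(self, lba, data) -> str:
        self.nvram[lba] = data                  # Step 710: commit to NVRAM, then acknowledge
        self.pending.append(lba)
        if len(self.pending) == self.blocks_per_stripe:
            self._emit_full_stripe()
        return "ACK"

    def _emit_full_stripe(self) -> None:
        if self.stripes_in_current_bead == self.stripes_per_bead:
            self.raid.open_new_bead()           # Steps 720/740: current Bead is filled, start a new one
            self.stripes_in_current_bead = 0
        stripe = [(lba, self.nvram[lba]) for lba in self.pending]
        pbas = self.raid.write_stripe(stripe)   # Steps 730/740: one full-stripe write, no read-modify-write
        for lba, pba in zip(self.pending, pbas):
            self.metadata[lba] = pba            # Step 750: map each block's LBA to its PBA
        for lba in self.pending:
            del self.nvram[lba]                 # Step 760: NVRAM space can be reused (may be deferred)
        self.pending.clear()
        self.stripes_in_current_bead += 1
```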
  • The processor 110 makes the metadata 160 persistent by writing metadata updates to the RAID system 150.
  • Metadata updates are typically of small transfer sizes and are another source of read-modify-writes and SSD write amplification.
  • Fig. 10 illustrates the metadata update data flow for reducing read-modify-writes and write amplification:
  • Step 810: The processor 110 commits the metadata update to the NVRAM 130;
  • Step 820: The processor 110 accumulates one or more metadata updates in the NVRAM into one RAID stripe 184;
  • Step 830: The processor 110 writes the RAID stripe in one or more transfers to the current Bead or a new Bead;
  • Step 840: The processor 110 updates the metadata index to the PBA of the on-disk metadata structure.
  • In some embodiments the metadata updates have their own Beads, separate from data Beads. In other embodiments the metadata updates are mixed with data in the same Beads. A brief sketch of this flow follows.
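  • The sketch below batches metadata updates into full stripes in the same way as the data path (write_stripe() and the index layout are the same illustrative assumptions as above):

```python
def flush_metadata_updates(nvram_updates, raid, metadata_index, entries_per_stripe):
    """Sketch of Fig. 10 (steps 810-840).

    nvram_updates      -- list of (key, update) pairs already committed to NVRAM (step 810)
    raid               -- RAID system 150, assumed to expose write_stripe()
    metadata_index     -- maps a metadata structure's key to the PBA where it now lives on disk
    entries_per_stripe -- number of updates that fit in one full RAID stripe 184
    """
    while len(nvram_updates) >= entries_per_stripe:
        batch = [nvram_updates.pop(0) for _ in range(entries_per_stripe)]  # Step 820: accumulate a stripe
        pbas = raid.write_stripe(batch)                                    # Step 830: full-stripe write to a Bead
        for (key, _update), pba in zip(batch, pbas):
            metadata_index[key] = pba                                      # Step 840: record the on-disk PBA
```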

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

A method for preventing data loss is disclosed, comprising first writing data to a non-volatile random access memory (NVRAM) device, then to a storage device with a volatile memory and a persistent medium, and checking that the data is on the persistent medium before deleting the data from the NVRAM. A method for reducing read-modify-write operations and write amplification is also disclosed, comprising first writing data to an NVRAM device, then accumulating data into full stripes before writing the data to a RAID system in full stripes and to each storage device in contiguous units in sequential order.
PCT/US2015/046534 2014-08-25 2015-08-24 NVRAM-enabled storage systems WO2016032955A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462041318P 2014-08-25 2014-08-25
US62/041,318 2014-08-25

Publications (2)

Publication Number Publication Date
WO2016032955A2 true WO2016032955A2 (fr) 2016-03-03
WO2016032955A3 WO2016032955A3 (fr) 2016-04-21

Family

ID=55400802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/046534 WO2016032955A2 (fr) NVRAM-enabled storage systems

Country Status (1)

Country Link
WO (1) WO2016032955A2 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106356097A (zh) * 2016-08-25 2017-01-25 浙江宇视科技有限公司 Protection method and device for preventing data loss
WO2018022091A1 (fr) * 2016-07-29 2018-02-01 Hewlett-Packard Development Company, L.P. Unlocking machine-readable storage devices using a user token
CN114201115A (zh) * 2021-12-14 2022-03-18 北京达佳互联信息技术有限公司 Data storage system and method, computer device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8074019B2 (en) * 2007-11-13 2011-12-06 Network Appliance, Inc. Preventing data loss in a storage system
US7761740B2 (en) * 2007-12-13 2010-07-20 Spansion Llc Power safe translation table operation in flash memory
US10346095B2 (en) * 2012-08-31 2019-07-09 Sandisk Technologies, Llc Systems, methods, and interfaces for adaptive cache persistence
US9081712B2 (en) * 2012-12-21 2015-07-14 Dell Products, L.P. System and method for using solid state storage systems as a cache for the storage of temporary data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018022091A1 (fr) * 2016-07-29 2018-02-01 Hewlett-Packard Development Company, L.P. Unlocking machine-readable storage devices using a user token
CN106356097A (zh) * 2016-08-25 2017-01-25 浙江宇视科技有限公司 Protection method and device for preventing data loss
CN106356097B (zh) * 2016-08-25 2020-02-14 浙江宇视科技有限公司 Protection method and device for preventing data loss
CN114201115A (zh) * 2021-12-14 2022-03-18 北京达佳互联信息技术有限公司 Data storage system and method, computer device, and storage medium

Also Published As

Publication number Publication date
WO2016032955A3 (fr) 2016-04-21

Similar Documents

Publication Publication Date Title
US9781227B2 (en) Lockless distributed redundant storage and NVRAM caching of compressed data in a highly-distributed shared topology with direct memory access capable interconnect
US8725934B2 (en) Methods and appratuses for atomic storage operations
US20190073296A1 (en) Systems and Methods for Persistent Address Space Management
EP2598996B1 (fr) Apparatus, system, and method for conditional and atomic storage operations
US8898376B2 (en) Apparatus, system, and method for grouping data stored on an array of solid-state storage elements
US10127166B2 (en) Data storage controller with multiple pipelines
US8495284B2 (en) Wear leveling for low-wear areas of low-latency random read memory
JP6208156B2 (ja) Replication of hybrid storage aggregates
US8782344B2 (en) Systems and methods for managing cache admission
US10019320B2 (en) Systems and methods for distributed atomic storage operations
US10810123B1 (en) Flush strategy for using DRAM as cache media system and method
US10019352B2 (en) Systems and methods for adaptive reserve storage
US9251052B2 (en) Systems and methods for profiling a non-volatile cache having a logical-to-physical translation layer
EP2802991B1 (fr) Systems and methods for managing cache admission
US20180081821A1 (en) Metadata Management in a Scale Out Storage System
US20140006685A1 (en) Systems, methods, and interfaces for managing persistent data of atomic storage operations
US20150095696A1 (en) Second-level raid cache splicing
US20210311652A1 (en) Using Segment Pre-Allocation to Support Large Segments
US20210311653A1 (en) Issuing Efficient Writes to Erasure Coded Objects in a Distributed Storage System with Two Tiers of Storage
US11467746B2 (en) Issuing efficient writes to erasure coded objects in a distributed storage system via adaptive logging
US8402247B2 (en) Remapping of data addresses for large capacity low-latency random read memory
WO2016032955A2 (fr) NVRAM-enabled storage systems
US20180307419A1 (en) Storage control apparatus and storage control method
US11314809B2 (en) System and method for generating common metadata pointers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15834910

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC , EPO FORM 1205A DATED 26.06.2017

122 Ep: pct application non-entry in european phase

Ref document number: 15834910

Country of ref document: EP

Kind code of ref document: A2