WO2016032955A2 - Storage systems supporting NVRAM - Google Patents
- Publication number
- WO2016032955A2 (PCT/US2015/046534)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nvram
- storage
- data
- persistent
- raid
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C8/00—Arrangements for selecting an address in a digital store
- G11C8/06—Address interface arrangements, e.g. address buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1004—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1441—Resetting or repowering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present disclosure relates generally to storage systems, and, more specifically, to non-volatile random access memory (NVRAM) enabled storage systems.
- NVRAM non-volatile random access memory
- SSDs Solid-state devices
- RAID Redundant Array of Independent Disks
- the RAID system may take the form of a hardware RAID card, RAID on a Chip, software RAID, Erasure Coding, or JBOD (Just a Bunch of Disks).
- Transactional applications typically issue read and write requests (I/O requests) that have small transfer sizes and are not in sequential block address order (collectively referred to as "random" I/O requests).
- SSDs typically service random read requests many times faster than traditional hard disk drives (HDDs).
- SSD write amplification also reduces write performance and SSD endurance.
- An SSD comprises a plurality of flash pages. An entire flash page must be "erased" before it can be rewritten. There is a write cycle limit on how many times a flash page can be erased and rewritten.
- a transactional application writes to an SSD in a RAID system
- its write request size will likely be much smaller than the SSD's flash page size, resulting in partially written flash pages. Consequently the SSD has to perform garbage collection by moving user data from one partially written flash page to another until an entire flash page contains no more user data and can be erased.
- Garbage collection turns each application write into multiple SSD writes, also known as write amplification. Given the write cycle limit on each flash page, write amplification significantly reduces SSD endurance and write performance. When an application or a storage system writes to an SSD in multiple small transfer sizes in sequential block address order (sequential writes), the SSD typically can fill entire flash pages with fewer partially written pages, reducing the write amplification during garbage collection.
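The effect described above can be illustrated with a simple model. This is only an illustrative sketch with hypothetical page and write sizes; the patent itself gives no formula.

```python
# Simple model of SSD write amplification (illustrative only; not from the patent).
# If application writes fill only a fraction of each flash page, garbage
# collection must relocate still-valid data before a page can be erased,
# so each application write costs more than one device write.

def write_amplification(page_size: int, write_size: int) -> float:
    """Estimate device bytes written per application byte when a
    'write_size'-byte write lands on 'page_size'-byte flash pages
    that must effectively be rewritten whole."""
    pages_touched = -(-write_size // page_size)  # ceiling division
    return (pages_touched * page_size) / write_size

# A 4 KiB random write onto a 16 KiB flash page rewrites the whole page:
waf_random = write_amplification(page_size=16384, write_size=4096)

# Sequential writes accumulated into a full page incur no extra rewrite:
waf_sequential = write_amplification(page_size=16384, write_size=16384)

print(waf_random, waf_sequential)  # 4.0 1.0
```

This is why the disclosure favors filling entire flash pages with sequential writes before handing data to the SSD.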
- An SSD typically comprises a persistent flash medium for storing data and a volatile memory to hold data temporarily before the data is committed to the persistent flash medium.
- a persistent flash medium for storing data
- a volatile memory to hold data temporarily before the data is committed to the persistent flash medium.
- the data stored in the volatile memory will be lost.
- some of the SSDs are equipped with a capacitor or battery, which provides enough power for flushing the data stored in the volatile memory to the persistent flash medium.
- the additional capacitor can significantly increase the cost of the SSDs.
- a storage system is configured to prevent data loss in the event of power failure.
- the storage system comprises a processor, one or more storage devices and a non-volatile memory (NVRAM).
- Each of the one or more storage devices comprises a persistent medium.
- the NVRAM device is configured to store one or more data blocks to be sent to a storage device for persistent storage.
- the processor is configured to check whether a data block stored on the NVRAM is also stored on the storage device's persistent medium before deleting the data block from the NVRAM.
- a storage system is configured to reduce read-modify-write operations and write amplification.
- the storage system comprises a processor, a RAID system with one or more storage devices, an NVRAM device, and a memory.
- the NVRAM device stores one or more data blocks that are to be sent to the RAID system for persistent storage.
- the memory stores metadata that maps every data block's logical block address (LBA) to its physical block address (PBA).
- LBA logical block address
- PBA physical block address
- the processor is configured to handle random write requests from an application. When handling random write requests, the processor first stores a data block in the NVRAM, then accumulates one or more data blocks into a full RAID stripe;
- the full RAID stripe is written to the RAID system to reduce read-modify-write operations.
- the metadata is updated to map the LBAs of the one or more data blocks to the PBAs of the one or more data blocks.
- the one or more data blocks are deleted from the NVRAM after the metadata has been updated.
- Figure 1 illustrates a block diagram of a storage system with NVRAM devices.
- Figure 2 illustrates a block diagram of a storage device with volatile memory and persistent medium.
- Figure 3 illustrates the deferred write process at the NVRAM.
- Figure 4 illustrates the deferred write process at the storage device.
- Figure 5 illustrates a flow diagram of the check block persistent process.
- Figure 6 illustrates a flow diagram of the make block persistent process.
- Figure 7 illustrates a block diagram of a storage system with a RAID system.
- Figure 8 illustrates a block diagram of the RAID system data layout.
- Figure 9 illustrates a flow diagram of NVRAM enabled writes to a RAID system.
- Figure 10 illustrates a flow diagram of NVRAM enabled metadata updates.
- a storage system has at least one NVRAM device to accomplish (1) preventing data loss in the event of a power failure; (2) reducing read-modify-write operations; and (3) reducing solid-state device write amplification.
- Fig. 1 illustrates one embodiment of a storage system 100 that includes a processor 110 and one or more storage devices 120.
- Examples of storage devices include solid-state devices (SSDs), hard disk drives (HDDs), and combinations of SSDs and HDDs (hybrid).
- the storage system 100 provides persistent storage to one or more user applications 140.
- the storage device 120 may be accessible by multiple storage systems 100 as a shared storage device.
- the application 140 and the storage system 100 may be running on the same physical system.
- the application 140 may access the storage system through a storage network such as FibreChannel, Ethernet, InfiniBand, and PCIe.
- the processor 110 interfaces between the application 140 and the storage device 120.
- the processor 110 controls and manages the storage device 120.
- the processor 110 may provide a set of commands for the application 140 to read from and write to the storage device 120.
- the processor 110 can provide redundancy, performance, and data services that often can't be achieved by the storage device 120.
- the storage system 100 includes one or more non-volatile random-access memory (NVRAM) devices 130.
- NVRAM non- volatile random-access memory
- Examples of NVRAM include battery-backed DRAM, NVDIMM, PCIe NVRAM card, and solid-state device.
- the processor 110, upon receiving a write request from the application 140, stores the write data in the NVRAM 130 and acknowledges to the application 140 that the write request is successful before the data is actually committed to the storage device 120. This process is known as a deferred write.
- Fig. 2 illustrates one embodiment of a storage device 120, such as a solid-state device (SSD), that comprises a persistent medium 172 for storing data and a volatile memory 174 for buffering data temporarily before the data is committed to the persistent medium 172.
- SSD solid-state device
- the data stored in the volatile memory 174 will be lost.
- an SSD is equipped with a capacitor or battery, which provides enough power to write all the data in the volatile memory 174 to the persistent medium 172.
- the capacitor or battery can significantly increase the cost of the SSD.
- the present disclosure provides methods for preventing data loss during power failure without the additional capacitor or battery in the SSD.
- Fig. 3 illustrates the deferred write process at the NVRAM 130:
- Step 1: The processor 110 receives a write request from the application 140;
- Step 2: The processor 110 commits the write data to the NVRAM 130;
- Step 3: The processor 110 acknowledges to the application 140 that the write is successful;
- Step 4: At a later time the processor 110 writes the data in the NVRAM 130 to the storage device 120 (deferred write);
- Step 5: At a later time the processor 110 deletes the data from the NVRAM 130 so the NVRAM space can be reused.
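The deferred write steps above can be sketched as follows. This is a minimal model for illustration only: the NVRAM and storage device are represented as plain dictionaries, and the function names are ours, not the patent's.

```python
# Minimal sketch of the deferred-write process of Fig. 3 (illustrative model).

nvram = {}     # NVRAM 130: block_id -> data
storage = {}   # storage device 120 (persistence simplified to one dict)

def handle_write(block_id, data):
    # Steps 1-3: commit to NVRAM and acknowledge the application immediately.
    nvram[block_id] = data
    return "ack"

def deferred_flush(block_id):
    # Step 4: at a later time, write the NVRAM copy to the storage device.
    storage[block_id] = nvram[block_id]
    # Step 5: delete from NVRAM so the space can be reused. (The disclosure
    # later replaces this step with a persistence check; see Figs. 5 and 6.)
    del nvram[block_id]

assert handle_write(1, b"payload") == "ack"
deferred_flush(1)
print(storage[1], 1 in nvram)  # b'payload' False
```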
- Fig. 4 illustrates the deferred write process at the storage device 120:
- Step 4': The storage device 120 receives a write request from the processor;
- Step 6: The storage device 120 stores the write data in its volatile memory 174;
- Step 7: The storage device 120 acknowledges to the processor 110 that the write is successful;
- Step 8: At a later time the storage device 120 writes the data in the volatile memory 174 to its persistent medium 172.
- If a power failure takes place after step 5 but before step 8, the write data will be lost. In order to prevent data loss, the present disclosure replaces step 5 with the following steps, as illustrated in Fig. 5:
- Step 510: The processor 110 selects a data block in the NVRAM 130 to be deleted so its NVRAM space can be reused;
- Step 520: The processor 110 checks if the data block is on the storage device's persistent medium 172 by issuing a "check block persistent" request to the storage device 120;
- Step 530: If the storage device responds "yes", the processor 110 deletes the data block from the NVRAM 130;
- Step 540: If the storage device responds "no", the processor 110 issues a "flush block" request to the storage device 120. Upon receiving the request, the storage device 120 writes the data block from its volatile memory 174 to its persistent medium 172 and acknowledges completion to the processor 110;
- Step 550: Upon receiving the acknowledgement, the processor 110 deletes the data block from the NVRAM 130.
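The check-before-delete protocol of Fig. 5 can be sketched as below. The device model (a volatile buffer plus a persistent dictionary) and all method names are assumptions of this sketch; the patent specifies only the request semantics, not an API.

```python
# Sketch of the "check block persistent" / "flush block" protocol (Fig. 5).

class Device:
    """Toy storage device 120 with volatile memory 174 and persistent medium 172."""
    def __init__(self):
        self.volatile = {}
        self.persistent = {}

    def write(self, bid, data):            # steps 4'/6/7: buffer and ack
        self.volatile[bid] = data

    def check_block_persistent(self, bid): # step 520, device side
        return bid in self.persistent

    def flush_block(self, bid):            # step 540, device side
        self.persistent[bid] = self.volatile.pop(bid)

def safe_delete(nvram, dev, bid):
    # Step 520: ask the device whether the block is on the persistent medium.
    if not dev.check_block_persistent(bid):
        dev.flush_block(bid)               # step 540: force it to the medium
    del nvram[bid]                         # steps 530/550: now safe to delete

nvram = {42: b"data"}
dev = Device()
dev.write(42, nvram[42])
safe_delete(nvram, dev, 42)
print(dev.check_block_persistent(42), 42 in nvram)  # True False
```

The key invariant: the NVRAM copy is only released once a durable copy is known to exist, so a power failure can never catch the data in volatile memory alone.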
- Fig. 6 illustrates another embodiment of the present disclosure by replacing step 5 with the following steps:
- Step 610: The processor 110 selects a data block in the NVRAM 130 to be deleted so its space can be reused;
- Step 620: The processor 110 issues a "make block persistent" request to the storage device 120. Upon receiving the request, the storage device 120 checks if the data block is on its persistent medium. If not, the storage device 120 writes the data block from its volatile memory 174 to its persistent medium 172. The storage device then acknowledges completion to the processor 110;
- Step 630: The processor 110 receives the completion acknowledgement for the "make block persistent" request;
- Step 640: The processor 110 deletes the data block from the NVRAM 130.
- the NVRAM device 130 is configured to be much larger than the aggregate size of the volatile memory in the storage devices 120. This allows the processor 110 to delay deleting data from the NVRAM as long as possible, so the data is more likely to have been flushed to the persistent medium 172 before a "check block persistent" or "make block persistent" request.
- the storage device 120 always writes data from its volatile memory 174 to its persistent medium 172 first in, first out (FIFO). In these embodiments the processor may calculate whether data is on the persistent medium without a "check block persistent" or "make block persistent" request.
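Under that FIFO assumption the persistence check reduces to an arithmetic comparison. The sequence-number and flushed-count interface below is an assumption of this sketch; the patent does not specify how the processor learns the device's flush progress.

```python
# Sketch of the FIFO optimization: if the device flushes its volatile memory
# strictly first-in-first-out, then knowing how many writes it has flushed
# tells the processor which blocks are persistent, with no per-block request.

def is_persistent(write_seqno: int, device_flushed_count: int) -> bool:
    """A write with 0-based FIFO sequence number 'write_seqno' is on the
    persistent medium once the device has flushed more than that many writes."""
    return device_flushed_count > write_seqno

# The block was the 5th write issued (seqno 4); the device has flushed 7
# writes so far, so the block must be persistent. Seqno 9 is still buffered:
print(is_persistent(4, 7), is_persistent(9, 7))  # True False
```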
- Fig. 7 illustrates one embodiment of a storage system 100 that includes a RAID (Redundant Array of Independent Disks) system 150 between the processor 110 and the storage device 120.
- RAID system includes software RAID, hardware RAID card, RAID on a chip, Erasure Coding, or JBOD (Just a Bunch of Disks).
- the RAID system may be configured in write through (WT) or write back (WB) mode.
- the RAID system 150 virtualizes multiple storage devices 120 into logical units.
- the RAID system 150 may be implemented to distribute data blocks across the storage devices 120 (i.e., striping) and generate checksums (i.e., parity) for data redundancy and recovery.
- the RAID system introduces read-modify-write operations that reduce write performance.
- the storage devices 120 are SSDs. Small transactional writes to an SSD cause write amplification, which reduces SSD endurance and write performance.
- the processor 110 maintains Metadata 160 that maps every data block's LBA (Logical Block Address) to its PBA (Physical Block Address).
- LBA is the virtual block address assigned and accessed by the application 140, whereas PBA represents the block's physical location in the RAID system 150.
- the metadata 160 may map every data block's LBA to its content ID (e.g., content fingerprint) and every content ID to its PBA.
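The two metadata layouts described above can be sketched as follows. The class names and the use of SHA-256 as the content fingerprint are assumptions of this sketch; the patent only says "content ID (e.g. content fingerprint)".

```python
# Sketch of the two metadata layouts: a direct LBA -> PBA map, and an
# LBA -> content-ID -> PBA map in which identical blocks share one PBA.
import hashlib

class DirectMap:
    def __init__(self):
        self.lba_to_pba = {}

    def update(self, lba, pba):
        self.lba_to_pba[lba] = pba

    def lookup(self, lba):
        return self.lba_to_pba[lba]

class ContentIdMap:
    def __init__(self):
        self.lba_to_cid = {}
        self.cid_to_pba = {}

    def update(self, lba, data, pba):
        cid = hashlib.sha256(data).hexdigest()  # content fingerprint
        self.lba_to_cid[lba] = cid
        self.cid_to_pba.setdefault(cid, pba)    # duplicate content keeps its first PBA

    def lookup(self, lba):
        return self.cid_to_pba[self.lba_to_cid[lba]]

m = ContentIdMap()
m.update(lba=0, data=b"hello", pba=100)
m.update(lba=7, data=b"hello", pba=200)   # same content as LBA 0
print(m.lookup(0), m.lookup(7))           # 100 100
```

The indirection through content IDs is what makes deduplication possible: two LBAs holding identical data resolve to the same physical block.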
- Fig. 8 illustrates one embodiment wherein the storage devices 120 in the RAID system 150 are managed as one or more Beads 186. Each Bead comprises one or more contiguous RAID stripes 184. Each RAID stripe comprises one chunk 182 from each storage device.
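The chunk/stripe/Bead hierarchy of Fig. 8 can be sketched with a small address-decomposition function. The sizes and the linear PBA formula are illustrative assumptions (parity chunks are ignored for simplicity).

```python
# Sketch of the Fig. 8 data layout: one chunk per device forms a RAID stripe;
# contiguous stripes form a Bead. Sizes below are hypothetical.

CHUNK_BLOCKS = 4        # data blocks per chunk 182
N_DEVICES = 3           # data devices per stripe (parity omitted)
STRIPES_PER_BEAD = 2    # contiguous stripes 184 per Bead 186

BLOCKS_PER_STRIPE = CHUNK_BLOCKS * N_DEVICES
BLOCKS_PER_BEAD = BLOCKS_PER_STRIPE * STRIPES_PER_BEAD

def locate(pba: int):
    """Decompose a physical block address into (bead, stripe-in-bead, device, offset)."""
    bead, r = divmod(pba, BLOCKS_PER_BEAD)
    stripe, r = divmod(r, BLOCKS_PER_STRIPE)
    device, offset = divmod(r, CHUNK_BLOCKS)
    return bead, stripe, device, offset

print(locate(0))    # (0, 0, 0, 0)  first block of the first Bead
print(locate(13))   # (0, 1, 0, 1)  second stripe of Bead 0
print(locate(24))   # (1, 0, 0, 0)  first block of Bead 1
```

Because addresses within a chunk are contiguous on one device, filling Beads in order yields the sequential per-device write pattern the disclosure relies on.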
- the processor 110 is configured to fill one or more Beads (current Beads) before writing to new Beads.
- Fig. 9 illustrates the write data flow:
- Step 710: Upon receiving an application write request, the processor 110 commits the write data to the NVRAM 130 and acknowledges completion to the application 140;
- the processor 110 accumulates one or more data blocks in the NVRAM 130 into one full RAID stripe 184;
- Step 720: The processor 110 checks if the current Bead is filled;
- Step 730: If not, the processor 110 writes the full RAID stripe in one or more transfers after the existing RAID stripes in the current Bead;
- Step 740: If yes, the processor 110 writes the RAID stripe in one or more transfers at the beginning of a new Bead;
- Step 750: The processor 110 updates the metadata 160 to map the LBA of each data block in the RAID stripe to its PBA;
- Step 760: At a later time the processor 110 deletes the data blocks from the NVRAM so their NVRAM space can be reused.
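The write flow above can be sketched end to end. The data structures are illustrative assumptions; Bead boundary handling is simplified to an append-only log of stripes, since for this sketch a new Bead simply continues the sequence.

```python
# End-to-end sketch of the Fig. 9 write flow: commit to NVRAM, accumulate a
# full stripe, append it as one full-stripe write, update LBA -> PBA metadata,
# then free the NVRAM copies.

STRIPE_BLOCKS = 4                 # data blocks per RAID stripe (hypothetical)

nvram, metadata, raid = [], {}, []   # raid: list of written stripes

def app_write(lba, data):
    nvram.append((lba, data))        # step 710: commit to NVRAM and ack
    if len(nvram) >= STRIPE_BLOCKS:  # a full stripe has accumulated
        flush_stripe()

def flush_stripe():
    stripe = nvram[:STRIPE_BLOCKS]
    # Steps 720-740: issue one full-stripe write (no read-modify-write needed).
    stripe_index = len(raid)
    raid.append([d for _, d in stripe])
    # Step 750: map each block's LBA to its new PBA.
    for i, (lba, _) in enumerate(stripe):
        metadata[lba] = stripe_index * STRIPE_BLOCKS + i
    # Step 760: reuse the NVRAM space.
    del nvram[:STRIPE_BLOCKS]

for lba in range(8):
    app_write(lba, f"blk{lba}".encode())

print(len(raid), metadata[5], len(nvram))  # 2 5 0
```

Note that the RAID layer only ever sees whole stripes, which is exactly what eliminates the read-modify-write penalty described next.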
- the above write data flow ensures that the RAID system 150 receives mostly full stripe write (FSW) requests, which cause fewer or no read-modify-write operations. It also ensures that most data is written to each SSD in contiguous chunks (sequential writes), which reduces the SSD's write amplification.
- FSW full stripe write
- the processor 110 makes the metadata 160 persistent by writing metadata updates to the RAID system 150.
- Metadata updates are typically of small transfer sizes and are another source for read-modify-writes and SSD write amplification.
- Fig. 10 illustrates the metadata update data flow for reducing read-modify-writes and write amplification:
- Step 810: The processor 110 commits a metadata update to the NVRAM 130;
- Step 820: The processor 110 accumulates one or more metadata updates in the NVRAM into one RAID stripe 184;
- Step 830: The processor 110 writes the RAID stripe in one or more transfers to the current Bead or a new Bead;
- Step 840: The processor 110 updates the metadata index to the PBA of the on-disk metadata structure;
- the metadata updates have their own Beads separate from data Beads. In other embodiments the metadata updates are mixed with data in the same Beads.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computer Security & Cryptography (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
A method for preventing data loss is disclosed, comprising first writing data to a non-volatile random access memory (NVRAM) device, then to a storage device with volatile memory and a persistent medium, and verifying that the data is on the persistent medium before deleting the data from the NVRAM. Also disclosed is a method for reducing read-modify-write operations and write amplification, comprising first writing data to an NVRAM device, then accumulating the data into full stripes before writing it to a RAID system as full-stripe writes and to each storage device in contiguous, sequential units.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201462041318P | 2014-08-25 | 2014-08-25 | |
US62/041,318 | 2014-08-25 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2016032955A2 true WO2016032955A2 (fr) | 2016-03-03 |
WO2016032955A3 WO2016032955A3 (fr) | 2016-04-21 |
Family
ID=55400802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2015/046534 WO2016032955A2 (fr) | 2014-08-25 | 2015-08-24 | Systèmes de stockage prenant en charge la nvram |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016032955A2 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106356097A (zh) * | 2016-08-25 | 2017-01-25 | 浙江宇视科技有限公司 | 一种防止数据丢失的保护方法和装置 |
WO2018022091A1 (fr) * | 2016-07-29 | 2018-02-01 | Hewlett-Packard Development Company, L.P. | Déverrouillage de dispositifs de stockage lisibles par machine à l'aide d'un jeton utilisateur |
CN114201115A (zh) * | 2021-12-14 | 2022-03-18 | 北京达佳互联信息技术有限公司 | 数据存储系统、方法、计算机设备及存储介质 |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8074019B2 (en) * | 2007-11-13 | 2011-12-06 | Network Appliance, Inc. | Preventing data loss in a storage system |
US7761740B2 (en) * | 2007-12-13 | 2010-07-20 | Spansion Llc | Power safe translation table operation in flash memory |
US10346095B2 (en) * | 2012-08-31 | 2019-07-09 | Sandisk Technologies, Llc | Systems, methods, and interfaces for adaptive cache persistence |
US9081712B2 (en) * | 2012-12-21 | 2015-07-14 | Dell Products, L.P. | System and method for using solid state storage systems as a cache for the storage of temporary data |
-
2015
- 2015-08-24 WO PCT/US2015/046534 patent/WO2016032955A2/fr active Application Filing
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018022091A1 (fr) * | 2016-07-29 | 2018-02-01 | Hewlett-Packard Development Company, L.P. | Déverrouillage de dispositifs de stockage lisibles par machine à l'aide d'un jeton utilisateur |
CN106356097A (zh) * | 2016-08-25 | 2017-01-25 | 浙江宇视科技有限公司 | 一种防止数据丢失的保护方法和装置 |
CN106356097B (zh) * | 2016-08-25 | 2020-02-14 | 浙江宇视科技有限公司 | 一种防止数据丢失的保护方法和装置 |
CN114201115A (zh) * | 2021-12-14 | 2022-03-18 | 北京达佳互联信息技术有限公司 | 数据存储系统、方法、计算机设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
WO2016032955A3 (fr) | 2016-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9781227B2 (en) | Lockless distributed redundant storage and NVRAM caching of compressed data in a highly-distributed shared topology with direct memory access capable interconnect | |
US8725934B2 (en) | Methods and appratuses for atomic storage operations | |
US20190073296A1 (en) | Systems and Methods for Persistent Address Space Management | |
EP2598996B1 (fr) | Appareil, système et procédé d'opérations de stockage conditionnel et atomique | |
US8898376B2 (en) | Apparatus, system, and method for grouping data stored on an array of solid-state storage elements | |
US10127166B2 (en) | Data storage controller with multiple pipelines | |
US8495284B2 (en) | Wear leveling for low-wear areas of low-latency random read memory | |
JP6208156B2 (ja) | ハイブリッドストレージ集合体の複製 | |
US8782344B2 (en) | Systems and methods for managing cache admission | |
US10019320B2 (en) | Systems and methods for distributed atomic storage operations | |
US10810123B1 (en) | Flush strategy for using DRAM as cache media system and method | |
US10019352B2 (en) | Systems and methods for adaptive reserve storage | |
US9251052B2 (en) | Systems and methods for profiling a non-volatile cache having a logical-to-physical translation layer | |
EP2802991B1 (fr) | Systèmes et procédés pour la gestion de l'admission dans une antémémoire | |
US20180081821A1 (en) | Metadata Management in a Scale Out Storage System | |
US20140006685A1 (en) | Systems, methods, and interfaces for managing persistent data of atomic storage operations | |
US20150095696A1 (en) | Second-level raid cache splicing | |
US20210311652A1 (en) | Using Segment Pre-Allocation to Support Large Segments | |
US20210311653A1 (en) | Issuing Efficient Writes to Erasure Coded Objects in a Distributed Storage System with Two Tiers of Storage | |
US11467746B2 (en) | Issuing efficient writes to erasure coded objects in a distributed storage system via adaptive logging | |
US8402247B2 (en) | Remapping of data addresses for large capacity low-latency random read memory | |
WO2016032955A2 (fr) | Systèmes de stockage prenant en charge la nvram | |
US20180307419A1 (en) | Storage control apparatus and storage control method | |
US11314809B2 (en) | System and method for generating common metadata pointers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15834910 Country of ref document: EP Kind code of ref document: A2 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC , EPO FORM 1205A DATED 26.06.2017 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15834910 Country of ref document: EP Kind code of ref document: A2 |