WO2015072925A1 - Method for hot i/o selective placement and metadata replacement for non-volatile memory cache on hybrid drive or system - Google Patents

Method for hot i/o selective placement and metadata replacement for non-volatile memory cache on hybrid drive or system

Info

Publication number
WO2015072925A1
Authority
WO
WIPO (PCT)
Prior art keywords
accordance
block
hot
metadata
nvm
Prior art date
Application number
PCT/SG2014/000534
Other languages
French (fr)
Inventor
Yong Hong WANG
Rajesh VELLORE ARUMUGAM
Chun Teck Lim
Kyawt Kyawt KHAING
Qingsong WEI
Cheng Chen
Jun Yang
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research
Publication of WO2015072925A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/06 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0638 Combination of memories, e.g. ROM and RAM such as to permit replacement or supplementing of words in one module by words in another module
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0616 Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647 Migration mechanisms
    • G06F3/0649 Lifecycle management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Definitions

  • the field of the present invention relates generally to optimizing cache management on hybrid drive storage systems and metadata management on systems with byte-addressable non-volatile memory (NVM). More particularly, the present invention relates to methods and computer programs for hot I/O selective placement in non-volatile (NV) flash cache on hybrid drives, and metadata replacement in NVM on a computer or storage system.
  • NV non-volatile
  • Hybrid disk drives are a kind of storage device that places a small amount of flash memory (non-volatile SSD) inside the hard disk drive. Flash memory devices were originally developed for consumer electronics devices and were then applied to hybrid drives in data center enterprise storage systems, a new paradigm that offers the advantages of a hybrid storage solution with optimized economic benefits.
  • the present invention describes a method for hot I/O selective placement, metadata organization and replacement on a hybrid drive or a system comprising a non-volatile memory (NVM) and disk media.
  • the method includes sorting metadata on the NVM based on frequency and/or number of I/O requests.
  • the method includes identifying hot data and cold metadata on the NVM and the disk media in response to the sorting of the metadata.
  • the method further includes evicting cold metadata from NVM to disk while picking hot data for caching by moving from the disk media to the NVM.
  • FIG. 1 is an illustration diagram for a hybrid storage system in accordance with a present embodiment
  • FIG. 2 is a block diagram of a hybrid drive emulation framework in accordance with the present embodiment
  • FIG. 3 depicts a mechanism for dynamic hot data area selection in accordance with the present embodiment
  • FIG. 4 depicts a flow diagram of hot data area statistical profile setting in accordance with the present embodiment
  • FIG. 5 depicts a flow diagram of the working flow of a hot I/O selective algorithm for placement of data into a NV cache in accordance with the present embodiment
  • FIG. 6 illustrates an in-memory metadata management which only maintains used inodes in accordance with the present embodiment
  • FIG. 7 illustrates metadata organized as an inode-list in accordance with the present embodiment
  • FIG. 8 illustrates metadata organized as a block-list in accordance with the present embodiment
  • FIG. 9 illustrates metadata organized as dual lists in accordance with the present embodiment: an inode list and a block shadow list;
  • FIG. 10 illustrates a K-distance inode hotness search for a victim shadow block in accordance with the present embodiment
  • FIG. 11 illustrates an in-memory metadata structure organized as a tree and a dual-list in accordance with the present embodiment
  • FIG. 12 illustrates a cold inode block staging to a disk in accordance with the present embodiment.
  • FIG. 1 illustrates the environment of applying an array of hybrid disk drives 106 in an enterprise storage system 104, which is used to provision storage services to a large number of applications.
  • a hybrid storage system with flash storage provided as NV cache is a type of storage architecture that can deliver multiple benefits in terms of performance improvement, reliability enhancement, and power consumption reduction.
  • a hybrid drive manager 108 needs to decide what and how to put frequently accessed data or hot data into such cache.
  • As the NV cache capacity is relatively larger than that of a conventional RAM-based cache, the data record has to be persistent.
  • This NV cache arrangement has different characteristics from conventional cache methods: the cost of putting a data record into NV cache is higher because it results in a flash memory write, a persistent metadata update, and wear leveling processing. Therefore, in accordance with the present embodiment, methods of selective cache placement are introduced in order to improve cache performance.
  • a selective cache placement algorithm and functionality is provided in the hybrid drive manager 108, which is incorporated in the storage system controller 104.
  • FIG. 2 demonstrates the framework of a hybrid drive cache manager emulation setup 200.
  • a software module on Linux was developed in order to emulate a working environment of a hybrid drive in accordance with the present embodiment.
  • Real storage devices, including a standalone Hard Disk Drive (HDD) and a Solid State Drive (SSD), are used to handle I/O requests which are prepared from a reputable workload trace 202.
  • This emulation setting contains the design of cache management in terms of cache information record interval tree operation and incoming data statistics measurement.
  • One real Serial AT Attachment (SATA) HDD and one SSD are attached to emulate the working environment of a hybrid drive. With this emulation framework, improved results were obtained by applying the algorithm of the present embodiment to hybrid drive cache management 208, which confirm that the design delivers the targeted benefit.
  • SATA Serial AT Attachment
  • The hybrid drive statistics table 206 is used as a repository for application data traffic statistics that help the hybrid drive manager decide on the placement of hot I/O requests into NV flash memory.
  • The statistics table 206 keeps both cumulative I/O statistics and measuring-window-based active I/O statistics for selective cache placement in the NV cache on a hybrid drive.
  • FIG. 3 depicts a mechanism for dynamic hot data area selection using cache placement algorithm in accordance with the present embodiment
  • a hybrid drive manager monitors and measures I/O hotness statistics across a set of pre-set address zones.
  • address zones with fixed equal size e.g. 1GB
  • LBA Logical Block Addressing
  • All I/O requests that fall in one particular zone are tracked for the statistics measurement of that zone.
  • The hybrid drive manager first filters out sequential I/O requests, which are not suitable to be put into cache: for such requests a hybrid drive, by design, does not deliver better performance than a traditional HDD, and putting sequential I/O requests into cache results in cache pollution.
  • The hot I/O candidate is fed into the hybrid drive manager as input, and the manager performs the hot I/O selection algorithm based on the information retrieved from the statistics table.
  • The manager first updates the cumulative measurement 304 for the zone 302 that the incoming I/O falls in and compares it with a threshold that is derived across all other statistical zones.
  • This threshold can be considered as the average cumulative I/O measurement of all zones within a hybrid drive.
  • An I/O is selected, and can later be put into NV flash by the manager, only when the cumulative measurement for its zone, such as the total number of incoming I/Os, is above the threshold.
  • Such a measurement is an effective approach that captures the characteristics of the application workload in a coarse way: if the total number of I/Os within a particular zone is significant, future cacheable I/O is likely to arrive again within this hot zone.
  • This measurement is not sufficient, as it misses active workloads of relatively short duration.
  • Even when an address zone does not have a cumulative I/O measurement that, by the above definition, qualifies it as a hot zone, certain types of I/O may demonstrate a high degree of repetitiveness within an active working period; such measurements therefore also need to be captured.
  • Apart from the cumulative I/O measurement, the hybrid drive manager also applies a method to capture hot I/O from the measurement within an active window 306, which is a period of time.
  • An active window 306 is restarted periodically, once the total number of I/Os reaching a hybrid drive equals a pre-decided maximum value. This is equivalent to forming a small active hot area within each non-hot zone. If an input I/O falls in such a hot area, then this I/O is also considered hot and is ready to be put into flash cache.
  • This active measurement is strongly linked to the cumulative measurement: the larger the cumulative measurement of a non-hot zone 302, the wider the active hot area within this zone 302.
  • a simple method of calculating the width of an active hot area with the formula below is provided. The width is exponentially inversely proportional to the ratio of the cumulative measurement of a zone to the average cumulative measurement of all valid zones. If the cumulative measurement of that zone is equal to or larger than the average cumulative measurement, then the width of the active hot area becomes the full zone size, which matches the cumulative measurement selective method.
  • AIO_span = Zone_size × 2^(−[…]) [exponent illegible in the source text]
  • FIG. 4 and FIG. 5 depict flow diagrams of setting hot data statistical characteristics in statistics table and select hot I/O for cache placement in accordance with the present embodiment.
  • FIG. 4 illustrates the update of hot I/O statistics in the table (flowchart 400) in accordance with the present embodiment.
  • When a possible hot I/O, e.g. random I/O, is input into the hybrid storage system (step 402), the hybrid drive manager first locates the address zone that contains the current input I/O (step 404). Depending on the profile of the application workload, not every zone contains I/O input. If the current I/O is the first I/O to arrive at the identified zone (step 406), then the manager updates its count of non-empty I/O zones, which is the number of valid zones (step 410). The manager also begins to update the cumulative I/O statistics, which comprise information for both the hybrid drive disk and the current zone (step 408).
  • The manager also performs the active measurement update for one active window, which also includes information for both the hybrid drive disk and the current zone (step 410).
  • For the active measurement, the number of I/Os and the average address of I/Os are kept for each zone.
  • The manager maintains a counter for the period of one active window. Once this counter reaches a pre-defined maximum value, the hybrid drive manager resets it and starts a new active window (step 414 and step 416).
  • The hybrid drive manager then has better knowledge of the hot area statistics profile and can make an informed decision to select hot I/O for caching (step 420).
  • FIG. 5 describes the hot I/O selective method for placement in NV cache (flowchart 500) in accordance with the present embodiment.
  • When a possible hot I/O, e.g. random I/O, is input into the hybrid storage system (step 502), the hybrid drive manager first locates the address zone that contains the current input I/O (step 504). As the statistics measurement has already been built, the manager can acquire the cumulative measurement information for the current address zone and check whether the zone is hot (step 506). If the current zone is hot, the manager marks the current I/O as hot for NV cache placement (step 510). Otherwise, the manager further checks the active measurement and identifies the active hot area within this non-hot zone (step 514). If the current I/O is within the active hot area, the manager can still select this I/O (step 516) and place it into cache (step 520).
  • a possible hot I/O e.g. random I/O
  • The hybrid drive manager incurs less overhead per I/O in terms of moving and evicting (pin/unpin) operations, including metadata updates to the NV cache of the hybrid drive, without sacrificing the cache hit rate.
  • This reduction in pin operations results in less wear of the hybrid drive NV flash, which improves the lifetime of the internal flash.
  • The reduction in pin operations might lead to reduced Garbage Collection (GC) activity by the Flash Translation Layer (FTL) of the internal flash. This should lead to improved performance in terms of Input/Output Operations Per Second (IOPS), because GC interferes less with I/O.
  • GC Garbage Collection
  • FTL Flash Translation layer
  • Metadata on disk is organized as a block 600 in accordance with the present embodiment.
  • Current in-memory metadata is also organized as blocks so that metadata can be flushed back to disk.
  • This block-based approach wastes memory space.
  • One memory page (4096 bytes) is allocated even for a single inode (128 bytes).
  • A byte-addressable approach is used to maintain only used inodes, excluding unused inodes.
  • FIG. 7 shows an inode-based list and replacement 700 in accordance with the present embodiment. Ordering and replacement are based on inode 706. The coldest inode (the inode at the tail) is replaced. This approach has good temporal locality (it is efficient at identifying hot and cold inodes). However, it does not consider spatial locality. A block device such as a disk or SSD does not support a single-inode write, only a block write.
  • FIG. 8 shows a block-based replacement 800 in accordance with the present embodiment. Inodes 806 belonging to one block 804 are organized as a group. Ordering and replacement are based on blocks. Any inode access results in the whole block moving to the head.
  • A hot block may contain cold inodes, which results in low memory utilization.
  • The other problem is early eviction. Hot inodes in the victim block will be flushed. This may result in a whole-block read for a single inode miss, which is an expensive penalty.
  • FIG. 9 shows a dual list as an example of Locality-aware Metadata Organization and Replacement 900 in accordance with the present embodiment.
  • the dual list includes an inode list 902 and a block shadow list 904.
  • metadata inode 906 is maintained as an LRU list (fine granularity) where a recently accessed inode is placed at the head of the list.
  • In the block shadow list 904, inodes are grouped based on block association to capture spatial locality (coarse granularity). It is a shadow list that does not hold real inodes. The list is ordered on the basis of block hotness.
  • Block Hotness indicates the hotness of a block containing multiple inodes.
  • the Block Hotness is able to leverage both temporal and spatial locality.
  • This dual-list embodiment has two advantages over single-list architectures. Firstly, the dual-list embodiment provides high memory utilization. Secondly, the dual-list embodiment enables better sequentiality over single-list architectures.
  • FIG. 10 shows an example of a K-distance inode hotness search 1000 in accordance with the present embodiment.
  • a block on the tail of the shadow block list will be selected to be flushed to disk when more memory space is needed. Instead of evicting all the inodes of the block, hot inodes in this block are kept in the non-volatile memory (NVM).
  • NVM non-volatile memory
  • A K-distance inode search is performed: the inode list is searched up to a distance of K from the head 1006. If the inode is found within K-distance, it is hot and is kept in the NVM. Otherwise, the inode is evicted 1008.
  • FIG. 11 shows an in-memory metadata structure as a combination of a tree and a dual list for fast inode lookup and block location 1100 in accordance with the present embodiment.
  • an inode number is obtained by searching namespace (step 1104).
  • The corresponding block number is calculated based on block size and inode size (step 1106).
  • With the block number, the corresponding shadow block 1114 can be quickly found by searching the tree from the root 1108.
  • the inode can be found on the inode list 1116. The accessed inode will be moved to head and its shadow block including block hotness will be updated accordingly.
  • FIG. 12 shows cold inode block staging to disk 1200 in accordance with the present embodiment.
  • a block on the tail of the shadow block list is selected as a victim block. All the inodes of this block are flushed to disk sequentially. Hot inodes of this victim are kept in the NVM 1202, but the other inodes of this victim are removed from the NVM 1202 to make space for incoming metadata.
  • Hybrid Drive Emulator HDE
  • Host/Server Host I/O Candidate e.g. Random I/O
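The dual-list organization and the K-distance search described for FIGs. 9 to 12 above can be illustrated with a minimal Python sketch. The class name `DualListMetadata`, its method names, and the use of an access counter as the "block hotness" are assumptions made for illustration; the text does not define how block hotness is computed, and the tree index of FIG. 11 is replaced here by a plain dictionary.

```python
from collections import OrderedDict

INODE_SIZE = 128                              # bytes, as stated in the text
BLOCK_SIZE = 4096                             # bytes, one memory page
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 32 inodes per block


class DualListMetadata:
    """Hypothetical dual list: an inode LRU list plus a block shadow list."""

    def __init__(self, k_distance):
        self.k = k_distance
        # inode number -> True; the LAST entry is the head (most recent)
        self.inode_lru = OrderedDict()
        # block number -> {"hotness": int, "inodes": set of inode numbers}
        self.shadow = {}

    def access(self, ino):
        # Move the accessed inode to the head of the inode list.
        self.inode_lru[ino] = True
        self.inode_lru.move_to_end(ino)
        # Block number derived from block size and inode size.
        blk = ino // INODES_PER_BLOCK
        entry = self.shadow.setdefault(blk, {"hotness": 0, "inodes": set()})
        entry["hotness"] += 1   # ASSUMED hotness metric: access count
        entry["inodes"].add(ino)

    def stage_coldest_block(self):
        """Flush the coldest shadow block; keep inodes within K of the head."""
        blk = min(self.shadow, key=lambda b: self.shadow[b]["hotness"])
        members = self.shadow[blk]["inodes"]
        near_head = set(list(self.inode_lru)[-self.k:]) if self.k > 0 else set()
        kept = members & near_head          # hot inodes stay in memory
        for ino in members - kept:          # cold inodes are evicted
            del self.inode_lru[ino]
        flushed = sorted(members)           # whole block is written in order
        if kept:
            self.shadow[blk]["inodes"] = kept
        else:
            del self.shadow[blk]
        return blk, flushed, sorted(kept)
```

In this sketch every inode of the victim block is reported as flushed, while only the inodes outside the K-distance window are dropped from memory, which mirrors the staging behaviour described for FIG. 12.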

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention describes a method for hot I/O selective placement, metadata organization and replacement on a hybrid drive or a system comprising a non-volatile memory (NVM) and disk media. The method includes sorting metadata on the NVM based on frequency and/or number of I/O requests. The method includes identifying hot data and cold metadata on the NVM and the disk media in response to the sorting of the metadata. The method further includes evicting cold metadata from NVM to disk while picking hot data for caching by moving from the disk media to the NVM.

Description

Method for Hot I/O Selective Placement and Metadata Replacement For Non-volatile Memory Cache on Hybrid Drive or System
Field of Invention
[0001] The field of the present invention relates generally to optimizing cache management on hybrid drive storage systems and metadata management on systems with byte-addressable non-volatile memory (NVM). More particularly, the present invention relates to methods and computer programs for hot I/O selective placement in non-volatile (NV) flash cache on hybrid drives, and metadata replacement in NVM on a computer or storage system.
Background to the Invention
[0002] With the proliferation of solid state disk (SSD) storage incorporated into conventional storage systems, advantages can be achieved in terms of performance improvement, reliability enhancement, and power consumption reduction. Hybrid disk drives are a kind of storage device that places a small amount of flash memory (non-volatile SSD) inside the hard disk drive. Flash memory devices were originally developed for consumer electronics devices and were then applied to hybrid drives in data center enterprise storage systems, a new paradigm that offers the advantages of a hybrid storage solution with optimized economic benefits.
[0003] In hybrid storage systems with a separate NV flash cache, hot data, or frequently accessed data, is conventionally stored in that cache. As the NV cache capacity is relatively larger than that of a conventional RAM-based cache, the data record can be maintained for a long period of time. The cost of putting a data record into NV cache is also high, as this results in a flash memory write, a persistent metadata update, and wear leveling processing. Thus, it is important to select and store truly hot data in flash cache in order to improve the cache hit rate.
[0004] Conventionally, cache management algorithms focus on how to identify I/O data that needs to be replaced from cache memory, but they fail to address cache placement, which is how to efficiently select hot I/O data before it is moved into cache. Furthermore, conventional Least Recently Used (LRU) types of cache-all algorithms do not meet the requirements of hybrid drive cache management, as the cost and complexity of control commands such as pin or unpin are high.
[0005] What is needed is a method for hot I/O selective placement by hybrid drive cache management which can help to improve operation efficiency and reduce data written to flash for data intensive applications in data centers. Furthermore, other desirable features and characteristics will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and this background of the disclosure.
[0006] File system performance is dominated by metadata access because metadata is small and frequently accessed. Metadata is stored as blocks on the hard disk drive. A partial metadata update results in a whole-block read or write, which significantly amplifies disk I/O. The huge performance gap between CPU and disk aggravates this problem. With the availability of byte-addressable non-volatile memory (NVM), putting metadata in NVM can accelerate the file system. However, directly applying previous metadata organization is not cost-effective for persistent NVM. Fine-grained metadata organization is desirable for NVM. On the other hand, metadata staging is necessary because NVM may not be large enough to hold all the metadata as file system volumes grow, and because NVM may incur physical failure.
Summary of Invention
[0007] The present invention describes a method for hot I/O selective placement, metadata organization and replacement on a hybrid drive or a system comprising a non-volatile memory (NVM) and disk media. The method includes sorting metadata on the NVM based on frequency and/or number of I/O requests. The method includes identifying hot data and cold metadata on the NVM and the disk media in response to the sorting of the metadata. The method further includes evicting cold metadata from NVM to disk while picking hot data for caching by moving from the disk media to the NVM.
Brief Description of Drawings
[0008] The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to illustrate various embodiments and to explain various principles and advantages in accordance with the present invention.
[0009] FIG. 1 is an illustration diagram for a hybrid storage system in accordance with a present embodiment;
[0010] FIG. 2 is a block diagram of a hybrid drive emulation framework in accordance with the present embodiment;
[0011] FIG. 3 depicts a mechanism for dynamic hot data area selection in accordance with the present embodiment;
[0012] FIG. 4 depicts a flow diagram of hot data area statistical profile setting in accordance with the present embodiment;
[0013] FIG. 5 depicts a flow diagram of the working flow of a hot I/O selective algorithm for placement of data into a NV cache in accordance with the present embodiment;
[0014] FIG. 6 illustrates an in-memory metadata management which only maintains used inodes in accordance with the present embodiment;
[0015] FIG. 7 illustrates metadata organized as an inode-list in accordance with the present embodiment;
[0016] FIG. 8 illustrates metadata organized as a block-list in accordance with the present embodiment;
[0017] FIG. 9 illustrates metadata organized as dual lists in accordance with the present embodiment: an inode list and a block shadow list;
[0018] FIG. 10 illustrates a K-distance inode hotness search for a victim shadow block in accordance with the present embodiment;
[0019] FIG. 11 illustrates an in-memory metadata structure organized as a tree and a dual-list in accordance with the present embodiment; and
[0020] FIG. 12 illustrates a cold inode block staging to a disk in accordance with the present embodiment.
[0021] Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures illustrating the hybrid drive system may be exaggerated relative to other elements to help to improve understanding of the present and alternate embodiments.
Detailed Description
[0022] The following detailed description is merely exemplary in nature and is not intended to limit the invention or the application and uses of the invention. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description.
[0023] FIG. 1 illustrates the environment of applying an array of hybrid disk drives 106 in an enterprise storage system 104, which is used to provision storage services to a large number of applications. As mentioned above, a hybrid storage system with flash storage provided as NV cache is a type of storage architecture that can deliver multiple benefits in terms of performance improvement, reliability enhancement, and power consumption reduction. With a separate NV flash cache, a hybrid drive manager 108 needs to decide what frequently accessed data, or hot data, to put into the cache and how. As the NV cache capacity is relatively larger than that of a conventional RAM-based cache, the data record has to be persistent. This NV cache arrangement has different characteristics from conventional cache methods: the cost of putting a data record into NV cache is higher because it results in a flash memory write, a persistent metadata update, and wear leveling processing. Therefore, in accordance with the present embodiment, methods of selective cache placement are introduced in order to improve cache performance.
For example, a selective cache placement algorithm and functionality is provided in the hybrid drive manager 108, which is incorporated in the storage system controller 104.
[0024] FIG. 2 demonstrates the framework of a hybrid drive cache manager emulation setup 200. A software module on Linux was developed in order to emulate the working environment of a hybrid drive in accordance with the present embodiment. Real storage devices, including a standalone Hard Disk Drive (HDD) and a Solid State Drive (SSD), are used to handle I/O requests which are prepared from a reputable workload trace 202. This emulation setting contains the design of cache management in terms of cache information record interval tree operation and incoming data statistics measurement. One real Serial AT Attachment (SATA) HDD and one SSD are attached to emulate the working environment of a hybrid drive. With this emulation framework, improved results were obtained by applying the algorithm of the present embodiment to hybrid drive cache management 208, which confirm that the design delivers the targeted benefit.
[0025] In this illustrated framework, one of the key components is the hybrid drive statistics table 206. It is used as a repository for application data traffic statistics that help the hybrid drive manager decide on the placement of hot I/O requests into NV flash memory. The statistics table 206 keeps both cumulative I/O statistics and measuring-window-based active I/O statistics for selective cache placement in the NV cache on a hybrid drive.
[0026] FIG. 3 depicts a mechanism for dynamic hot data area selection using a cache placement algorithm in accordance with the present embodiment. In the present embodiment, a hybrid drive manager monitors and measures I/O hotness statistics across a set of preset address zones. For illustration, address zones of fixed, equal size, e.g. 1 GB, are allocated and shown as Logical Block Addressing (LBA) Address Zones 302. All I/O requests that fall in a particular zone are tracked for the statistics measurement of that zone. The hybrid drive manager first filters out sequential I/O requests, which are not suitable for caching: for such requests a hybrid drive, by design, does not deliver better performance than a traditional HDD, and putting sequential I/O requests into the cache results in cache pollution. The remaining hot I/O candidates are then fed into the hybrid drive manager, which performs the hot I/O selection algorithm based on information retrieved from the statistics table. The manager first updates the cumulative measurement 304 for the zone 302 in which the incoming I/O falls and compares it with a threshold derived across all other statistical zones. This threshold can be regarded as the average cumulative I/O measurement of all zones within the hybrid drive. An I/O is selected, and can later be put into NV flash by the manager, only when the cumulative measurement for its zone, such as the total number of incoming I/Os, is above the threshold. This measurement captures the characteristics of the application workload in a coarse way: if the total number of I/Os within a particular zone is significant, future cacheable I/Os are likely to arrive in the same hot zone. However, this measurement alone is not sufficient, as it misses workloads that are active for only a relatively short period. Even when the cumulative I/O measurement of an address zone does not qualify it as a hot zone under the above definition, certain types of I/O may show a high degree of repetitiveness within an active working period, so such measurements must also be captured.
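The cumulative measurement and its average-based threshold can be sketched as follows. This is a minimal illustration only: the class name `CumulativeZoneStats`, the 512-byte sector size, and the choice to average over zones that have received at least one I/O are assumptions, not details taken from the specification.

```python
from collections import defaultdict

ZONE_SIZE_BYTES = 1 << 30   # 1 GB zones, as in the example above
SECTOR_BYTES = 512          # ASSUMED sector size for LBA-to-byte conversion


class CumulativeZoneStats:
    """Per-zone cumulative I/O counters with an average-based hot threshold."""

    def __init__(self):
        self.zone_io = defaultdict(int)  # zone index -> cumulative I/O count
        self.total_io = 0                # cumulative I/O count for the whole drive

    @staticmethod
    def zone_of(lba):
        """Map an LBA to the index of the fixed-size zone that contains it."""
        return (lba * SECTOR_BYTES) // ZONE_SIZE_BYTES

    def record(self, lba):
        """Count one non-sequential I/O and return its zone index."""
        zone = self.zone_of(lba)
        self.zone_io[zone] += 1
        self.total_io += 1
        return zone

    def threshold(self):
        """Average cumulative count over zones that have seen at least one I/O."""
        return self.total_io / len(self.zone_io) if self.zone_io else 0.0

    def is_hot_zone(self, zone):
        """A zone is hot when its cumulative count is above the average."""
        return self.zone_io.get(zone, 0) > self.threshold()
```

With nine I/Os in zone 0 and one in zone 1, the threshold is 5, so zone 0 is hot and zone 1 is not.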
[0027] Apart from the cumulative I/O measurement, the hybrid drive manager also has a method to capture hot I/O from the measurement within an active window 306, which is a period of time. The active window 306 is restarted each time the total number of I/Os reaching the hybrid drive equals a predetermined maximum value. This is equivalent to forming a small active hot area within each non-hot zone. If an input I/O falls in such a hot area, the I/O is also considered hot and ready to be put into the flash cache. The active measurement is strongly linked to the cumulative measurement: the larger the cumulative measurement of a non-hot zone 302, the wider the active hot area within that zone 302. The present embodiment provides a simple method of calculating the width of an active hot area with the formula below. The width depends exponentially on the ratio between the cumulative measurement of the zone and the average cumulative measurement of all valid zones, shrinking as the zone's measurement falls below the average. If the cumulative measurement of the zone is equal to or larger than the average cumulative measurement, the width of the active hot area becomes the full zone size, which matches the cumulative-measurement selection method.
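AIO_span = Zone_size × 2^(−x), where the exponent x is illegible in the source text; it depends on the ratio between the cumulative measurement of the zone and the average cumulative measurement of all valid zones.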
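The behaviour described for the active hot area width can be illustrated with a short sketch. The exact exponent of the formula is not legible in this text, so the form used here, `zone_size * 2 ** -(avg_io / zone_io - 1)` capped at the zone size, is an assumption chosen only to reproduce the two stated properties: the width grows with the zone's cumulative measurement, and it equals the full zone size once that measurement reaches the average. The function name and parameters are hypothetical.

```python
def active_hot_area_width(zone_size, zone_io, avg_io):
    """Width of the active hot area inside a zone, in the unit of zone_size.

    ASSUMED form: zone_size * 2 ** -(avg_io / zone_io - 1), capped at
    zone_size. The exponent in the original formula may differ.
    """
    if zone_io <= 0:
        return 0.0                      # a zone without I/O has no hot area
    if zone_io >= avg_io:
        return float(zone_size)         # hot zone: the whole zone qualifies
    return zone_size * 2.0 ** -(avg_io / zone_io - 1.0)
```

Under this assumed form, a zone with half the average cumulative count gets a hot area half the zone wide, and a zone with a quarter of the average gets one eighth of the zone.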
[0028] FIG. 4 and FIG. 5 depict flow diagrams for setting hot data statistical characteristics in the statistics table and for selecting hot I/O for cache placement in accordance with the present embodiment.
[0029] FIG. 4 illustrates the update of hot I/O statistics in the table (flowchart 400) in accordance with the present embodiment. When a possible hot I/O, e.g. a random I/O, is input into the hybrid storage system (step 402), the hybrid drive manager first locates the address zone that contains the current input I/O (step 404). Depending on the application workload profile, not every zone receives I/O input. If the current I/O is the first I/O to arrive in the identified zone (step 406), the manager updates its count of non-empty I/O zones, which is the number of valid zones (step 410). The manager also updates the cumulative I/O statistics, which comprise information for both the hybrid drive disk and the current zone (step 408). After that, the manager performs the active measurement update for one active window, which likewise includes information for both the hybrid drive disk and the current zone (step 412). For the active measurement, the number of I/Os and the average address of I/Os are kept for each zone. In particular, the manager maintains a counter over the period of one active window. Once this counter reaches a predefined maximum value, the hybrid drive manager resets the statistics and starts a new active window (step 414 and step 416). After the statistics update (step 418), the hybrid drive manager has better knowledge of the hot area statistics profile and can make an informed decision when selecting hot I/O for caching (step 420).
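The update flow of flowchart 400 can be sketched as below. The names `ZoneEntry` and `StatsTable`, the running-mean update of the average address, and the decision to count the I/O that triggers a restart as the first I/O of the new window are assumptions made for illustration; the step numbers in the comments refer to the reference numerals of the flowchart.

```python
from dataclasses import dataclass, field


@dataclass
class ZoneEntry:
    cumulative_io: int = 0        # I/Os seen in this zone since the start
    window_io: int = 0            # I/Os seen in the current active window
    window_avg_lba: float = 0.0   # mean address of I/Os in the current window


@dataclass
class StatsTable:
    zone_sectors: int             # zone size in sectors
    window_max_io: int            # I/O count after which the window restarts
    total_io: int = 0             # cumulative I/O count for the whole drive
    window_total_io: int = 0      # I/O count for the drive in this window
    zones: dict = field(default_factory=dict)

    def update(self, lba):
        zone_id = lba // self.zone_sectors                 # step 404
        entry = self.zones.get(zone_id)
        if entry is None:                                  # steps 406 and 410
            entry = self.zones[zone_id] = ZoneEntry()
        entry.cumulative_io += 1                           # step 408
        self.total_io += 1
        self.window_total_io += 1                          # step 412
        if self.window_total_io > self.window_max_io:      # step 414
            for e in self.zones.values():                  # step 416
                e.window_io, e.window_avg_lba = 0, 0.0
            self.window_total_io = 1
        entry.window_io += 1                               # step 418
        entry.window_avg_lba += (lba - entry.window_avg_lba) / entry.window_io
        return zone_id

    @property
    def valid_zones(self):
        """Number of zones that have received at least one I/O."""
        return len(self.zones)
```

Keeping only a count and a running mean per zone keeps the table small: each zone needs three numbers regardless of how many I/Os it receives.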
[0030] FIG. 5 describes the hot I/O selective method for placement in the NV cache (flowchart 500) in accordance with the present embodiment. When a possible hot I/O, e.g. a random I/O, is input into the hybrid storage system (step 502), the hybrid drive manager first locates the address zone that contains the current input I/O (step 504). As the statistics measurement has already been built, the manager can acquire the cumulative measurement information for the current address zone and check whether the zone is hot (step 506). If the current zone is hot, the manager flags the current I/O as hot for NV cache placement (step 510). Otherwise, the manager further checks the active measurement and identifies the active hot area within this non-hot zone (step 514). If the current I/O is within the active hot area, the manager can still select this I/O (step 516) and place it into the cache (step 520).
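The two-stage decision of flowchart 500 can be sketched as a single function. The text does not state where the active hot area sits inside a zone; centring it on the average address of the zone's I/Os in the current active window is an assumption, as are the function and parameter names.

```python
def select_hot_io(lba, zone_io, avg_io, window_avg_lba, hot_width):
    """Return True when the I/O at `lba` should be flagged hot.

    zone_io         cumulative I/O count of the zone containing `lba`
    avg_io          average cumulative I/O count over all valid zones
    window_avg_lba  mean LBA of this zone's I/Os in the current active window
    hot_width       width of the active hot area for this zone, in LBAs
    """
    if zone_io > avg_io:            # steps 506 to 510: the whole zone is hot
        return True
    half = hot_width / 2.0          # steps 512 to 514: ASSUMED centring
    return (window_avg_lba - half) <= lba <= (window_avg_lba + half)
```

A result of True corresponds to setting the hot flag (steps 510 and 516) and False to clearing it (step 518); the I/O is then passed on in either case (step 520).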
[0031] By applying such selective placement, the hybrid drive manager incurs less overhead per I/O in terms of moving and evicting (pin/unpin) operations, including metadata updates to the NV cache of the hybrid drive, without sacrificing the cache hit rate. The reduction in pin operations (up to 100%) results in less wear of the hybrid drive's NV flash, which improves the lifetime of the internal flash. Furthermore, the reduction in pin operations may reduce Garbage Collection (GC) activity by the Flash Translation Layer (FTL) of the internal flash. This should improve performance in terms of Input/Output Operations Per Second (IOPS), because GC interferes less with I/O.
[0032] In FIG. 6, metadata on disk is organized as blocks 600 in accordance with the present embodiment. Current in-memory metadata is also organized as blocks so that metadata can be flushed back to disk. This block-based approach wastes memory space: one memory page (4096 bytes) is allocated even for a single inode (128 bytes). To save memory space, a byte-addressable approach is used that maintains only the used inodes and excludes unused inodes.
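The memory saving can be made concrete with a small calculation. The sketch below compares the two approaches for a given number of used inodes per block; the function names are hypothetical, and the byte-addressable figure ignores any index or bitmap overhead, which is a simplifying assumption.

```python
PAGE_BYTES = 4096    # one memory page per metadata block
INODE_BYTES = 128    # size of one inode


def block_based_bytes(used_inodes_per_block):
    """Memory used when every block with a used inode occupies a full page."""
    return sum(PAGE_BYTES for used in used_inodes_per_block if used > 0)


def byte_addressable_bytes(used_inodes_per_block):
    """Memory used when only the used inodes are kept (index overhead ignored)."""
    return sum(used * INODE_BYTES for used in used_inodes_per_block)
```

For four blocks holding 1, 0, 4 and 32 used inodes, the block-based layout needs 12288 bytes while keeping only the used inodes needs 4736 bytes.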
[0033] FIG. 7 shows an inode-based list and replacement 700 in accordance with the present embodiment. Ordering and replacement are based on the inode 706. The coldest inode (the inode at the tail) is replaced. This approach has good temporal locality (it identifies hot and cold inodes efficiently). However, it does not consider spatial locality: a block device such as a disk or SSD does not support a single-inode write, only a block write.
[0034] FIG. 8 shows a block-based replacement 800 in accordance with the present embodiment. Inodes 806 belonging to one block 804 are organized as a group. Ordering and replacement are based on the block. Any inode access moves the whole block to the head. If a block is selected as a victim, all inodes of this block are replaced. This approach has good spatial locality (block association among inodes). However, block-based replacement also has problems. One problem is memory pollution: a hot block may contain cold inodes, which results in low memory utilization. The other problem is early eviction: hot inodes in the victim block are flushed, so a later miss on one inode may require a whole-block read, which is an expensive penalty.
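The early-eviction problem of block-based replacement can be seen in a short sketch. The class `BlockLRU` and its capacity parameter are hypothetical, and plain LRU ordering of blocks is assumed; the point is only that evicting a block evicts every inode in it, whatever the individual inode's history.

```python
from collections import OrderedDict

INODES_PER_BLOCK = 32  # 4096-byte block / 128-byte inode


class BlockLRU:
    """Block-granularity LRU: any inode access moves its whole block to the head."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> set of cached inode numbers

    def access(self, inode_no):
        """Touch an inode and return the inodes evicted as a side effect."""
        block_no = inode_no // INODES_PER_BLOCK
        evicted = []
        if block_no not in self.blocks and len(self.blocks) >= self.capacity:
            _, victims = self.blocks.popitem(last=False)  # coldest block
            evicted = sorted(victims)
        self.blocks.setdefault(block_no, set()).add(inode_no)
        self.blocks.move_to_end(block_no)   # the last item acts as the head
        return evicted
```

With room for two blocks, touching inodes 0, 1 and 40 and then inode 70 evicts block 0 as a whole, so inodes 0 and 1 both leave the cache even if one of them was accessed often.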
[0035] In view of the above problems, a dual-list based replacement is proposed in the present embodiment. FIG. 9 shows a dual list as an example of Locality-aware Metadata Organization and Replacement 900 in accordance with the present embodiment. The dual list includes an inode list 902 and a block shadow list 904. In the inode list 902, metadata inodes 906 are maintained as an LRU list (fine granularity) in which a recently accessed inode is placed at the head of the list. In the block shadow list 904, inodes are grouped by block association to capture spatial locality (coarse granularity). It is a shadow list that does not hold the real inodes. The list is ordered on the basis of block hotness, which indicates the hotness of a block containing multiple inodes. Block hotness leverages both temporal and spatial locality. This dual-list embodiment has two advantages over single-list architectures. First, it provides high memory utilization. Second, it enables better sequentiality.
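A dual list of this kind might be sketched as follows. The text does not define how block hotness is computed, so counting accesses to the inodes of a block is an assumption, as are the class name `DualList` and the use of a sort in place of an incrementally ordered list.

```python
from collections import OrderedDict


class DualList:
    """Inode LRU list plus a shadow list of blocks ordered by hotness."""

    def __init__(self, inodes_per_block=32):
        self.inodes_per_block = inodes_per_block
        self.inode_lru = OrderedDict()  # inode number -> metadata; last item = head
        self.shadow = {}                # block number -> {"inodes", "hotness"}

    def access(self, inode_no, metadata=None):
        """Record an inode access in both lists."""
        block_no = inode_no // self.inodes_per_block
        self.inode_lru[inode_no] = metadata
        self.inode_lru.move_to_end(inode_no)        # head of the inode list
        entry = self.shadow.setdefault(block_no, {"inodes": set(), "hotness": 0})
        entry["inodes"].add(inode_no)               # shadow reference, not the inode
        entry["hotness"] += 1                       # ASSUMED hotness metric

    def blocks_by_hotness(self):
        """Block numbers from hottest to coldest; the last one is the victim."""
        return sorted(self.shadow,
                      key=lambda b: self.shadow[b]["hotness"],
                      reverse=True)
```

Accessing inodes 0, 1, 2, 40, 41 and 70 yields blocks with hotness 3, 2 and 1, mirroring the ordering of blocks 910, 912 and 914 in the reference numerals.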
[0036] FIG. 10 shows an example of a K-distance inode hotness search 1000 in accordance with the present embodiment. When more memory space is needed, a block at the tail of the shadow block list is selected to be flushed to disk. Instead of evicting all the inodes of the block, the hot inodes in this block are kept in the non-volatile memory (NVM). To identify the hot inodes of the evicted block, a K-distance inode search is performed: the inode list is searched within a distance K from the head 1006. If an inode is found within the K-distance, it is hot and is kept in the NVM. Otherwise, the inode is evicted 1008.
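The K-distance search can be sketched as a partition of the victim block's inodes. The function name and the representation of the inode list as a Python list ordered from head to tail are assumptions for illustration.

```python
def split_victim_block(inode_list, victim_inodes, k):
    """Split a victim block's inodes into (kept_hot, evicted_cold).

    inode_list     inode numbers ordered from head (most recent) to tail
    victim_inodes  inode numbers that belong to the victim block
    k              number of positions from the head that count as hot
    """
    near_head = set(inode_list[:k])   # inodes within K-distance of the head
    hot = [i for i in victim_inodes if i in near_head]
    cold = [i for i in victim_inodes if i not in near_head]
    return hot, cold
```

For an inode list `[7, 33, 2, 8, 5, 1]`, a victim block holding inodes 1, 2 and 5, and K = 3, only inode 2 lies within three positions of the head, so it stays in NVM while inodes 1 and 5 are evicted.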
[0037] FIG. 11 shows an in-memory metadata structure as a combination of a tree and a dual list for fast inode lookup and block location 1100 in accordance with the present embodiment. When a file is opened (step 1102), an inode number is obtained by searching the namespace (step 1104). The corresponding block number is calculated based on the block size and inode size (step 1106). With the block number, the corresponding shadow block 1114 can be found quickly by searching the tree from the root 1108. Following the pointer and inode offset, the inode can be found on the inode list 1116. The accessed inode is moved to the head, and its shadow block, including the block hotness, is updated accordingly.
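The lookup path can be illustrated with a short sketch. It assumes that inodes are packed contiguously starting at block 0, which real file system layouts may not satisfy, and it uses a nested dictionary as a stand-in for the tree and the shadow block pointers; the function names are hypothetical.

```python
def locate(inode_no, inode_size=128, block_size=4096):
    """Return (block number, inode offset within the block) for an inode.

    ASSUMES inodes are packed contiguously from block 0.
    """
    per_block = block_size // inode_size
    return inode_no // per_block, inode_no % per_block


def find_inode(shadow_blocks, inode_no, inode_size=128, block_size=4096):
    """Look up an inode via its shadow block; a dict stands in for the tree."""
    block_no, offset = locate(inode_no, inode_size, block_size)
    shadow = shadow_blocks.get(block_no)   # tree search from the root
    if shadow is None:
        return None                        # block not cached in memory
    return shadow.get(offset)              # follow pointer and inode offset
```

With 128-byte inodes and 4096-byte blocks, inode 70 maps to block 2 at offset 6.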
[0038] FIG. 12 shows cold inode block staging to disk 1200 in accordance with the present embodiment. A block at the tail of the shadow block list is selected as a victim block. All the inodes of this block are flushed to disk sequentially. The hot inodes of this victim are kept in the NVM 1202, while its other inodes are removed from the NVM 1202 to make space for incoming metadata.
[0039] It should further be appreciated that the exemplary embodiments are only examples, and are not intended to limit the scope, applicability, dimensions, or configuration of the invention in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment of the invention, it being understood that various changes may be made in the function and arrangement of elements and method of fabrication described in an exemplary embodiment without departing from the scope of the invention as set forth in the appended claims.
[0040] Reference numerals
100 Hybrid Storage System
102 Application Servers
104 Enterprise Storage System
106 Hybrid Disk Array
108 RAID controller with Hybrid Drive Manager
200 Hybrid drive emulation framework
202 I/O input
204 I/O FIFO Queue
206 Statistics Table
208 Hybrid Drive Cache manager (HDM)
210 Hybrid Drive Emulator (HDE)
212 Raw Device Access
300 Mechanism for dynamic hot data area selection
302 LBA Address Zoning
304 Cumulative I/O profiling
306 Active Window I/O profiling
400 Flowchart for hot data area statistical profile setting
402 I/O from Host/Server (Hot I/O Candidate e.g. Random I/O)
404 Get current address zone
406 Check whether the first I/O is in this zone
408 Set cumulative I/O number for both Hybrid disk and current zone
410 Update total non-empty I/O Zone number
412 Set active window I/O number for Hybrid disk
414 Check whether Re-start Active window now
416 Reset active window for all address zones
418 Set zone LBA address average and I/O number for current active Window
420 Direct this I/O to Hybrid Disk Drive Cache manager
500 Flowchart for hot I/O selective algorithm for placement in NV cache
502 I/O from Host/Server (Hot I/O Candidate e.g. Random I/O)
504 Get current address zone
506 Compare number of I/O in current zone with average number of I/O of all zones
508 Check whether the number is larger than average
510 Set hot flag on current I/O
512 Calculate hot data area for current non-hot zone
514 Check whether current I/O is in hot data area
516 Set hot flag on current I/O
518 Clear hot flag on current I/O
520 Direct this I/O to Hybrid Disk Drive
600 In-memory metadata management which only maintains used inodes
602 Inode Bitmap
604 Memory
606 File system Layout
700 Metadata organized as an inode-list
702 NVM space
704 Inode LRU List
706 Inode
800 Metadata organized as block-list
802 NVM space
804 Block
806 Inode
900 Metadata organized as dual lists: an inode list and a block shadow list
902 Inode List
904 Block Shadow List
906 Inode
908 Inode shadow
910 Block: Hotness 3
912 Block: Hotness 2
914 Block: Hotness 1
1000 K-distance inode hotness search for the victim shadow block
1002 Inode List
1004 Block Shadow List
1006 K-distance for inode hotness search
1008 Block selected by K-distance inode hotness search
1100 In-memory metadata structure organized as a tree and a dual-list
1102 Open a Filename
1104 Get Inode number
1106 Calculate Block number
1108 Root Node
1110 Middle Node
1112 Leaf Node
1114 Block Shadow List
1116 Inode List
1200 Cold inode block staging to disk
1202 Non Volatile Memory
1204 Memory Controller
1206 Non Volatile Memory File System
1208 I/O Controller

Claims

What is claimed is:
1. A method for hot I/O selective placement, metadata organization and replacement on a hybrid drive or a system comprising a non-volatile memory (NVM) and disk media, the method comprising:
sorting metadata on the NVM based on frequency and/or number of I/O requests;
identifying hot data and cold metadata on the NVM and the disk media in response to the sorting of the metadata; and
evicting cold metadata from NVM to disk while picking hot data for caching by moving from the disk media to the NVM.
2. The method in accordance with claim 1 wherein the sorting step comprises
receiving an I/O request addressed to a specific area of a data block region of the hybrid drive;
obtaining the number of I/O requests in the specific area;
comparing the number of I/O requests in the specific area with a threshold value; and
updating a hot flag information of the specific area based on the results of the comparing step.
3. The method in accordance with claim 2,
wherein the threshold value is an average of the number of I/O requests across all other statistical zones.
4. The method in accordance with claim 2,
wherein the hot flag information of the specific area is saved in a statistics table of the hybrid drive.
5. The method in accordance with claim 2,
wherein the number of I/O requests in the specific area is saved in a statistics table of the hybrid drive.
6. The method in accordance with claim 5, further comprising
updating information regarding the number of I/O requests in the specific area in a statistics table of the hybrid drive.
7. The method in accordance with claim 2, further comprising
measuring the number of I/O requests within an active window,
wherein the active window is a period of time which restarts periodically once the total number of I/O requests reaching a hybrid drive equals a predetermined maximum value.
8. The method in accordance with claim 7, further comprising
calculating the width of the active window using the following formula:
AIO = Zone_size × 2^(−x), where the exponent x is illegible in the source text.
9. The method in accordance with claim 1, wherein the sorting step comprises
maintaining metadata/inodes on the NVM in a Least Recently Used (LRU) list;
identifying hot inodes within the LRU list;
defining a shadow block list of the inodes by grouping the inodes based on their block association; and
identifying hot data zones comprising one or more blocks within the shadow block list that are associated with the hot inodes.
10. The method in accordance with claim 9, wherein the identifying step comprises identifying hot inodes within a K-distance from a head of the LRU list.
11. The method in accordance with claim 10 further comprising selecting a block on a tail of the shadow block list to be flushed to the disk media when more memory space is needed in the NVM.
12. The method in accordance with claim 11 further comprising flushing inodes of the selected block onto the disk media sequentially.
13. The method in accordance with claim 12 further comprising removing the flushed inodes from the NVM to make space for incoming metadata.
14. The method in accordance with claim 1, further comprising:
obtaining an inode number for a file by opening a filename associated with the file;
calculating a block number corresponding to the inode number;
finding a shadow block in a block shadow list in response to the block number; and
finding an inode in the inode list in response to the shadow block.
15. The method in accordance with claim 14, wherein the calculating step is based on a block size.
16. The method in accordance with claim 14, wherein the calculation step is based on an inode size.
17. The method in accordance with claim 14, wherein the finding corresponding shadow block step comprises searching a tree structure from a root of the tree using the block number.
18. The method in accordance with claim 14, wherein the finding inodes step comprises following pointers in the corresponding shadow block.
19. The method in accordance with claim 14, wherein the finding inodes step comprises following inode offsets in the corresponding shadow block.
PCT/SG2014/000534 2013-11-14 2014-11-14 Method for hot i/o selective placement and metadata replacement for non-volatile memory cache on hybrid drive or system WO2015072925A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG201308424 2013-11-14
SG201308424-9 2013-11-14

Publications (1)

Publication Number Publication Date
WO2015072925A1 true WO2015072925A1 (en) 2015-05-21

Family

ID=53057743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2014/000534 WO2015072925A1 (en) 2013-11-14 2014-11-14 Method for hot i/o selective placement and metadata replacement for non-volatile memory cache on hybrid drive or system

Country Status (1)

Country Link
WO (1) WO2015072925A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100088459A1 (en) * 2008-10-06 2010-04-08 Siamak Arya Improved Hybrid Drive
US20100268874A1 (en) * 2006-06-30 2010-10-21 Mosaid Technologies Incorporated Method of configuring non-volatile memory for a hybrid disk drive
US20110153931A1 (en) * 2009-12-22 2011-06-23 International Business Machines Corporation Hybrid storage subsystem with mixed placement of file contents
US20120317338A1 (en) * 2011-06-09 2012-12-13 Beijing Fastweb Technology Inc. Solid-State Disk Caching the Top-K Hard-Disk Blocks Selected as a Function of Access Frequency and a Logarithmic System Time
US20130091319A1 (en) * 2011-10-05 2013-04-11 Byungcheol Cho Cross-boundary hybrid and dynamic storage and memory context-aware cache system
US8560759B1 (en) * 2010-10-25 2013-10-15 Western Digital Technologies, Inc. Hybrid drive storing redundant copies of data on disk and in non-volatile semiconductor memory based on read frequency

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110727399A (en) * 2015-09-18 2020-01-24 华为技术有限公司 Storage array management method and device
CN110727399B (en) * 2015-09-18 2021-09-03 华为技术有限公司 Storage array management method and device
CN107728938A (en) * 2017-09-18 2018-02-23 暨南大学 A kind of cold data Placement Strategy based on frequency association under low energy consumption cluster environment
CN107728938B (en) * 2017-09-18 2020-06-16 暨南大学 Cold data placement strategy based on frequency correlation under low-energy-consumption cluster environment
CN107844269A (en) * 2017-10-17 2018-03-27 华中科技大学 A kind of layering mixing storage system and method based on uniformity Hash
CN107844269B (en) * 2017-10-17 2020-06-02 华中科技大学 Hierarchical hybrid storage system based on consistent hash
CN109960668A (en) * 2017-12-22 2019-07-02 爱思开海力士有限公司 The semiconductor devices that wear leveling for managing non-volatile memory part operates
CN109960668B (en) * 2017-12-22 2024-02-02 爱思开海力士有限公司 Semiconductor device for managing wear leveling operation of nonvolatile memory device
CN112667149A (en) * 2020-12-04 2021-04-16 北京浪潮数据技术有限公司 Data heat sensing method, device, equipment and medium
CN112667149B (en) * 2020-12-04 2023-12-29 北京浪潮数据技术有限公司 Data heat sensing method, device, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14861891

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14861891

Country of ref document: EP

Kind code of ref document: A1