WO2020113549A1 - External staging storage cluster mechanism to optimize archival data storage system on shingled magnetic recording hard disk drives

Info

Publication number
WO2020113549A1
Authority
WO
WIPO (PCT)
Prior art keywords
smr
zone
storage subsystem
data
staging
Application number
PCT/CN2018/119731
Other languages
French (fr)
Inventor
Jianjian Huo
Yikang Xu
Shu Li
Jinbo Wu
Original Assignee
Alibaba Group Holding Limited
Application filed by Alibaba Group Holding Limited
Priority to PCT/CN2018/119731
Publication of WO2020113549A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G06F 3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0614 Improving the reliability of storage systems
    • G06F 3/0616 Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0644 Management of space entities, e.g. partitions, extents, pools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647 Migration mechanisms
    • G06F 3/0649 Lifecycle management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 5/00 Recording by magnetisation or demagnetisation of a record carrier; Reproducing by magnetic means; Record carriers therefor
    • G11B 5/012 Recording on, or reproducing or erasing from, magnetic disks

Definitions

  • When writing full zone files to the empty zones in main storage cluster 520, the system may also be configured to write zone files having similar life expectancies into adjacent empty zones, thus further enhancing the GC efficiency. This way, GC can be performed on multiple adjacent SMR zones simultaneously.
  • FIG. 6 illustrates an exemplary external-staging module, according to one embodiment.
  • External-staging module 600 can include an IO-request-receiving module 602, a write-process-management module 604, a read-process-management module 606, a GC-process-management module 608, and an object-mapping-table-management module 610.
  • External-staging module 600 can be implemented as a server cluster, which can include multiple servers that collectively act as a single storage system.
  • IO-request-receiving module 602 receives object-read and object-write requests from a user process. Depending on the request type (e.g., read or write), IO-request-receiving module 602 can forward an IO request to write-process-management module 604 or read-process-management module 606.
  • Write-process-management module 604 manages the object-write processes and can include a number of sub-modules, such as an object-life-expectancy-determination module 612, a zone-file-management module 614, and a write module 616.
  • Object-life-expectancy-determination module 612 can be responsible for determining the life expectancy of an incoming data object (e.g., based on a tag or label attached to the data object) .
  • Zone-file-management module 614 can be responsible for managing a plurality of zone files that can separately store the received data objects based on their life expectancies. More specifically, zone-file-management module 614 can place data objects having similar life expectancies into a same zone file.
  • Write module 616 can be responsible for writing a full zone file into the SMR-based main storage cluster.
  • Read-process-management module 606 manages the object-read processes and can include a number of sub-modules, such as a table-lookup module 618 and a read module 620.
  • Table-lookup module 618 can be responsible for looking up the object-mapping table maintained by object-mapping-table-management module 610 to identify the zone file (s) that include the requested data objects.
  • Read module 620 can be responsible for reading the data objects from the SMR-based main storage cluster.
  • GC-process-management module 608 can be responsible for managing GC tasks for the SMR-based main storage cluster. More specifically, GC-process-management module 608 can perform a number of GC-related operations, such as identifying SMR zones in the main storage cluster that have the highest ratios of invalid data for GC, reading out valid data objects from those identified zones, and writing such data objects into corresponding zone files. Note that, when writing the valid data objects into the zone files, the system can consider the remaining life expectancies of the data objects in order to write them into zone files corresponding to their remaining life expectancies.
  • GC-process-management module 608 can also be responsible for determining whether the GC-triggering condition is met.
  • The GC-triggering condition can include the free space of the main storage cluster being below a threshold value and the number of pending read requests being below a threshold value. This prevents the GC activities from interfering with the object-read processes.
  • FIG. 7 illustrates an exemplary main storage module, according to one embodiment.
  • Main storage module 700 can be implemented as a cluster of SMR-based storage servers.
  • Main storage module 700 can include a read module 702 and a write module 704.
  • Read module 702 receives a read request from the external staging storage cluster, which specifies one or more zone files.
  • Read module 702 can locate the SMR zones storing these zone files and read data objects included in the zone files.
  • Write module 704 receives full zone files from the external-staging module, selects empty SMR zones in the SMR HDDs, and writes the zone files into the empty SMR zones.
  • Embodiments of the present invention provide a solution for enhancing the performance of an SMR-based storage system.
  • Instead of writing data objects directly to the main storage cluster, the system accumulates data in an external staging storage cluster, which can be based on SSDs or CMR HDDs and allows for concurrent writing.
  • The accumulated data objects are placed in zone files that are size-aligned to the SMR zones, making it possible for the main storage cluster to delegate the management of its GC activities to the external staging storage cluster. Because data-write processes occur at the external staging storage cluster and the GC processes occur at the main storage cluster, these two types of processes can happen simultaneously without competing for the main storage cluster's throughput.
  • The external staging storage cluster, which acts as an interface to users and receives the IO requests, can be configured to perform GC only at times of light read activity. This can improve the main storage cluster's performance, latency, and quality of service (QoS). Finally, the external staging storage cluster can also group the received data objects based on their life expectancies, thus significantly reducing the amount of GC activities and write amplification on the main storage cluster.
  • The main storage cluster described above is based on SMR HDDs.
  • However, the solutions provided by embodiments of the present invention can also be used in other types of storage systems that do not allow in-place updates.
  • For example, certain distributed storage systems that use SSDs with very limited erase cycles (e.g., quad-level cell (QLC) or low-cost NAND-based SSDs) can also benefit from the various approaches described herein.
  • Such SSDs use super blocks to manage NAND chips and can exhibit the same GC requirements and interference as SMR HDDs. Therefore, the external staging cluster can use files that are size-aligned to the super blocks to stage received data objects.
  • Such an external staging cluster can be applied in these QLC- or low-cost-SSD-based storage systems to perform various functions, including write staging, read redirection, and garbage collection.
  • FIG. 8 conceptually illustrates an electronic system with which some embodiments of the subject technology are implemented.
  • Electronic system 800 can be a client, a server, a computer, a smartphone, a PDA, a laptop, or a tablet computer with one or more processors embedded therein or coupled thereto, or any other sort of electronic device.
  • Such an electronic system includes various types of computer-readable media and interfaces for various other types of computer-readable media.
  • Electronic system 800 includes a bus 808, processing unit (s) 812, a system memory 804, a read-only memory (ROM) 810, a permanent storage device 802, an input device interface 814, an output device interface 806, and a network interface 816.
  • Bus 808 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of electronic system 800. For instance, bus 808 communicatively connects processing unit (s) 812 with ROM 810, system memory 804, and permanent storage device 802.
  • Processing unit(s) 812 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure.
  • The processing unit(s) can be a single processor or a multi-core processor in different implementations.
  • ROM 810 stores static data and instructions that are needed by processing unit (s) 812 and other modules of the electronic system.
  • Permanent storage device 802 is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when electronic system 800 is off. Some implementations of the subject disclosure use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as permanent storage device 802.
  • system memory 804 is a read-and-write memory device. However, unlike storage device 802, system memory 804 is a volatile read-and-write memory, such as a random-access memory. System memory 804 stores some of the instructions and data that the processor needs at runtime. In some implementations, the processes of the subject disclosure are stored in system memory 804, permanent storage device 802, and/or ROM 810. From these various memory units, processing unit (s) 812 retrieves instructions to execute and data to process in order to execute the processes of some implementations.
  • Bus 808 also connects to input and output device interfaces 814 and 806.
  • Input device interface 814 enables the user to communicate information and send commands to the electronic system.
  • Input devices used with input device interface 814 include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices” ) .
  • Output device interface 806 enables, for example, the display of images generated by the electronic system 800.
  • Output devices used with output device interface 806 include, for example, printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD) .
  • Some implementations include devices such as a touchscreen that function as both input and output devices.
  • Bus 808 also couples electronic system 800 to a network (not shown) through a network interface 816.
  • The computer can be a part of a network of computers (such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet) or a network of networks, such as the Internet.

Abstract

Disclosed is a method and system for managing a shingled magnetic recording (SMR) -based storage system. During operation, the system receives a data object (506, 508) to be stored in the SMR-based storage system, which can include a staging storage subsystem and a main storage subsystem. The main storage subsystem can include a plurality of SMR hard disk drives (HDDs). The system can store the received data object (504, 506, 508) in one or more zone files (510, 512, and 514) residing on the staging storage subsystem and, in response to determining that a size of a respective zone file (510, 512, and 514) reaches a predetermined threshold, write the zone file (510, 512, and 514) to an SMR zone (522, 524) located in an SMR HDD within the main storage subsystem.

Description

EXTERNAL STAGING STORAGE CLUSTER MECHANISM TO OPTIMIZE ARCHIVAL DATA STORAGE SYSTEM ON SHINGLED MAGNETIC RECORDING HARD DISK DRIVES
Inventors: Jianjian Huo, Yikang Xu, Shu Li, and Jinbo Wu
BACKGROUND
Field
This disclosure is generally related to data storage devices and systems implementing the shingled magnetic recording technology. More specifically, this disclosure is related to a method and system that improves the performance of shingled magnetic recording (SMR) hard disk drives (HDDs) .
Related Art
In tiered storage systems, user data may be stored in different tiers, sometimes referred to as “hot” storage and “cold” storage. Frequently accessed data is often stored in the more costly hot storage tier, whereas less frequently accessed or archival data (e.g., user-uploaded photos and videos, enterprise bookkeeping data for long-term archival purposes, surveillance recordings, etc. ) is often stored in the less expensive cold storage tier.
As data storage in data centers has grown exponentially, more and more data has been moved to cold storage. Consequently, data is rapidly piling up in archives; in fact, an industry survey indicated that 57% of all data needs to be stored for 50 years or longer. Although solid-state drives (SSDs) are widely used for hot-tier storage in data centers and the cloud, hard disk drives (HDDs) still dominate warm- and cold-tier data storage due to their low cost and good sequential throughput.
Due to the scale of cold and archival storage, researchers and industries are constantly looking for different hardware and new approaches that can further lower the cost. Shingled magnetic recording (SMR) technology has recently gained popularity in data centers. SMR is a magnetic storage data recording technology used in hard disk drives (HDDs) to increase storage density and overall per-drive storage capacity. Conventional hard disk drives record data by writing non-overlapping magnetic tracks parallel to each other (perpendicular recording), whereas shingled recording writes new tracks that overlap part of the previously written magnetic track, leaving the previous track narrower and allowing for higher track density. Thus, the tracks partially overlap, similar to roof shingles. Due to their low cost and high areal density, implementing SMR drives for archival storage can lower the storage price per GB and the total cost of ownership (TCO).
SUMMARY
One embodiment described herein provides a method and system for managing a shingled magnetic recording (SMR) -based storage system. During operation, the system receives a data object to be stored in the SMR-based storage system, which can include a staging storage subsystem and a main storage subsystem. The main storage subsystem can include a plurality of SMR hard disk drives (HDDs) . The system can store the received data object in one or more zone files residing on the staging storage subsystem and, in response to  determining that a size of a respective zone file reaches a predetermined threshold, write the zone file to an SMR zone located in an SMR HDD within the main storage subsystem.
In a variation on this embodiment, the staging storage subsystem can include a first cluster of storage servers, and the main storage subsystem can include a second cluster of storage servers.
In a further variation, the first cluster of storage servers can include one or more of: a solid-state drive (SSD) -based storage server and a conventional magnetic recording (CMR) hard disk drive (HDD) -based storage server.
In a variation on this embodiment, the size of the zone file substantially equals a size of an SMR zone.
In a variation on this embodiment, the staging storage subsystem performs garbage-collection operations for the main storage subsystem, in response to determining that a garbage-collection condition is met.
In a further variation, determining that the garbage-collection condition is met can include determining that a ratio of empty SMR zones in the main storage subsystem is less than a predetermined threshold and determining that a number of received read requests to the SMR-based storage system is less than a predetermined threshold.
In a further variation, performing the garbage-collection operations can include selecting, from the main storage subsystem, an SMR zone having a highest ratio of invalid data; reading out, from the selected SMR zone, valid data; storing the valid data in a zone file residing on the staging storage subsystem; and erasing all data in the selected SMR zone to free the selected SMR zone.
In a variation on this embodiment, the system updates an object-mapping table subsequent to storing the received data object in the one or more zone files.
In a further variation, the system can further receive an object-read request, look up the object-mapping table to identify one or more zone files  corresponding to the object-read request, identify one or more SMR zones within the main storage subsystem corresponding to the one or more identified zone files, and retrieve a requested data object from the identified SMR zones.
In a variation on this embodiment, storing the received data object can further include grouping a plurality of received data objects based on their respective life expectancies and storing a group of data objects having similar life expectancies in one or more zone files corresponding to that life expectancy.
One embodiment can provide a shingled magnetic recording (SMR)-based data storage system. The SMR-based data storage system can include a main storage subsystem, which can include a plurality of SMR hard disk drives (HDDs), and a staging storage subsystem. The staging storage subsystem can be configured to receive a to-be-stored data object; store the received data object in one or more zone files residing on the staging storage subsystem; and, in response to determining that a size of a respective zone file reaches a predetermined threshold, write the zone file to an SMR zone located in an SMR HDD within the main storage subsystem.
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 illustrates the structures of a conventional hard disk and a shingled magnetic recording (SMR) hard disk.
FIG. 2 presents a diagram illustrating an exemplary architecture of a storage system, according to one embodiment.
FIG. 3 presents a flowchart illustrating exemplary read and write processes, according to one embodiment.
FIG. 4 presents a flowchart illustrating an exemplary garbage-collection process, according to one embodiment.
FIG. 5 illustrates an exemplary scenario for grouping data objects based on life expectancies, according to one embodiment.
FIG. 6 illustrates an exemplary external-staging module, according to one embodiment.
FIG. 7 illustrates an exemplary main storage module, according to one embodiment.
FIG. 8 conceptually illustrates an electronic system with which some embodiments of the subject technology are implemented.
In the figures, like reference numerals refer to the same figure elements.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Overview
Embodiments of the present invention improve the performance of host-managed (HM) SMR HDDs by implementing an external staging storage cluster as an access interface for front-end users of an SMR-based main storage cluster. The external staging storage cluster can use SSDs or conventional magnetic recording (CMR) HDDs for storage and can have a much smaller capacity than the main storage cluster. More specifically, incoming object-writes can be staged or accumulated in the external staging storage cluster and stored in a new zone file. Once the new zone file reaches the capacity of an SMR zone (e.g., 256 MB), it can migrate from the external staging storage cluster to the main storage cluster for permanent storage. By staging the objects prior to writing them to the main storage, the system avoids the performance degradation caused by the SMR HDD's inability to handle parallel writes. Moreover, by aligning the size of the zone files to the size of the SMR zones, the SMR-based main storage cluster no longer needs to manage garbage collection; instead, the external staging storage cluster can run garbage collection (GC) tasks as background processes. Finally, objects can be grouped based on their life expectancies, thus reducing the overall GC activities and write amplification in the main storage cluster.
SMR-Based Data Storage Systems
FIG. 1 illustrates the structures of a conventional hard disk and a shingled magnetic recording (SMR) hard disk. More specifically, the left drawing shows the structure of a conventional hard disk 102, and the right drawing shows the structure of an SMR hard disk 104. Conventional hard disk 102 can have sectors (e.g., sector 106) of equal sizes sitting in line on a track, and there is a gap (e.g., gap 108) between adjacent tracks. These gaps are considered a waste of space. On the other hand, SMR hard disk 104 enables higher track density by allowing adjacent tracks to overlap one another, eliminating the gap between adjacent tracks.
In conventional SMR-based data storage systems (including single-server and server-cluster systems), data update or delete activities and the associated GC processes are handled by local storage software modules that run on the same host server as the SMR HDD. These systems are referred to as host-managed systems. Host-managed (HM) SMR HDDs can provide significant benefits in increasing storage density; however, they also bring many challenges. More particularly, HM SMR HDDs require strict adherence to a special protocol by the host server. Because the host server manages the shingled nature of the SMR HDDs, it is required to write sequentially in order to avoid damaging existing data. HM SMR HDDs do not allow update-in-place for SMR zones; hence, any data update/delete activities will incur garbage collection, which increases data write amplification and lowers the performance of the storage system. In some cases, the GC activities may even stall front-end users' input/output (IO) requests and make the storage system unavailable for access. Moreover, due to the sequential nature of the HM SMR HDD, concurrent writes or updates to an SMR HDD have to write into multiple SMR zones in parallel, leading to a performance degradation of between 10- and 20-fold.
In some embodiments, to avoid such a significant performance degradation, external staging storage (e.g., an external staging storage cluster) can be introduced to temporarily stage the data objects before they are stored into the SMR HDDs. FIG. 2 presents a diagram illustrating an exemplary architecture of a storage system, according to one embodiment. Storage system 200 can include an SMR-based main storage cluster 202, which comprises a plurality of server nodes (e.g., server nodes 204 and 206). More specifically, SMR-based main storage cluster 202 can be a distributed server cluster, which stores the complete copies of all archival data (also referred to as data objects). Data objects can be distributed across hundreds or thousands of servers using hashing or indexing methods. To increase the storage density and reduce cost, each server node includes an HM SMR HDD for storage purposes. For example, server node 204 includes an HM SMR HDD 208.
Storage system 200 can also include an external staging storage cluster 210 coupled to SMR-based main storage cluster 202 via a network 212. External staging storage cluster 210 can include one or more staging nodes (e.g., staging nodes 214 and 216). Each staging node can use SSDs or CMR HDDs for storage purposes. For example, staging node 214 can include an SSD 218, and staging node 216 can include a CMR HDD 220. The total storage capacity of external staging storage cluster 210 can be much smaller than that of main storage cluster 202. In some embodiments, the storage capacity of external staging storage cluster 210 can be between 2% and 5% of that of main storage cluster 202.
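As a rough, back-of-the-envelope illustration of the 2%-5% sizing guideline above, the Python sketch below applies the ratio to a hypothetical 10 PB main cluster; the capacity figure is an illustrative assumption, not taken from the disclosure.

    # Hypothetical sizing sketch: the 2%-5% staging-to-main capacity ratio
    # described above, applied to an assumed 10 PB (10,000 TB) main cluster.
    MAIN_CAPACITY_TB = 10_000
    STAGING_RATIO_LOW, STAGING_RATIO_HIGH = 0.02, 0.05

    low = MAIN_CAPACITY_TB * STAGING_RATIO_LOW     # 200 TB
    high = MAIN_CAPACITY_TB * STAGING_RATIO_HIGH   # 500 TB
    print(f"Staging cluster capacity: {low:.0f}-{high:.0f} TB "
          f"for a {MAIN_CAPACITY_TB} TB main cluster")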
During operation, incoming data objects (e.g., user data files to be written) will be staged or accumulated in external staging storage cluster 210 and stored into a new zone file (e.g., zone file 220 or 222), which can grow in size as more data objects come in. Once this new zone file is full (e.g., its size reaches the size of an SMR zone), it can be sent to main storage cluster 202 for permanent storage. In FIG. 2, zone file 220 is full and will migrate to main storage cluster 202, whereas zone file 222 is not yet full and will remain in external staging storage cluster 210.
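A minimal Python sketch of this staging behavior, assuming a 256 MB SMR zone; the class and helper names (ZoneFile, stage_object, flush_to_main) are illustrative stand-ins and not part of the disclosure.

    # Zone-file staging sketch: objects are appended to an open zone file, and the
    # file migrates to the main cluster once it reaches the assumed SMR zone size.
    SMR_ZONE_SIZE = 256 * 1024 * 1024      # assumed 256 MB SMR zone

    class ZoneFile:
        def __init__(self, file_id):
            self.file_id = file_id
            self.objects = []              # list of (object_id, payload) tuples
            self.size = 0                  # bytes accumulated so far

        def is_full(self):
            return self.size >= SMR_ZONE_SIZE

        def append(self, object_id, payload):
            offset = self.size             # offset recorded in the object-mapping table
            self.objects.append((object_id, payload))
            self.size += len(payload)
            return offset

    def stage_object(open_file, object_id, payload, flush_to_main):
        """Append one object; hand the zone file to the main cluster when it is full."""
        offset = open_file.append(object_id, payload)
        if open_file.is_full():
            flush_to_main(open_file)       # written into an empty SMR zone (FIG. 3, op. 316)
            open_file = ZoneFile(open_file.file_id + 1)
        return open_file, offset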
In some embodiments, external staging storage cluster 210 can maintain an object-mapping table internally. More specifically, the object-mapping table maps each data object to one or more zone files. Depending on its size, a single data object can reside in one zone file or span multiple zone files. The object-mapping table facilitates the retrieval of the data objects during read operations.
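One possible shape for such an object-mapping table is sketched below, assuming each object is recorded as a list of extents so that a large object can span multiple zone files; the extent fields and names are assumptions for illustration only.

    # Object-mapping table sketch: object_id -> extents in one or more zone files.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Extent:
        zone_file_id: int                  # zone file holding this piece of the object
        offset: int                        # byte offset inside the zone file
        length: int                        # number of bytes

    class ObjectMappingTable:
        def __init__(self):
            self._map = defaultdict(list)

        def add_extent(self, object_id, extent):
            self._map[object_id].append(extent)

        def lookup(self, object_id):
            """Return the extents needed to reassemble the object, or None if unknown."""
            return self._map.get(object_id)

    table = ObjectMappingTable()
    table.add_extent("photo-001", Extent(zone_file_id=7, offset=0, length=4 << 20))
    assert table.lookup("photo-001")[0].zone_file_id == 7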
FIG. 3 presents a flowchart illustrating exemplary read and write processes, according to one embodiment. During operation, the system receives a user object IO request (operation 302) and determines the IO type of the request (operation 304) . More specifically, the external staging storage cluster can act as the interface for the user and receive the user IO request. If the IO request is an object-read request, the system performs a lookup in the object-mapping table (operation 306) and reads the object out (operation 308) . Note that the object-mapping table maps user data objects to zone files and is also maintained by the external staging storage cluster. On the other hand, the main storage cluster maintains the mapping between the zone files and their physical locations (e.g., SMR zones) on the SMR HDDs. Retrieval of the zone files stored in the SMR HDDs can be handled by standard storage management modules residing on the  main storage cluster. The system then returns the object to the user (operation 310) and receives a new IO request (operation 302) .
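The read path of operations 306-310 might be implemented along the following lines, where the staging cluster resolves an object to zone-file extents and the main cluster resolves zone files to physical SMR zones; both lookup structures and the read_zone_range helper are hypothetical stand-ins.

    # Read-path sketch (FIG. 3, operations 306-310): two-level lookup, then read.
    def read_object(object_id, object_mapping, zone_location_map, read_zone_range):
        """object_mapping:    object_id -> [(zone_file_id, offset, length), ...]
        zone_location_map: zone_file_id -> (smr_hdd, smr_zone), kept by the main cluster
        read_zone_range:   callable reading a byte range from a physical SMR zone.
        Zone files still open in the staging cluster would be served locally; omitted here."""
        extents = object_mapping.get(object_id)
        if extents is None:
            raise KeyError(f"unknown object {object_id}")
        pieces = []
        for zone_file_id, offset, length in extents:
            hdd, zone = zone_location_map[zone_file_id]
            pieces.append(read_zone_range(hdd, zone, offset, length))
        return b"".join(pieces)            # object returned to the user (operation 310)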
On the other hand, if the IO request is an object-write request, the system writes the object into a zone file stored in the external staging storage cluster (operation 312) . The system then determines whether the size of the zone file reaches a predetermined threshold (e.g., the size of the zone file is substantially equal to the size of an SMR zone) (operation 314) . If so, the system locates an empty zone in a selected SMR HDD in the main storage cluster and writes the zone file into the empty zone (operation 316) . If not, the system receives a new IO request (operation 304) . More specifically, the zone file can be sent by the external staging storage cluster to the main storage cluster, which will then write the zone-size-aligned zone file into the SMR HDDs according to standard SMR protocols.
Garbage Collection
Garbage collection (GC) is a form of automatic storage management. The garbage collector (or collector) attempts to reclaim garbage, that is, storage space occupied by objects that are no longer in use. Unlike traditional HM SMR HDDs, where garbage collection is controlled by the SMR host server, in some embodiments of the present invention the external staging storage cluster controls the garbage-collection process. More specifically, the external staging storage cluster can run one or more GC tasks as background processes. GC is needed when the free storage space in the main storage cluster is less than a predetermined threshold. However, GC is not triggered automatically as soon as storage space runs low. To reduce the impact of GC tasks on data read activities, the system may only start GC when the read activities are light (e.g., the number of read requests is below a threshold). Note that the system does not need to consider write requests when determining whether to start GC, because object-write requests are handled by the external staging storage cluster.
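A sketch of this two-part GC-triggering condition; the specific threshold values are illustrative assumptions, not values prescribed by the disclosure.

    # GC-trigger sketch: start GC only when free zones are scarce AND reads are light.
    FREE_ZONE_RATIO_THRESHOLD = 0.10       # assumed: GC needed below 10% empty zones
    READ_LOAD_THRESHOLD = 100              # assumed: pending read requests

    def should_start_gc(empty_zones, total_zones, pending_read_requests):
        low_free_space = (empty_zones / total_zones) < FREE_ZONE_RATIO_THRESHOLD
        light_reads = pending_read_requests < READ_LOAD_THRESHOLD
        return low_free_space and light_reads

    print(should_start_gc(empty_zones=50, total_zones=1000, pending_read_requests=12))  # True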
FIG. 4 presents a flowchart illustrating an exemplary garbage-collection process, according to one embodiment. During operation, the system determines whether the free space in the main storage is below a predetermined threshold (operation 402) and whether the number of received read requests is below a predetermined threshold (operation 404). If both requirements are met, the system can select, from all SMR HDDs in the main storage cluster, one or more SMR zones that have the highest ratio of invalid data for garbage collection (operation 406); for example, the system may select the top one or a few SMR zones. Otherwise, the system waits. The system can then perform the standard GC operations on the selected zones (operation 408). The GC operations can include reading out valid data from these zones and writing the valid data into a new zone file stored in the external staging storage cluster. Note that after the valid data is read out from the selected zones, these SMR zones can be freed and made ready to receive new data. The system can then determine whether the new zone file is full (operation 410). If not, the system may select more SMR zones for GC (operation 406). If so, the system can locate an empty zone on an SMR HDD belonging to the main storage cluster and write the full zone file into the selected zone (operation 412). The system can then discard the previous GC zone files and update the object-mapping table (operation 414). Because the size of the zone files is the same as the size of the SMR zones (e.g., 256 MB), there is no need for the main storage cluster to manage the GC activities.
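The following sketch mirrors the flow of FIG. 4 (operations 406-414) under the same assumptions, reusing the ZoneFile-style accumulator sketched earlier; the zone records and helper callables are placeholders rather than the disclosed implementation.

    # GC-loop sketch (FIG. 4): pick the dirtiest zones, re-stage their valid data,
    # and write the resulting full zone file back into an empty SMR zone.
    def run_gc_pass(zones, staging_zone_file, write_zone_file_to_empty_zone,
                    update_object_mapping):
        """zones: list of dicts such as {'id': ..., 'invalid_ratio': ..., 'valid_data': [...]}"""
        # Operation 406: zones with the highest ratio of invalid data come first.
        for zone in sorted(zones, key=lambda z: z["invalid_ratio"], reverse=True):
            # Operation 408: read out valid data and append it to a staging zone file.
            for object_id, payload in zone["valid_data"]:
                staging_zone_file.append(object_id, payload)
            zone["valid_data"] = []        # the selected SMR zone is now free for new data
            # Operations 410-414: once the zone file is full, write it to an empty zone
            # in the main cluster and update the object-mapping table.
            if staging_zone_file.is_full():
                write_zone_file_to_empty_zone(staging_zone_file)
                update_object_mapping(staging_zone_file)
                break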
The GC activities mostly occur on the main storage cluster and do not interfere with the object-write operations, which happen at the external staging storage cluster. This means that GC and object-writes can happen concurrently. More particularly, the incoming data objects are written into the external staging storage cluster and do not compete for the throughput of the main  storage cluster. Moreover, in embodiments of the present invention, the GC tasks are run as background processes and are performed only in times of light read requests. Consequently, the GC activities do not cause degradation to the storage system performance.
To further reduce the impact of GC, in some embodiments, the system groups incoming data objects based on their life expectancy. Objects with similar life expectancies can be grouped together (e.g., saved to the same zone file) such that they are stored in the same SMR zone. As a result, objects within a single zone may expire at roughly the same time, and the zone will eventually be picked up by GC. By that time, most of the objects stored in the zone will have been invalidated and deleted, which increases the efficiency of GC. This can reduce the overall need for GC and lower the write amplification.
FIG. 5 illustrates an exemplary scenario for grouping data objects based on life expectancy, according to one embodiment. During operation, external staging storage cluster 502 can receive a large number of user data objects of different sizes and life expectancies, e.g., data objects 504, 506, and 508. Note that the life expectancy of a user data object is often set by the user policy, which specifies how long the data object is to be kept in the main storage cluster. Such information can be sent to the staging storage cluster along with the data objects. In some embodiments, the life expectancy of a data object can be included in a label or tag attached to the data object.
In FIG. 5, the data objects are represented using circles, the size of a circle indicating the size of the data object and the fill pattern of a circle indicating its life expectancy. In the example shown in FIG. 5, data objects having a hatched fill pattern (e.g., data object 504) can have the shortest life expectancy (e.g., less than three months) , data objects having a dotted fill pattern (e.g., data object 506) can have a medium life expectancy (e.g., between three and twelve months) , and blank data objects (e.g., data object 508) can have the longest life expectancy (e.g., greater than 12 months) .
FIG. 5 shows that the user objects received by the staging storage cluster can be stored in different zone files based on their life expectancies. Instead of storing the incoming data objects indiscriminately into a currently open zone file in the order in which they are received, in some embodiments, external staging storage cluster 502 can simultaneously maintain multiple open zone files (e.g., zone files 510, 512, and 514). When a data object is received, the system can determine its life expectancy (e.g., by checking a tag attached to the data object) and then store the data object into one of the open zone files based on that life expectancy. For example, data objects having the shortest life expectancy can be stored in zone file 510, data objects having a medium life expectancy can be stored in zone file 512, and data objects having the longest life expectancy can be stored in zone file 514. Some incoming data objects may have unknown life expectancies (e.g., the user policy does not set a time limit for storing those data objects); such data objects may be stored in one or more separate zone files. External staging storage cluster 502 also maintains an object-mapping table 516, which records the mapping between the user data objects and the zone files.
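As a non-limiting sketch, the grouping policy can be expressed as follows. The bucket boundaries follow the three-month/twelve-month example of FIG. 5; the tag name and the zone-file/mapping-table interfaces are assumptions made for the example.

```python
# Sketch of life-expectancy grouping into multiple open zone files. The tag name
# "life_expectancy_months" and the zone-file/mapping-table APIs are assumptions.
SHORT, MEDIUM, LONG, UNKNOWN = "short", "medium", "long", "unknown"

def life_expectancy_bucket(data_object):
    months = data_object.tags.get("life_expectancy_months")
    if months is None:
        return UNKNOWN          # user policy sets no time limit
    if months < 3:
        return SHORT
    if months <= 12:
        return MEDIUM
    return LONG

def stage_object(data_object, open_zone_files, mapping_table):
    zone_file = open_zone_files[life_expectancy_bucket(data_object)]
    offset = zone_file.append(data_object)
    mapping_table.record(data_object.key, zone_file.id, offset)
```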
When a zone file is full, external staging storage cluster 502 can write the zone file to main storage cluster 520, which can include a number of SMR storage servers. More specifically, the full zone file can be distributed among and stored into these SMR storage servers using standard replication and erasure-coding approaches. Each SMR storage server can manage an SMR HDD, which can include a number of SMR zones. Each SMR zone can accommodate a full zone file, which has the same size as the SMR zone. Because the zone files were created by grouping data objects of similar life expectancies together, each SMR zone stores data objects having similar life expectancies. For example, data objects in SMR zone 522 can all have a short life expectancy, whereas data objects in SMR zone 524 can all have a much longer life expectancy. In other words, data objects stored in an SMR zone can expire at roughly the same time, resulting in the SMR zone having a high ratio of invalid data and being ready for GC. For example, all data objects stored in SMR zone 522 have a life expectancy of less than three months. Hence, after three months, all data objects stored in SMR zone 522 have expired and are ready to be deleted. During GC, the garbage collector can erase all data objects from SMR zone 522, resulting in 100% GC efficiency. Because there is no valid data in SMR zone 522, there is no need to read out and rewrite valid data and, thus, no write amplification occurs in this example. This is an extreme case; in practice, not all SMR zones will have all of their data objects expire at the same time, so the GC efficiency cannot always reach 100%. However, by grouping data objects based on life expectancies, the system can significantly increase GC efficiency and reduce write amplification.
In some embodiments, when writing the full zone files to the empty zones in main storage cluster 520, the system may also be configured to write zone files having similar life expectancies into adjacent empty zones. This way, GC can be performed on multiple adjacent SMR zones simultaneously, further enhancing the GC efficiency.
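One possible placement policy for this adjacency preference is sketched below; the zone-layout queries and method names are hypothetical and serve only to illustrate the idea.

```python
# Sketch of placing a full zone file next to zones that already hold the same
# life-expectancy bucket, falling back to any empty zone. APIs are illustrative.
def place_zone_file(main_cluster, zone_file):
    for zone in main_cluster.zones_with_bucket(zone_file.bucket):
        neighbour = main_cluster.adjacent_empty_zone(zone)
        if neighbour is not None:
            neighbour.write_sequential(zone_file.bytes())
            return neighbour
    fallback = main_cluster.find_empty_zone()
    fallback.write_sequential(zone_file.bytes())
    return fallback
```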
FIG. 6 illustrates an exemplary external-staging module, according to one embodiment. External-staging module 600 can include an IO-request-receiving module 602, a write-process-management module 604, a read-process-management module 606, a GC-process-management module 608, and an object-mapping-table-management module 610. In some embodiments, external-staging module 600 can be implemented as a server cluster, which can include multiple servers that collectively act as a single storage system.
IO-request-receiving module 602 receives object-read and object-write requests from a user process. Depending on the request type (e.g., read or write), IO-request-receiving module 602 can forward an IO request to write-process-management module 604 or read-process-management module 606.
Write-process-management module 604 manages the object-write processes and can include a number of sub-modules, such as an object-life-expectancy-determination module 612, a zone-file-management module 614, and a write module 616. Object-life-expectancy-determination module 612 can be responsible for determining the life expectancy of an incoming data object (e.g., based on a tag or label attached to the data object). Zone-file-management module 614 can be responsible for managing a plurality of zone files that separately store the received data objects based on their life expectancies. More specifically, zone-file-management module 614 can place data objects having similar life expectancies into the same zone file. Write module 616 can be responsible for writing a full zone file into the SMR-based main storage cluster.
Read-process-management module 606 manages the object-read processes and can include a number of sub-modules, such as a table-lookup module 618 and a read module 620. Table-lookup module 618 can be responsible for looking up the object-mapping table maintained by object-mapping-table-management module 610 to identify the zone file (s) that include the requested data objects. Read module 620 can be responsible for reading the data objects from the SMR-based main storage cluster.
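A hedged sketch of this read path follows. Serving a not-yet-flushed zone file directly from the staging cluster is an assumption consistent with the read-redirection role mentioned later, rather than an explicit requirement of this module; the table layout and method names are likewise assumptions.

```python
# Sketch of the object-read path: mapping-table lookup, then a read from the
# staging cluster (if the zone file has not been flushed yet) or from the SMR
# zone in the main storage cluster that holds the zone file.
def read_object(key, mapping_table, staging, main_cluster):
    entry = mapping_table.lookup(key)       # zone-file id, offset, length
    if staging.has_zone_file(entry.zone_file_id):
        return staging.read(entry.zone_file_id, entry.offset, entry.length)
    smr_zone = main_cluster.zone_for_zone_file(entry.zone_file_id)
    return smr_zone.read(entry.offset, entry.length)
```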
GC-process-management module 608 can be responsible for managing GC tasks for the SMR-based main storage cluster. More specifically, GC-process-management module 608 can perform a number of GC-related operations, such as identifying SMR zones in the main storage cluster that have the highest ratios of invalid data, reading out valid data objects from those identified zones, and writing such data objects into corresponding zone files. Note that, when writing the valid data objects into the zone files, the system can consider the remaining life expectancies of the data objects in order to write them into zone files corresponding to their remaining life expectancies. For example, if the remaining life of a valid data object obtained during GC is about three months, this data object will be grouped together with incoming to-be-written data objects having a life expectancy of about three months. In other words, valid data objects recovered from GC can be treated like fresh to-be-written data objects. GC-process-management module 608 can also be responsible for determining whether the GC-triggering condition is met. The GC-triggering condition can include the free space of the main storage cluster being below a threshold value and the number of currently pending read requests being below a threshold value. This prevents the GC activities from interfering with the object-read processes.
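The re-grouping of valid data recovered during GC can be sketched as follows; the absolute-expiry tag, the 30-day month approximation, and the bucket names are assumptions used only for illustration.

```python
# Sketch: valid objects recovered during GC are re-staged by their REMAINING
# life expectancy, alongside fresh writes. Tag names and time math are assumptions.
import time

def restage_valid_object(obj, open_zone_files, mapping_table, now=None):
    now = now if now is not None else time.time()
    expires_at = obj.tags.get("expires_at")     # hypothetical absolute expiry time
    if expires_at is None:
        bucket = "unknown"
    else:
        remaining_months = max(0.0, (expires_at - now) / (30 * 24 * 3600))
        bucket = ("short" if remaining_months < 3
                  else "medium" if remaining_months <= 12 else "long")
    zone_file = open_zone_files[bucket]
    offset = zone_file.append(obj)
    mapping_table.record(obj.key, zone_file.id, offset)
```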
FIG. 7 illustrates an exemplary main storage module, according to one embodiment. In some embodiments, main storage module 700 can be implemented as a cluster of SMR-based storage servers. Main storage module 700 can include a read module 702 and a write module 704. Read module 702 receives, from the external staging storage cluster, a read request that specifies one or more zone files. Read module 702 can locate the SMR zones storing these zone files and read the data objects included in the zone files. Write module 704 receives full zone files from the external-staging module, selects empty SMR zones on the SMR HDDs, and writes the zone files into the empty SMR zones.
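The write side of the main storage module can be sketched as below; the zone-allocation and append calls are generic placeholders rather than a specific host-managed SMR command set.

```python
# Minimal sketch of writing one size-aligned zone file sequentially into one
# empty SMR zone. The drive/zone API is a placeholder, not a real SMR interface.
ZONE_SIZE_BYTES = 256 * 1024 * 1024

def write_zone_file(smr_hdd, zone_file_bytes):
    assert len(zone_file_bytes) == ZONE_SIZE_BYTES, "zone file must be size-aligned"
    zone = smr_hdd.allocate_empty_zone()
    zone.append(zone_file_bytes)   # SMR zones accept only sequential appends
    zone.finish()                  # close the zone once it is full
    return zone.id
```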
In general, embodiments of the present invention provide a solution for enhancing the performance of an SMR-based storage system. To utilize the full throughput of the SMR HDDs, instead of writing data objects directly to the main storage cluster, the system accumulates data in an external staging storage cluster, which can be based on SSDs and CMR HDDs and allows concurrent writes. Moreover, the accumulated data objects are placed in zone files that are size-aligned to the SMR zones, making it possible for the main storage cluster to delegate the management of its GC activities to the external staging storage cluster. Because the data-write processes occur at the external staging storage cluster and the GC processes occur at the main storage cluster, these two types of processes can happen simultaneously without competing for the throughput of the main storage cluster. To further reduce the impact of the GC activities, the external staging storage cluster, which acts as the interface to users and receives the IO requests, can be configured to perform GC only at times of light read activity. This can improve the main storage cluster's performance, latency, and quality of service (QoS). Finally, the external staging storage cluster can also group the received data objects based on their life expectancies, thus significantly reducing the amount of GC activity and the write amplification on the main storage cluster.
In the aforementioned examples, the main storage cluster is based on SMR HDDs. In practice, the solutions provided by embodiments of the present invention can also be used in other types of storage systems that do not allow in-place updates. For example, certain distributed storage systems that use SSDs with very limited erase cycles (e.g., quad-level cell (QLC)- or other low-cost NAND-based SSDs) can also benefit from the various approaches described herein. Note that SSDs use super blocks to manage NAND chips and can exhibit the same GC requirements and interferences as SMR HDDs. Therefore, the external staging cluster can use files that are size-aligned to the super blocks to stage received data objects. Such an external staging cluster can be applied in these QLC- or low-cost-SSD-based storage systems to perform various functions, including write staging, read redirection, and garbage collection.
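To illustrate this generalization, the staging-file size can simply be derived from the underlying device's erase or append unit, as sketched below; the device descriptor and its fields are assumptions.

```python
# Sketch: align the staging file size to the device's native unit, an SMR zone
# for SMR HDDs or a NAND super block for QLC/low-cost SSDs. Fields are assumptions.
def staging_file_size_for(device):
    if device.kind == "smr_hdd":
        return device.zone_size        # e.g., a 256 MB SMR zone
    if device.kind == "qlc_ssd":
        return device.superblock_size  # super block managed by the SSD firmware
    raise ValueError("unsupported device type: " + str(device.kind))
```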
FIG. 8 conceptually illustrates an electronic system with which some embodiments of the subject technology are implemented. Electronic system 800 can be a client, a server, a computer, a smartphone, a PDA, a laptop, or a tablet computer with one or more processors embedded therein or coupled thereto, or any other sort of electronic device. Such an electronic system includes various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 800 includes a bus 808, processing unit (s) 812, a system memory 804, a read-only memory (ROM) 810, a permanent storage device 802, an input device interface 814, an output device interface 806, and a network interface 816.
Bus 808 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of electronic  system 800. For instance, bus 808 communicatively connects processing unit (s) 812 with ROM 810, system memory 804, and permanent storage device 802.
From these various memory units, processing unit (s) 812 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure. The processing unit (s) can be a single processor or a multi-core processor in different implementations.
ROM 810 stores static data and instructions that are needed by processing unit (s) 812 and other modules of the electronic system. Permanent storage device 802, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when electronic system 800 is off. Some implementations of the subject disclosure use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as permanent storage device 802.
Other implementations use a removable storage device (such as a floppy disk, flash drive, and disk drive) as permanent storage device 802. Like permanent storage device 802, system memory 804 is a read-and-write memory device. However, unlike storage device 802, system memory 804 is a volatile read-and-write memory, such as a random-access memory. System memory 804 stores some of the instructions and data that the processor needs at runtime. In some implementations, the processes of the subject disclosure are stored in system memory 804, permanent storage device 802, and/or ROM 810. From these various memory units, processing unit (s) 812 retrieves instructions to execute and data to process in order to execute the processes of some implementations.
Bus 808 also connects to input and output device interfaces 814 and 806. Input device interface 814 enables the user to communicate information and send commands to the electronic system. Input devices used with input device interface 814 include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices” ) . Output device interface 806 enables, for example, the display of images generated by the electronic system  800. Output devices used with output device interface 806 include, for example, printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD) . Some implementations include devices such as a touchscreen that function as both input and output devices.
Finally, as shown in FIG. 8, bus 808 also couples electronic system 800 to a network (not shown) through a network interface 816. In this manner, the computer can be a part of a network of computers (such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet), or of a network of networks, such as the Internet. Any or all components of electronic system 800 can be used in conjunction with the subject disclosure.
The functions described above can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware. The techniques can be implemented using one or more computer program products. Programmable processors and computers can be included in or packaged as mobile devices. The processes and logic flows can be performed by one or more programmable processors and by programmable logic circuitry. General and special purpose computing devices and storage devices can be interconnected through communication networks.
The foregoing descriptions of various embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.

Claims (20)

  1. A computer-implemented method for managing a shingled magnetic recording (SMR) -based storage system, the method comprising:
    receiving a data object to be stored in the SMR-based storage system, wherein the SMR-based storage system comprises a staging storage subsystem and a main storage subsystem, and wherein the main storage subsystem comprises a plurality of SMR hard disk drives (HDDs) ;
    storing the received data object in one or more zone files residing on the staging storage subsystem; and
    in response to determining that a size of a respective zone file reaches a predetermined threshold, writing the zone file to an SMR zone located in an SMR HDD within the main storage subsystem.
  2. The computer-implemented method of claim 1, wherein the staging storage subsystem comprises a first cluster of storage servers, and wherein the main storage subsystem comprises a second cluster of storage servers.
  3. The computer-implemented method of claim 2, wherein the first cluster of storage servers comprises one or more of:
    a solid-state drive (SSD) -based storage server; and
    a conventional magnetic recording (CMR) hard disk drive (HDD) -based storage server.
  4. The computer-implemented method of claim 1, wherein the size of the zone file substantially equals a size of an SMR zone.
  5. The computer-implemented method of claim 1, further comprising:
    in response to determining that a garbage-collection condition is  met, performing, by the staging storage subsystem, garbage-collection operations for the main storage subsystem.
  6. The computer-implemented method of claim 5, wherein determining that the garbage-collection condition is met comprises:
    determining that a ratio of empty SMR zones in the main storage subsystem is less than a predetermined threshold; and
    determining that a number of received read requests to the SMR-based storage system is less than a predetermined threshold.
  7. The computer-implemented method of claim 5, wherein performing the garbage-collection operations comprises:
    selecting, from the main storage subsystem, an SMR zone having a highest ratio of invalid data;
    reading out, from the selected SMR zone, valid data;
    storing the valid data in a zone file residing on the staging storage subsystem; and
    erasing all data in the selected SMR zone to free the selected SMR zone.
  8. The computer-implemented method of claim 1, further comprising:
    updating an object-mapping table subsequent to storing the received data object in the one or more zone files.
  9. The computer-implemented method of claim 8, further comprising:
    receiving an object-read request;
    looking up the object-mapping table to identify one or more zone files corresponding to the object-read request;
    identifying one or more SMR zones within the main storage  subsystem corresponding to the one or more identified zone files; and
    retrieving a requested data object from the identified SMR zones.
  10. The computer-implemented method of claim 1, wherein storing the received data object further comprises:
    grouping a plurality of received data objects based on their respective life expectancies; and
    storing a group of data objects having similar life expectancies in one or more zone files corresponding to the life expectancy.
  11. A shingled magnetic recording (SMR) -based data storage system, comprising:
    a main storage subsystem, which comprises a plurality of SMR hard disk drives (HDDs) ; and
    a staging storage subsystem, wherein the staging storage subsystem is configured to:
    receive a to-be-stored data object;
    store the received data object in one or more zone files residing on the staging storage subsystem; and
    in response to determining that a size of a respective zone file reaches a predetermined threshold, write the zone file to an SMR zone located in an SMR HDD within the main storage subsystem.
  12. The data storage system of claim 11, wherein the staging storage subsystem comprises a first cluster of storage servers, and wherein the main storage subsystem comprises a second cluster of storage servers.
  13. The data storage system of claim 12, wherein the first cluster of storage servers comprises one or more of:
    a solid-state drive (SSD) -based storage server; and
    a conventional magnetic recording (CMR) hard disk drive (HDD) -based storage server.
  14. The data storage system of claim 11, wherein the size of the zone file substantially equals a size of an SMR zone.
  15. The data storage system of claim 11, wherein the staging storage subsystem is further configured to perform garbage-collection operations for the main storage subsystem in response to determining that a garbage-collection condition is met.
  16. The data storage system of claim 15, wherein while determining that the garbage-collection condition is met, the staging storage subsystem is configured to:
    determine that a ratio of empty SMR zones in the main storage subsystem is less than a predetermined threshold; and
    determine that a number of received read requests to the SMR-based storage system is less than a predetermined threshold.
  17. The data storage system of claim 15, wherein while performing the garbage-collection operations, the staging storage subsystem is configured to:
    select, from the main storage subsystem, an SMR zone having a highest ratio of invalid data;
    read out, from the selected SMR zone, valid data;
    store the valid data in a zone file residing on the staging storage subsystem; and
    erase all data in the selected SMR zone to free the selected SMR zone.
  18. The data storage system of claim 11, wherein the staging storage subsystem is further configured to update an object-mapping table subsequent to storing the received data object in the one or more zone files.
  19. The data storage system of claim 18, wherein the staging storage subsystem is further configured to:
    receive an object-read request;
    look up the object-mapping table to identify one or more zone files corresponding to the object-read request;
    identify one or more SMR zones within the main storage subsystem corresponding to the one or more identified zone files; and
    retrieve a requested data object from the identified SMR zones.
  20. The data storage system of claim 11, wherein, while storing the received data object, the staging storage subsystem is further configured to:
    group a plurality of received data objects based on their respective life expectancies; and
    store a group of data objects having similar life expectancies in one or more zone files corresponding to the life expectancy.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/119731 WO2020113549A1 (en) 2018-12-07 2018-12-07 External staging storage cluster mechanism to optimize archival data storage system on shingled magnetic recording hard disk drives

Publications (1)

Publication Number Publication Date
WO2020113549A1

Family

ID=70974465

Country Status (1)

Country Link
WO (1) WO2020113549A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18942119; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18942119; Country of ref document: EP; Kind code of ref document: A1)