US20220129420A1 - Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device - Google Patents

Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device Download PDF

Info

Publication number
US20220129420A1
US20220129420A1
Authority
US
United States
Prior art keywords
ssd
write request
request
write
ssd controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/572,077
Inventor
Yun-Sheng CHANG
Ren-Shuo Liu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Tsing Hua University NTHU
Original Assignee
National Tsing Hua University NTHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Tsing Hua University NTHU filed Critical National Tsing Hua University NTHU
Priority to US17/572,077
Publication of US20220129420A1
Legal status: Abandoned

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/178Techniques for file synchronisation in file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0804Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0617Improving the reliability of storage systems in relation to availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0619Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1009Address translation using page tables, e.g. page table structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1032Reliability improvement, data loss prevention, degraded operation etc
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/20Employing a main memory using a specific memory technology
    • G06F2212/202Non-volatile memory
    • G06F2212/2022Flash memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/452Instruction code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/72Details relating to flash memory management
    • G06F2212/7201Logical to physical mapping or translation of blocks or pages

Definitions

  • In the Order-Preserving Translation and Recovery (OPTR) SSD according to the disclosure, the actual order of writing data onto the nonvolatile memory 12 (i.e., the order of executing page-program operations) is not necessarily preserved. What is actually preserved after a crash is the order in which the write requests were received and according to which their data are to be persisted in the nonvolatile memory 12.
  • the SSD 1 further includes an identifier of order-preserving guarantee which indicates that the SSD 1 is operating in the order-preserving mode.
  • the identifier may be software-accessible, and may be either editable or read-only.
  • the OPTR SSD may be configured, through editing the identifier, to switch between the order-preserving mode and a conventional mode adopted by the baseline SSD.
  • Alternatively, the OPTR SSD may operate only in the order-preserving mode, or may automatically determine an operation mode to switch to from the order-preserving mode.
  • the identifier is a binary code.
  • the processor 21 executing the software is configured to send a query about the identifier to the SSD 1 , and to determine that the SSD 1 is operating in the order-preserving mode when it is determined based on a reply to the query that the logical value of the identifier thus queried is one.
  • The identifier thus queried may be checked against a predefined set, which may be, e.g., {1}, a predefined set of one or more text strings that indicate a list of product names of the SSDs, or a predefined set of identification values of the SSDs.
  • the computer system or the OPTR SSD is provided with a human-readable indicator indicating that the SSD 1 is operable in the order-preserving mode.
  • the human-readable indicator is a symbol, a picture, a sticker or text on the SSD 1 , or a relevant description posted on the Internet.
  • implementation of the human-readable indicator is not limited to the disclosure herein and may vary in other embodiments.
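  • As a non-limiting illustration of how host software might check for the order-preserving mode before enabling a no-barrier option, the following sketch assumes a hypothetical sysfs attribute that exposes the binary identifier described above; the attribute path and name are assumptions, not part of the disclosure.

        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical attribute path; the disclosure does not specify how the
         * identifier of order-preserving guarantee is exposed to software. */
        #define OPTR_ID_ATTR "/sys/block/sda/device/order_preserving"

        static bool ssd_is_order_preserving(const char *attr_path)
        {
            char buf[8] = {0};
            FILE *f = fopen(attr_path, "r");

            if (f == NULL)
                return false;              /* attribute absent: treat as baseline SSD */
            if (fgets(buf, sizeof(buf), f) == NULL)
                buf[0] = '\0';
            fclose(f);
            return buf[0] == '1';          /* logical value one: order-preserving mode */
        }

        int main(void)
        {
            if (ssd_is_order_preserving(OPTR_ID_ATTR))
                printf("OPTR SSD detected: the no-barrier option may be enabled\n");
            else
                printf("Baseline SSD assumed: keep issuing flush requests\n");
            return 0;
        }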
  • A barrier is a type of order-preserving approach that guarantees that two requests received respectively before and after a barrier request are completed in the order in which the two requests are received.
  • The order of completing the two requests separated by the barrier cannot be altered, so a required partial order of transferring write requests to a disk can be assured, where the partial order means that the order of two requests respectively belonging to a prior request group and a later request group separated by the barrier cannot be interchanged.
  • The flush request, which forces data in the write cache 13 that was received prior to the flush request to be written into the nonvolatile memory 12, is utilized as a substitute for the barrier request.
  • Referring to FIGS. 2 to 5, embodiments of a method of data synchronization according to the disclosure are illustrated.
  • the method of data synchronization is adapted to be implemented by the computer system as previously described.
  • The software executed by the computing apparatus 2 includes application software and data management software, and the data management software can be a filesystem or an equivalent.
  • In this embodiment, the filesystem is exemplified by ext4 for the Linux operating system, and the application software is exemplified by the SQLite database management system.
  • the application software issues an instruction of synchronization (i.e., “fdatasync( )” shown in FIG. 2 ) to the filesystem.
  • In a conventional approach (illustrated in part (1) of FIG. 2), in response to receipt of the instruction of synchronization, the filesystem issues a command to transfer a journal (i.e., "Journal" shown in FIG. 2; the journal can be, e.g., a redo log or an undo log of the filesystem) to the SSD, issues a flush request (i.e., "Flush" shown in FIG. 2) to the SSD, then issues a command to transfer a commit record to the SSD, and finally issues another flush request to the SSD.
  • the command to transfer a commit record serves as an action or a command to complete a transaction, i.e., to ensure that the transaction is atomic.
  • the flush request between the journal and the commit record is used to prevent the conventional baseline SSD from persisting the commit record prior to the journal because the conventional baseline SSD does not guarantee that the write requests are completed in order and atomically.
  • the flush request between the journal and the commit record is safely omissible since the OPTR SSD guarantees to complete the write requests in order and atomically.
  • the processor 21 executing the software is configured to read the identifier of the SSD 1 so as to determine whether the SSD 1 is operating in the order-preserving mode, and to enable a no-barrier option of the software to refrain from issuing any flush request when it is determined that the SSD 1 is operating in the order-preserving mode.
  • the flush requests immediately prior to and immediately subsequent to the command to transfer a commit record are both omitted when the instruction of synchronization is executed.
  • Such an embodiment is illustrated in FIG. 3 and part (4) of FIG. 2.
  • the filesystem is ext4 with the no-barrier option (denoted by “ext4-o nobarrier” in FIG. 2 and “ext4-nobarrier” in FIG. 6 ), and the application software is the SQLite. It should be noted that implementations of the filesystem and the application software are not limited to the disclosure herein and may vary in other embodiments.
  • The method of data synchronization according to the disclosure includes steps S11 and S12 outlined below.
  • In step S11, the application software issues to the filesystem an instruction of synchronization for synchronizing the main memory 22 and the SSD 1.
  • In step S12, in response to receipt of the instruction of synchronization, the filesystem issues a command to transfer a journal to the SSD 1, and issues a command to transfer a commit record to the SSD 1 immediately subsequent to issuing the command to transfer the journal.
  • the command to transfer the journal is issued without being succeeded by a flush request.
  • This embodiment is practical and useful for applications like smartphones, consumer-grade computers, and less-critical database systems such as SQLite.
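  • A minimal sketch of the difference between the two request streams is given below; it merely prints the commands that would reach the SSD and does not reproduce actual ext4 or SQLite code. The labels follow parts (1) and (4) of FIG. 2 as described above.

        #include <stdio.h>

        /* Conventional approach (part (1) of FIG. 2): journal, flush, commit, flush. */
        static void conventional_fdatasync(void)
        {
            puts("WRITE journal");
            puts("FLUSH");
            puts("WRITE commit record");
            puts("FLUSH");
        }

        /* Embodiment of FIG. 3 and part (4) of FIG. 2 (steps S11 and S12): the OPTR
         * SSD persists the write requests in order and atomically, so both flush
         * requests are omitted. */
        static void optr_nobarrier_fdatasync(void)
        {
            puts("WRITE journal");
            puts("WRITE commit record");
        }

        int main(void)
        {
            puts("-- conventional baseline SSD --");
            conventional_fdatasync();
            puts("-- OPTR SSD, ext4 no-barrier option --");
            optr_nobarrier_fdatasync();
            return 0;
        }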
  • In step S12′ of this embodiment (illustrated in FIG. 4 and part (2) of FIG. 2), in response to receipt of the instruction of synchronization issued in step S11 by the application software (SQLite), the filesystem further issues a flush request immediately subsequent to issuing the command to transfer the commit record.
  • this embodiment may still guarantee the same durability as the conventional approach does in part ( 1 ) of FIG. 2 .
  • Referring to FIG. 5 and part (3) of FIG. 2, still another one of the embodiments of the method of data synchronization according to the disclosure is illustrated.
  • This embodiment is a variant of that shown in FIG. 4 and part (2) of FIG. 2, and the method of data synchronization further includes steps S13 and S14.
  • In step S13, the application software (i.e., "SQLite′" shown in FIG. 2) issues to the filesystem an instruction of barrier-only synchronization (i.e., "fdatafence( )" in FIG. 2, or other naming such as "fdatabarrier" in other embodiments) for synchronizing the main memory 22 and the SSD 1.
  • the data management software offers two types of synchronization instructions that involve different numbers of flush commands for the application software to invoke, as shown in FIG. 2 .
  • In step S14, in response to receipt of the instruction of barrier-only synchronization, the filesystem issues a command to transfer a journal to the SSD 1, and issues a command to transfer a commit record to the SSD 1 immediately subsequent to issuing the command to transfer the journal.
  • This embodiment additionally provides the instruction of barrier-only synchronization for applications where the barrier is required to define the required partial order of transferring write requests to the SSD, enhancing flexibility of applying the method of data synchronization according to the disclosure. Moreover, the instruction of synchronization can be used sparingly to promote performance of the overall computer system.
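  • The sketch below contrasts the two synchronization instructions offered to the application in this embodiment; the function bodies are illustrative only, and only the instruction names ("fdatasync( )" and "fdatafence( )") come from FIG. 2.

        #include <stdio.h>

        /* Steps S13 and S14 ("fdatafence( )"): ordering only; no flush is issued,
         * and the required partial order is provided by the OPTR SSD itself. */
        static void fdatafence_stream(void)
        {
            puts("WRITE journal");
            puts("WRITE commit record");
        }

        /* Steps S11, S12 and S12' ("fdatasync( )", FIG. 4 and part (2) of FIG. 2):
         * ordering plus durability; a single flush follows the commit record. */
        static void fdatasync_stream(void)
        {
            puts("WRITE journal");
            puts("WRITE commit record");
            puts("FLUSH");
        }

        int main(void)
        {
            puts("-- fdatafence() --");
            fdatafence_stream();
            puts("-- fdatasync() --");
            fdatasync_stream();
            return 0;
        }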
  • Referring to FIG. 6, performance of the method of data synchronization according to the disclosure is illustrated. The vertical axis indicates speed performance, and the horizontal axis indicates the extent of system changes.
  • the method of data synchronization according to the disclosure realizes stronger data integrity by providing stronger request-level crash guarantee.
  • the method according to the disclosure includes several mechanisms, e.g., write completion tracking, write coalescing tracking, mapping table checkpointing, garbage collection and order-preserving recovery, that will be described in the following paragraphs.
  • the request-level crash guarantee provided by the OPTR SSD according to the disclosure features request atomicity, prefix semantics and flush semantics.
  • Request atomicity guarantees that each write request received by the SSD 1 is atomic regardless of the request size (i.e., the number of sectors to be written). To ensure request atomicity, the method provides different strategies to determine completion of a write request respectively for cases where no page-coalescing occurs and for cases where page-coalescing occurs.
  • To handle cases where no page-coalescing occurs, the method includes steps S211 to S215, as shown in FIG. 7 and outlined below.
  • In step S211, the SSD controller 11 assigns, according to the order in which the write request was received, a write request identifier (WID) in the spare area of each written flash page that is written with the data (there is at least one written flash page).
  • The WID is a unique sequence number for the write request, and increases monotonically with the order in which write requests are received.
  • In this embodiment, the WID is an 8-byte integer.
  • In step S212, the SSD controller 11 assigns a request size in the spare area of each of the at least one written flash page.
  • The request size indicates the total number of flash pages in which the write request is to write the data.
  • In this embodiment, the request size is expressed as a 4-byte integer. It should be noted that the order of executing steps S211 and S212 can be interchanged.
  • In step S213, the SSD controller 11 counts a number of appearances of the WID in the at least one written flash page to result in a WID count. It should be noted that step S213 is executed after occurrence of a crash.
  • In step S214, the SSD controller 11 determines whether the WID count is equal to the request size.
  • In step S215, when the WID count is equal to the request size, the SSD controller 11 determines that the write request is completed and is eligible for recovery after a crash of the SSD 1.
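  • A minimal firmware-side sketch of this completion check is shown below. The structure layout, the scan routine and the example values are assumptions for illustration rather than the actual FTL implementation; only the 8-byte WID and 4-byte request size follow the description above.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        struct spare_area {          /* metadata stored with each written flash page */
            uint64_t wid;            /* 8-byte write request identifier (step S211) */
            uint32_t request_size;   /* 4-byte total page count of the request (S212) */
        };

        /* Steps S213-S215: after a crash, count the appearances of `wid` among the
         * scanned flash pages and compare the count with the recorded request size. */
        static bool write_request_completed(const struct spare_area *pages,
                                            size_t npages, uint64_t wid)
        {
            uint32_t wid_count = 0, request_size = 0;

            for (size_t i = 0; i < npages; i++) {
                if (pages[i].wid == wid) {
                    wid_count++;
                    request_size = pages[i].request_size;
                }
            }
            return request_size != 0 && wid_count == request_size;
        }

        int main(void)
        {
            /* WID 7 wrote 3 pages but only 2 survived: the request is incomplete. */
            struct spare_area scanned[] = { {7, 3}, {7, 3}, {8, 1} };

            printf("WID 7 completed: %d\n",
                   write_request_completed(scanned, 3, 7));   /* prints 0 */
            printf("WID 8 completed: %d\n",
                   write_request_completed(scanned, 3, 8));   /* prints 1 */
            return 0;
        }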
  • Two or more write requests may coalesce in the write cache 13 of an SSD, and the write requests thus involved are referred to as coalesced write requests. This situation reduces the count of appearance(s) of the WID in the written flash page(s).
  • For cases where page-coalescing occurs between a prior write request and a later write request, the method according to the disclosure includes steps S221 to S226 for determining whether the prior write request is incomplete, as shown in FIG. 8 and outlined below.
  • In step S221, for each of the cache pages in the write cache 13 used to cache data corresponding to the prior write request, the SSD controller 11 tags the cache page with a dirty flag, a WID tag and a size tag.
  • the dirty flag indicates whether the cache page is a coalesced page which is coalesced with a cache page used to cache data corresponding to the later write request.
  • the WID tag stores a WID of the prior write request.
  • the size tag stores a request size which indicates a total number of flash pages in which the prior write request is to write the data.
  • In step S222, for each of the coalesced pages that is used to cache data corresponding to the prior write request and that is coalesced with the cache pages corresponding to the later write request, the SSD controller 11 records a page-coalescing record which contains the WID of the prior write request, the request size corresponding to the prior write request, and the WID of the later write request.
  • the page-coalescing record is initially recorded in a DRAM buffer of the SSD 1 , and will be eventually transferred to a reserved block of the flash chip when an amount of accumulation of the page-coalescing records reaches a capacity of a flash page of the flash chip.
  • In step S223, the SSD controller 11 counts a number of appearances of the WID of the prior write request in all written flash page(s) written with data of the prior write request (there is at least one written flash page) to result in a WID count for the prior write request. It should be noted that step S223 is executed after occurrence of a crash.
  • In step S224, the SSD controller 11 counts a number of appearances of the WID of the prior write request in the page-coalescing records for the coalesced pages to result in a page-coalescing count corresponding to the prior write request. It should be noted that the order of executing steps S223 and S224 can be interchanged.
  • In step S225, the SSD controller 11 determines whether a sum of the WID count for the prior write request and the page-coalescing count corresponding to the prior write request is smaller than the request size corresponding to the prior write request.
  • In step S226, when the sum is smaller than the request size, the SSD controller 11 determines that the prior write request is incomplete and is ineligible for recovery after a crash of the SSD 1.
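  • The following sketch illustrates the incompleteness test of steps S223 to S226 with assumed types: when a later write request coalesces with a cache page of a prior write request, the prior request loses one on-flash appearance of its WID, and the page-coalescing record compensates for that loss after a crash.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        struct coalescing_record {   /* one record per coalesced cache page (step S222) */
            uint64_t prior_wid;      /* WID of the prior (overwritten) write request */
            uint32_t prior_size;     /* request size of the prior write request */
            uint64_t later_wid;      /* WID of the later (overwriting) write request */
        };

        /* Steps S223-S226: the prior write request is incomplete if its on-flash WID
         * count plus its page-coalescing count is smaller than its request size. */
        static bool prior_request_incomplete(uint32_t flash_wid_count,
                                             uint32_t prior_size, uint64_t prior_wid,
                                             const struct coalescing_record *recs,
                                             size_t nrecs)
        {
            uint32_t coalescing_count = 0;

            for (size_t i = 0; i < nrecs; i++)
                if (recs[i].prior_wid == prior_wid)
                    coalescing_count++;
            return flash_wid_count + coalescing_count < prior_size;
        }

        int main(void)
        {
            /* Prior request WID 5 spans 4 pages: 3 reached flash and 1 was coalesced
             * by the later request WID 6, so the prior request is in fact complete. */
            struct coalescing_record recs[] = { {5, 4, 6} };

            printf("prior WID 5 incomplete: %d\n",
                   prior_request_incomplete(3, 4, 5, recs, 1));   /* prints 0 */
            return 0;
        }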
  • the method further includes a step in which the SSD controller 11 refrains from making the later write request durable until it is determined that the prior write request is durable.
  • In response to receipt of a query, the SSD controller 11 transmits an indicator indicating that the SSD controller 11 refrains from making the later write request durable until it is determined that the prior write request is durable.
  • the method further includes a step in which, when the SSD controller 11 receives a flush request from the host after receiving the write request, the SSD controller 11 refrains from acknowledging the flush request until it is determined that the write request is completed.
  • The L2P mapping table is checkpointed to the flash chip to speed up recovery from a crash.
  • The method according to the disclosure keeps a full checkpoint, which snapshots the entirety of the L2P mapping table, and at least one incremental checkpoint, which records only the differences in the L2P mapping table that have occurred since the latest checkpoint (either the full checkpoint or an incremental checkpoint).
  • For mapping table checkpointing, the method includes steps S31 to S33 outlined below.
  • In step S31, the SSD controller 11 assigns, for each written flash page that is written with the data (there is at least one written flash page), an LPN in the spare area of the written flash page, in addition to the WID and the request size assigned in the spare area.
  • In step S32, the SSD controller 11 establishes a full checkpoint by storing an entirety of the L2P mapping table in a reserved block of the blocks of the flash chip.
  • The full checkpoint contains a correspondence relationship between the LPN and the PPN for each of the at least one written flash page.
  • In step S33, the SSD controller 11 establishes an incremental checkpoint by storing the portion of the L2P mapping table that has been revised since the latest checkpoint (i.e., whichever of the full checkpoint and the incremental checkpoint(s) was established most recently) was established.
  • Each of the full checkpoint and the incremental checkpoint(s) contains a seal page that records (i) the WID corresponding to the latest write request at the time the corresponding checkpoint was established, and (ii) a PPN corresponding to the next free flash page, to serve as a page pointer, at the time the corresponding checkpoint was established.
  • the page pointer recorded in the seal page may be plural in number when multiple blocks are written at the same time.
  • the method according to the disclosure employs incremental checkpoints by default. When the space for storing incremental checkpoints is full, the method according to the disclosure creates a new full checkpoint and then clears the incremental checkpoints. Moreover, the method according to the disclosure employs a shadow for the full checkpoint to ensure integrity of mapping table checkpointing, and the WID can be used to determine the recency between the full and incremental checkpoints after a crash. When the shadow is employed, an immediately previous one of the full checkpoints is kept until written data that corresponds to a current one of the full checkpoints is ensured to be free from damage.
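  • The sketch below models the information recorded by the checkpoints and the seal page described above; the field names, the number of L2P entries, the delta batch size and the number of page pointers are assumptions for illustration.

        #include <stdint.h>
        #include <stdio.h>

        #define L2P_ENTRIES 4096u      /* assumed number of logical pages */
        #define DELTA_BATCH  256u      /* assumed capacity of an incremental checkpoint */
        #define MAX_POINTERS   4u      /* page pointer may be plural (parallel blocks) */

        struct seal_page {
            uint64_t latest_wid;                  /* WID of the latest write request */
            uint32_t next_free_ppn[MAX_POINTERS]; /* next free flash page(s) */
        };

        struct full_checkpoint {                  /* step S32: entire L2P table */
            uint32_t l2p[L2P_ENTRIES];            /* PPN indexed by LPN */
            struct seal_page seal;
        };

        struct incremental_checkpoint {           /* step S33: revised entries only */
            uint32_t nentries;
            struct { uint32_t lpn, ppn; } delta[DELTA_BATCH];
            struct seal_page seal;
        };

        int main(void)
        {
            printf("full checkpoint: %zu bytes, incremental checkpoint: %zu bytes\n",
                   sizeof(struct full_checkpoint),
                   sizeof(struct incremental_checkpoint));
            return 0;
        }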
  • The crash recovery of the SSD 1 according to the disclosure is related to rebuilding the L2P mapping table, and the method according to the disclosure includes steps S41 to S46 outlined below with reference to FIG. 12.
  • In step S41, the SSD controller 11 reestablishes the entirety of the L2P mapping table by retrieving the full checkpoint stored in the reserved block.
  • In step S42, the SSD controller 11 revises the L2P mapping table thus reestablished by incorporating the revised portion(s) of the L2P mapping table contained in the incremental checkpoint(s) into the L2P mapping table thus reestablished.
  • In step S43, the SSD controller 11 counts, for each of the write requests received after establishment of the latest checkpoint, a number of appearances of the WID corresponding to the write request in subsequent flash pages written with the data of the write request, based on the PPN recorded in the seal page of the latest checkpoint, so as to result in a post-crash WID count that indicates a total number of appearances of the WID in the subsequent flash pages.
  • In step S44, the SSD controller 11 determines, for each of the write requests received after establishment of the latest checkpoint, whether the write request is completed based on the post-crash WID count and the request size corresponding to the write request.
  • In step S45, the SSD controller 11 recovers a group of the write requests received after establishment of the latest checkpoint by using a recovery determination procedure which is related to completeness of the write requests.
  • In step S46, the SSD controller 11 updates the L2P mapping table thus revised by incorporating changes of correspondence relationships between the LPNs and the PPNs of written flash pages related to the group of the write requests thus recovered.
  • The recovery determination procedure used in step S45 includes sub-steps S451 to S454 outlined below.
  • In sub-step S451, the SSD controller 11 arranges the write requests received after establishment of the latest checkpoint in the order in which the write requests were received.
  • In sub-step S452, the SSD controller 11 determines, for every consecutive two of the write requests, whether the consecutive two of the write requests are coalesced.
  • In sub-step S453, the SSD controller 11 determines at least one cut, with each cut being between the write requests of a consecutive pair where there is no coalescing for either of the write requests in the consecutive pair, and where the write requests before the cut are all completed.
  • the write requests before the at least one cut serve as the group of the write requests to be recovered.
  • In sub-step S454, the SSD controller 11 determines an optimum cut from among the at least one cut, where the number of the write requests before the optimum cut is the greatest among the at least one cut, and the write requests before the optimum cut serve as the group of the write requests to be recovered.
  • When the SSD controller 11 receives a prior write request and a later write request, the SSD controller 11 refrains from keeping the later write request until it is determined that the prior write request is completed.
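  • A simplified sketch of the cut selection in sub-steps S451 to S454 follows; the per-request flags are assumptions, and the coalescing condition of sub-step S453 is reduced here to a single flag on the request immediately before a candidate cut.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        struct recovered_request {
            bool completed;            /* result of the completion check (step S44) */
            bool coalesced_with_next;  /* from the page-coalescing records (S452) */
        };

        /* Returns the number of leading write requests to recover: every request
         * before the cut must be completed, and the cut must not separate a pair of
         * coalesced requests; the largest such cut is the optimum cut (S453, S454). */
        static size_t optimum_cut(const struct recovered_request *reqs, size_t n)
        {
            size_t best = 0;

            for (size_t cut = 0; cut <= n; cut++) {
                bool valid = true;

                for (size_t i = 0; i < cut; i++)
                    if (!reqs[i].completed) { valid = false; break; }
                if (valid && cut > 0 && cut < n && reqs[cut - 1].coalesced_with_next)
                    valid = false;
                if (valid)
                    best = cut;
            }
            return best;
        }

        int main(void)
        {
            /* Arrival order: complete, complete (coalesced with the next one),
             * complete, incomplete -> the optimum cut keeps the first three. */
            struct recovered_request reqs[] = {
                {true, false}, {true, true}, {true, false}, {false, false}
            };

            printf("requests recovered: %zu\n", optimum_cut(reqs, 4));   /* prints 3 */
            return 0;
        }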
  • Regarding garbage collection: since in-place updates are forbidden in the flash chip of SSDs, overwriting data is done by writing the updated data to a free flash page and leaving the outdated data in the original flash page, which is then called an invalid flash page.
  • the invalid flash page will be reclaimed by a dedicated routine, called garbage collection (GC), for further reuse.
  • Some of the invalid flash pages reclaimed by GC may be important to crash recovery; that is, this method may leverage these invalid flash pages to recover the OPTR SSD from a crash to an order-preserved state. Therefore, two constraints are enforced on GC, and the method according to the disclosure further includes the following two steps to respectively implement the two constraints.
  • In one of the two steps, the SSD controller 11 refrains from reclaiming any one of the flash pages that is written after establishment of the latest checkpoint. It should be noted that all write requests before a flush request should be durable and atomic, so this constraint prevents a violation of the flush semantics in which flash pages written prior to a flush request but after the latest checkpoint would be reclaimed by the GC, obstructing determination of completion of the write requests received after the latest checkpoint.
  • In the other of the two steps, the SSD controller 11 performs an internal flush on the write cache 13 before performing garbage collection. Performing the internal flush ensures that each of the flash pages reclaimed by the GC has a stable counterpart that can always survive a crash, so the tasks of GC can be simplified. To reduce the performance penalty, the cost of performing the internal flush is mitigated by conducting GC on a batch of blocks (16 blocks in this embodiment).
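  • The following sketch shows where the two constraints would sit in a GC routine; the block metadata, the use of WIDs to detect pages written after the latest checkpoint, and the stub hooks are assumptions for illustration.

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        struct block_info {
            uint64_t newest_wid;       /* WID of the newest page stored in the block */
        };

        /* Stub hooks standing in for real firmware routines. */
        static void internal_flush_write_cache(void)
        {
            puts("internal flush: programming all dirty cache pages");
        }

        static void reclaim_block(const struct block_info *blk)
        {
            printf("reclaiming block (newest WID %llu)\n",
                   (unsigned long long)blk->newest_wid);
        }

        static void garbage_collect(const struct block_info *victims, size_t nvictims,
                                    uint64_t latest_checkpoint_wid)
        {
            /* Constraint 2: flush the write cache first, so every reclaimed page has
             * a stable counterpart that survives a crash. */
            internal_flush_write_cache();

            for (size_t i = 0; i < nvictims; i++) {
                /* Constraint 1: skip blocks holding pages written after the latest
                 * checkpoint; reclaiming them would break completion tracking. */
                if (victims[i].newest_wid > latest_checkpoint_wid)
                    continue;
                reclaim_block(&victims[i]);
            }
        }

        int main(void)
        {
            struct block_info victims[] = { {90}, {120} };

            garbage_collect(victims, 2, 100);   /* only the first block is reclaimed */
            return 0;
        }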
  • the method for facilitating recovery from a crash of an SSD realizes some of the transactional properties (i.e., atomicity and durability) on the SSD with the standard block device interface by modifying firmware (FTL) of the SSD to result in the OPTR SSD according to the disclosure.
  • the OPTR SSD is endowed with strong request-level crash guarantees: a write request is not made durable unless all its prior write requests are durable; each write request is atomic; and all write requests prior to a flush request are guaranteed durable. Consequently, SSD performance may be maintained while achieving an equivalent effect that write requests are completed in order and atomically. As a result, the number of valid post-crash results can be effectively confined and significantly reduced, facilitating tasks of recovery from a crash by applications or filesystems.
  • a scenario is given as an example where the SSD controller 11 receives a first write request to update a first address range of the nonvolatile memory 12 of the SSD 1 by writing data of the first write request in the first address range, and a second write request to update a second address range of the nonvolatile memory 12 of the SSD 1 by writing data of the second write request in the second address range, wherein the second write request is issued by the host later than the first write request. Additionally, there is no flush request in between the first write request and the second write request, and no barrier request in between the first write request and the second write request.
  • In response to a read request to read data in the first address range and the second address range, the SSD controller 11 returns readout data which is guaranteed to belong to one of cases No. 1, 2, 3, 6 and 9 when partial update is allowed (see sub-column A in the last column of Table 2), or to one of cases No. 1, 3 and 9 when partial update is not allowed (see sub-column B in the last column of Table 2).
  • An SSD has four sectors that initially store four version numbers, “0”, “0”, “0” and “0”, respectively.
  • The SSD received four write requests and a flush request before a crash occurred, i.e., "write(0,2)", "write(1,2)", "flush", "write(0,4)" and "write(2,2)" in order, where each of the write requests is specified by a logical block address (LBA) and a size in parentheses.
  • the OPTR SSD according to the disclosure guarantees that the write requests are completed in order and atomically, so the number of valid post-crash results is significantly reduced to three.
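  • A worked sketch of this four-sector example follows. One assumption is made for readability: each write request is taken to stamp its own sequence number into the sectors it covers, which is not stated in the text but makes the surviving requests visible. Under order preservation and request atomicity, only in-order prefixes of the post-flush requests can survive, yielding exactly three valid results.

        #include <stdio.h>

        static void apply(int *sectors, int lba, int size, int version)
        {
            for (int i = lba; i < lba + size; i++)
                sectors[i] = version;
        }

        int main(void)
        {
            int base[4] = {0, 0, 0, 0};

            /* Both write requests before the flush must be durable. */
            apply(base, 0, 2, 1);                  /* write(0,2) */
            apply(base, 1, 2, 2);                  /* write(1,2) */
            /* flush */

            /* The two post-flush requests can survive only as an in-order prefix:
             * {}, {write(0,4)}, or {write(0,4), write(2,2)}. */
            for (int prefix = 0; prefix <= 2; prefix++) {
                int state[4];

                for (int i = 0; i < 4; i++)
                    state[i] = base[i];
                if (prefix >= 1)
                    apply(state, 0, 4, 3);         /* write(0,4) */
                if (prefix >= 2)
                    apply(state, 2, 2, 4);         /* write(2,2) */
                printf("valid post-crash result %d: %d%d%d%d\n",
                       prefix + 1, state[0], state[1], state[2], state[3]);
            }
            return 0;
        }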
  • Operational efficiency of a computer system may be improved by removing unnecessary flush requests that would otherwise be issued by the filesystem in response to receipt of the instruction of synchronization from the application software.

Abstract

A method for facilitating recovery from a crash of a solid-state storage device (SSD) is adapted to be implemented by an SSD controller of the SSD that receives a write request. The method includes: assigning a write request identifier (WID) and a request size in a spare area of each written page of the SSD; counting a number of appearances of the WID in all written page(s) to result in a WID count; determining whether the WID count is equal to the request size; and determining that the write request is completed and is eligible for recovery after a crash of the SSD when it is determined that the WID count is equal to the request size.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a continuation application of co-pending U.S. patent application Ser. No. 16/797,944, filed on Feb. 21, 2020, which claims priority to U.S. Provisional Patent Application No. 62/809,580, filed on Feb. 23, 2019, and U.S. Provisional Patent Application No. 62/873,253, filed on Jul. 12, 2019.
  • FIELD
  • The disclosure relates to a computer system, a method for facilitating recovery from a crash of a solid-state storage device (SSD), a method of data synchronization, and an SSD.
  • BACKGROUND
  • A conventional solid-state storage device (SSD), especially a consumer-grade SSD with a standard block device interface, often lacks sufficient mechanisms for crash recovery at the disk level. Therefore, developers of filesystems and/or application software have to resort to additional measures to ensure a stronger guarantee of data integrity (and sometimes data security) upon a crash, such as modifying the filesystems and/or application software to issue redundant write requests or flush requests. However, most of these measures may be unfavorable in terms of overhead, and may be adverse to overall system performance.
  • SUMMARY
  • Therefore, an object of the disclosure is to provide a computer system, a method for facilitating recovery from a crash of a solid-state storage device (SSD), a method of data synchronization, and an SSD that can alleviate at least one of the drawbacks of the prior art.
  • According to one aspect of the disclosure, the SSD includes a nonvolatile memory and an SSD controller. The nonvolatile memory includes a plurality of pages each of which has a spare area. The SSD controller receives from a host a write request to write data in at least one of the pages. The method for facilitating recovery from a crash of an SSD includes steps of:
      • assigning, by the SSD controller according to an order in which the write request was received, a write request identifier (WID) in the spare area of each of at least one written page that is written with the data, where the WID is a unique sequence number for the write request;
      • assigning, by the SSD controller, a request size in the spare area of each of the at least one written page, where the request size indicates a total number of the at least one of the pages in which the write request is to write the data;
      • counting, by the SSD controller, a number of appearances of the WID in the at least one written page to result in a WID count;
      • determining, by the SSD controller, whether the WID count is equal to the request size; and
      • determining, by the SSD controller, that the write request is completed and is eligible for recovery after a crash of the SSD when it is determined that the WID count is equal to the request size.
  • According to another aspect of the disclosure, the method of data synchronization is to be implemented by a computer system that includes a computing apparatus and the SSD. The computing apparatus executes application software and data management software, and includes a main memory. The SSD is communicable with the computing apparatus. The method includes steps of:
      • issuing, by the application software to the data management software, an instruction of synchronization for synchronizing the main memory and the SSD; and
      • by the data management software in response to receipt of the instruction of synchronization, issuing a command to transfer a journal to the SSD, and issuing a command to transfer a commit record to the SSD immediately subsequent to issuing the command to transfer the journal.
  • According to still another aspect of the disclosure, the computer system includes a solid-state storage device (SSD) that includes a nonvolatile memory, a main memory that is configured to store software, and a processor that is electrically connected to the SSD and the main memory, and that is configured to execute the software stored in the main memory. The SSD is configured to receive a plurality of write requests in order. Each of the write requests contains a specified address range and data to be written in the SSD. The SSD is operable in an order-preserving mode where the SSD persists, in the nonvolatile memory, the data contained in the write requests according to an order in which the write requests are received.
  • According to further another aspect of the disclosure, the SSD includes a nonvolatile memory and receives a plurality of write requests in order. Each of the write requests contains data to be written in the SSD. The SSD is operable in an order-preserving mode where the SSD persists, in the nonvolatile memory, the data contained in the write requests according to an order in which the write requests are received.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Other features and advantages of the disclosure will become apparent in the following detailed description of the embodiment with reference to the accompanying drawings, of which:
  • FIG. 1 is a block diagram illustrating an embodiment of a computer system according to the disclosure;
  • FIG. 2 is a schematic diagram illustrating embodiments of a method of data synchronization according to the disclosure;
  • FIG. 3 is a flow chart illustrating one of the embodiments of the method of data synchronization according to the disclosure;
  • FIG. 4 is a flow chart illustrating another one of the embodiments of the method of data synchronization according to the disclosure;
  • FIG. 5 is a flow chart illustrating still another one of the embodiments of the method of data synchronization according to the disclosure;
  • FIG. 6 is a schematic diagram illustrating performance of the method of data synchronization according to the disclosure;
  • FIGS. 7, 8, 10, 12 and 13 are flow charts illustrating an embodiment of a method for facilitating recovery from a crash of a solid-state storage device (SSD) according to the disclosure;
  • FIG. 9 is a schematic diagram illustrating an example of write coalescing tracking of the method for facilitating recovery from a crash of an SSD according to the disclosure;
  • FIG. 11 is a schematic diagram illustrating an example of mapping table checkpointing of the method for facilitating recovery from a crash of an SSD according to the disclosure;
  • FIG. 14 is a schematic diagram illustrating an example of a recovery determination procedure of the method for facilitating recovery from a crash of an SSD according to the disclosure; and FIG. 15 is a schematic diagram illustrating an example of comparison of valid post-crash results between a conventional SSD and an SSD according to the disclosure.
  • DETAILED DESCRIPTION
  • Before the disclosure is described in greater detail, it should be noted that where considered appropriate, reference numerals or terminal portions of reference numerals have been repeated among the figures to indicate corresponding or analogous elements, which may optionally have similar characteristics.
  • In spite of the fact that solid-state storage devices (SSDs) have been widely used for decades, design principles for optimizing performance of a hard disk drive (HDD) remain pervasive in computer systems compatible with both SSDs and HDDs, such as minimization of seek time and rotational latency by means of reordering requests based on the pickup head position of the HDD. However, SSDs may not benefit from such design principles due to differences in physical structure and operating principle between the SSD and the HDD. For example, reordering requests may complicate the search space needed by filesystems or applications to recover from a crash of an SSD.
  • To enhance data integrity against crash events, one type of SSD (hereinafter referred to as a transactional SSD) is endowed with a set of properties related to database transactions which include atomicity, consistency, isolation and durability (ACID). Transactions are often composed of multiple statements, and atomicity guarantees that each transaction is treated as a single unit, which either succeeds completely or fails completely. Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash; a crash is given as an example hereinafter to represent the system failure). Using a revolutionized interface, the transactional SSD offers a stronger crash guarantee than an SSD with a conventional interface (hereinafter referred to as a baseline SSD). Nevertheless, recognizing the dramatic system changes required to become compatible with a transactional SSD, many existing filesystems (e.g., fourth extended file system, Ext4, for Linux operating system) and application software (e.g., SQLite database management system) still utilize the conventional interface (e.g., a standard block device interface like SATA, which is optimized for HDDs) in establishing communication with SSDs instead of adopting the revolutionized interface used by transactional SSDs. Therefore, backward compatibility is still demanded when optimizing data integrity of an SSD against a crash.
  • Referring to FIG. 1, an embodiment of a computer system according to the disclosure is illustrated. The computer system includes a computing apparatus 2 and an SSD 1.
  • The computing apparatus 2 may be implemented by a personal computer (PC), a database server, a cloud server, a laptop computer, a tablet computer, a mobile phone, a wearable computer, a smartwatch, a television, a datacenter cluster, a network attached storage or the like. However, implementation of the computing apparatus 2 is not limited to the disclosure herein and may vary in other embodiments. The computing apparatus 2 includes a main memory 22 that is configured to store software, and a processor 21 that is electrically connected to the SSD 1 and the main memory 22, and that is configured to execute the software stored in the main memory 22. The SSD 1 and the computing apparatus 2 are communicable with each other via a disk interface, especially, the widely-used standard block device interface, such as SATA.
  • The SSD 1 includes a write cache 13, a nonvolatile memory 12 and an SSD controller 11.
  • In this embodiment, the write cache 13 is implemented by a volatile memory such as a static random access memory (SRAM), a synchronous dynamic random access memory (SDRAM) or a dynamic random access memory (DRAM), but is not limited thereto.
  • The nonvolatile memory 12 is exemplified by flash memory, such as a flash chip, but is not limited thereto and may vary in other embodiments. For example, the nonvolatile memory 12 may be one of a battery-powered DRAM, 3D XPoint memory, phase-change memory (PCM), spin-transfer torque magnetic RAM (STT-MRAM), resistive RAM (ReRAM), an electrically erasable programmable read-only memory (EEPROM), and so on.
  • The flash chip includes a plurality of blocks. Each of the blocks includes a plurality of flash pages, and each of the flash pages has a user area and a spare area. The user area includes a plurality of sectors.
  • The SSD controller 11 executes firmware that includes a flash translation layer (FTL). The FTL is adapted to translate a set of requests (e.g., a write request, a read request and a flush request) issued by a host (e.g., an operating system run by the computing apparatus 2) into a set of flash operations (e.g., page-program, page-read, and block-erase). In this embodiment, the FTL is implemented by a page-level FTL, but is not limited thereto and may be a block-level FTL or a hybrid FTL in other embodiments.
  • Specifically speaking, after receiving from the host a write request to write data in a specified address range of the SSD 1, the SSD controller 11 executing the FTL is configured to segment the data into pages based on the specified address range, with each of the pages being indexed by a logical page number (LPN), and to write the data thus segmented in at least one of the flash pages, with each of the flash pages being indexed by a physical page number (PPN). A correspondence relationship between the LPNs and the PPNs is recorded in a logical-to-physical (L2P) mapping table. Afterward, in response to receipt of a read request from the host to retrieve the data written in the specified address range, the SSD controller 11 executing the FTL is configured to translate the specified address range into the LPNs, and then to look up the PPNs corresponding to the LPNs in the L2P mapping table. In this way, the SSD controller 11 is able to return the data written in the specified address range (hereinafter also referred to as readout data) to the host.
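  • For illustration only, the address translation described above may be sketched as the following C fragment, which simulates a page-level FTL with an in-memory flash array; the sizes, variable names and functions (ftl_init, ftl_write_page, ftl_read_page) are assumptions made for this sketch rather than details of the disclosed embodiments, and reclamation of exhausted free pages is omitted.

    /* A minimal sketch of page-level FTL address translation with a simulated
     * flash array; all names and sizes here are illustrative assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define NUM_LPNS    1024u            /* assumed number of logical pages    */
    #define NUM_PPNS    1280u            /* assumed number of physical pages   */
    #define PAGE_SIZE   4096u            /* assumed flash page size in bytes   */
    #define INVALID_PPN 0xFFFFFFFFu

    static uint8_t  flash[NUM_PPNS][PAGE_SIZE]; /* simulated flash user areas  */
    static uint32_t l2p[NUM_LPNS];              /* LPN -> PPN mapping table    */
    static uint32_t next_free_ppn;              /* naive log-structured cursor */

    void ftl_init(void)
    {
        for (uint32_t i = 0; i < NUM_LPNS; i++)
            l2p[i] = INVALID_PPN;
        next_free_ppn = 0;
    }

    /* Write one logical page: program the next free flash page, remap the LPN. */
    void ftl_write_page(uint32_t lpn, const uint8_t *buf)
    {
        uint32_t ppn = next_free_ppn++;         /* out-of-place update         */
        memcpy(flash[ppn], buf, PAGE_SIZE);     /* corresponds to page-program */
        l2p[lpn] = ppn;                         /* the old PPN becomes invalid */
    }

    /* Read one logical page: look up the PPN, then fetch the flash page. */
    void ftl_read_page(uint32_t lpn, uint8_t *buf)
    {
        uint32_t ppn = l2p[lpn];
        if (ppn == INVALID_PPN)
            memset(buf, 0, PAGE_SIZE);          /* never-written page reads 0  */
        else
            memcpy(buf, flash[ppn], PAGE_SIZE); /* corresponds to page-read    */
    }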
  • It should be noted that the access speed of the flash chip, especially for the page-program (i.e., the flash operation to write data in the flash page), is much slower than that of the DRAM. Therefore, to enhance efficiency of operations of the SSD, a write request received from the host is usually cached in the write cache 13 before performing the page-program.
  • Upon receiving a flush request, the SSD controller 11 executing the FTL is configured to refrain from returning a successful acknowledgement to the host until all valid data or dirty data (i.e., data of incomplete write requests) in the write cache 13 has been stored in stable media such as the nonvolatile memory 12 (i.e., the flash chip in this embodiment).
  • Conventionally, high performance schemes such as internal parallelism, request scheduling, and write caching are adopted to improve performance of SSDs. However, these high performance schemes all break the order of write requests. For the SSD 1 of the computer system according to the disclosure, when receiving a plurality of write requests in order, wherein each of the write requests contains a specified address range and data to be written in the SSD 1, the SSD 1 is operable in an order-preserving mode where the SSD 1 persists, in the nonvolatile memory, the data contained in the write requests according to an order in which the write requests are received. It should be noted that once data has been persisted by an SSD, the data will be preserved for recovery when a crash of the SSD occurs, regardless of whether or not the data has yet been written into a nonvolatile memory of the SSD. Hereinafter, the SSD 1 of the computer system according to the disclosure will also be referred to as an Order-Preserving Translation and Recovery (OPTR) SSD.
  • Furthermore, in a scenario where the SSD 1 sequentially receives a first write request that contains a first address range of the SSD 1 and first data to be written in the first address range, and a second write request that contains a second address range of the SSD 1 and second data to be written in the second address range, the SSD 1 is configured to, during recovery from a crash of the SSD 1, restore the second address range to the state of not having been updated by the second write request (i.e., to restore the second address range to a state before being programmed by the second write request) when it is determined that the first address range has been partially updated by the first write request and the second address range has been fully updated by the second write request.
  • It should be noted that on account of adopting high performance schemes of the SSD such as internal parallelism, the actual order of writing data onto the nonvolatile memory 12 (i.e., the order of executing page-program) is not really preserved in the OPTR SSD. In fact, it is the order in which write requests are received and according to which data are to be persisted in the nonvolatile memory 12 that is actually preserved after a crash of the OPTR SSD.
  • To facilitate recognition of the order preserving feature of an SSD by the computer system, in one embodiment, the SSD 1 further includes an identifier of order-preserving guarantee which indicates that the SSD 1 is operating in the order-preserving mode. The identifier may be software-accessible, and may be either editable or read-only. Additionally, in a scenario where the identifier is software-accessible and editable, the OPTR SSD may be configured, through editing the identifier, to switch between the order-preserving mode and a conventional mode adopted by the baseline SSD. Moreover, in a scenario where the identifier is software-accessible and read-only, the OPTR SSD may operate only in the order-preserving mode or may automatically determine an operation mode to switch to from the order-preserving mode.
  • In this embodiment, the identifier is a binary code. The processor 21 executing the software is configured to send a query about the identifier to the SSD 1, and to determine that the SSD 1 is operating in the order-preserving mode when it is determined based on a reply to the query that the logical value of the identifier thus queried is one. However, in other embodiments, the identifier may be in a predefined set (e.g., {1}), or may be a predefined set of one or more text strings that indicate a list of product names of the SSDs, or a predefined set of identification values of the SSDs.
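  • For illustration only, a host-side check of the identifier might look like the following sketch, in which ssd_query_order_preserving_id( ) is a hypothetical helper standing in for whatever query mechanism (e.g., a device attribute or a vendor-specific command over the block device interface) a given platform provides; it is not an API defined by this disclosure.

    #include <stdbool.h>

    /* Hypothetical helper: returns the logical value of the order-preserving
     * identifier of the given device, or a negative value on error. */
    extern int ssd_query_order_preserving_id(const char *dev);

    /* The SSD is treated as order-preserving when the identifier is 1, so the
     * software may enable its no-barrier option and omit flush requests. */
    bool ssd_is_order_preserving(const char *dev)
    {
        return ssd_query_order_preserving_id(dev) == 1;
    }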
  • In one embodiment, the computer system or the OPTR SSD is provided with a human-readable indicator indicating that the SSD 1 is operable in the order-preserving mode. The human-readable indicator is a symbol, a picture, a sticker or text on the SSD 1, or a relevant description posted on the Internet. However, implementation of the human-readable indicator is not limited to the disclosure herein and may vary in other embodiments.
  • It is worth noting that using a barrier is a type of order-preserving approach to guaranteeing that two requests received before and after a barrier request are completed in the order in which the two requests are received. In other words, the order of completing the two requests separated by the barrier cannot be altered, and a required partial order of transferring write requests to a disk may be assured, where the partial order means that the order of two requests respectively in a prior request group and a later request group separated by the barrier cannot be interchanged. However, taking backward compatibility into account, most SSDs do not support a barrier request for realizing order preservation. Therefore, the flush request, which forces data in the write cache 13 that was received prior to the flush request to be written into the nonvolatile memory 12, is utilized as a substitute for the barrier request.
  • Referring to FIGS. 2 to 5, embodiments of a method of data synchronization according to the disclosure are illustrated. The method of data synchronization is adapted to be implemented by the computer system as previously described. In these embodiments, the software executed by the computing apparatus 2 includes application software and data management software, where the data management software can be a filesystem or an equivalent.
  • Referring to part (1) of FIG. 2, the filesystem is exemplified by ext4 for the Linux operating system, and the application software is exemplified by the SQLite database management system. For a conventional baseline SSD, the application software issues an instruction of synchronization (i.e., “fdatasync( )” shown in FIG. 2) to the filesystem. In response to receipt of the instruction of synchronization, the filesystem issues a command to transfer a journal (i.e., “Journal” shown in FIG. 2, and the journal can be, e.g., a redo log or an undo log of the filesystem) to the SSD, issues a flush request (i.e., “Flush” shown in FIG. 2) immediately subsequent to issuing the command to transfer the journal, issues a command to transfer a commit record (i.e., “Commit” shown in FIG. 2) to the SSD immediately subsequent to issuing the flush request, and issues another flush request immediately subsequent to issuing the command to transfer the commit record. It is worth noting that the command to transfer the commit record serves as an action or a command to complete a transaction, i.e., to ensure that the transaction is atomic. The flush request between the journal and the commit record is used to prevent the conventional baseline SSD from persisting the commit record prior to the journal because the conventional baseline SSD does not guarantee that the write requests are completed in order and atomically. In contrast, for the OPTR SSD of this disclosure, the flush request between the journal and the commit record is safely omissible since the OPTR SSD guarantees to complete the write requests in order and atomically.
  • In one of the embodiments of the method of data synchronization according to the disclosure, the processor 21 executing the software is configured to read the identifier of the SSD 1 so as to determine whether the SSD 1 is operating in the order-preserving mode, and to enable a no-barrier option of the software to refrain from issuing any flush request when it is determined that the SSD 1 is operating in the order-preserving mode. As a result, the flush requests immediately prior to and immediately subsequent to the command to transfer a commit record are both omitted when the instruction of synchronization is executed. Such an embodiment is illustrated in FIG. 3 and part (4) of FIG. 2. In this embodiment, the filesystem is ext4 with the no-barrier option (denoted by “ext4-o nobarrier” in FIG. 2 and “ext4-nobarrier” in FIG. 6), and the application software is SQLite. It should be noted that implementations of the filesystem and the application software are not limited to the disclosure herein and may vary in other embodiments.
  • In this embodiment, the method of data synchronization according to the disclosure includes steps S11 and S12 outlined below.
  • In step S11, the application software issues to the filesystem an instruction of synchronization for synchronizing the main memory 22 and the SSD 1.
  • In step S12, in response to receipt of the instruction of synchronization, the filesystem issues a command to transfer a journal to the SSD 1, and issues a command to transfer a commit record to the SSD 1 immediately subsequent to issuing the command to transfer the journal. In other words, the command to transfer the journal is issued without being succeeded by a flush request.
  • This embodiment is practical and useful for applications like smartphones, consumer-grade computers, and less-critical database systems such as SQLite.
  • Referring to FIG. 4 and part (2) of FIG. 2, another one of the embodiments of the method of data synchronization according to the disclosure is illustrated. As shown in step S12′, in this embodiment, in response to receipt of the instruction of synchronization issued in step S11 by the application software, SQLite, the filesystem further issues a flush request immediately subsequent to issuing the command to transfer the commit record. With a minor modification to the filesystem (i.e., the filesystem denoted by “ext4′” as shown in FIG. 2), this embodiment may still guarantee the same durability as the conventional approach does in part (1) of FIG. 2.
  • Referring to FIG. 5 and part (3) of FIG. 2, still another one of the embodiments of the method of data synchronization according to the disclosure is illustrated. This embodiment is a variant of that shown in FIG. 4 and part (2) of FIG. 2, and the method of data synchronization further includes steps S13 and S14.
  • In step S13, the application software (i.e., “SQLite′” shown in FIG. 2) issues to the filesystem an instruction of barrier-only synchronization (i.e., “fdatafence( )” in FIG. 2 or other naming such as “fdatabarrier” in other embodiments) for synchronizing the main memory 22 and the SSD 1. In other words, the data management software offers two types of synchronization instructions that involve different numbers of flush commands for the application software to invoke, as shown in FIG. 2.
  • In step S14, in response to receipt of the instruction of barrier-only synchronization, the filesystem issues a command to transfer a journal to the SSD 1, and issues a command to transfer a commit record to the SSD 1 immediately subsequent to issuing the command to transfer the journal.
  • This embodiment additionally provides the instruction of barrier-only synchronization for applications where the barrier is required to define the required partial order of transferring write requests to the SSD, enhancing flexibility of applying the method of data synchronization according to the disclosure. Moreover, the instruction of synchronization can be used sparingly to improve performance of the overall computer system.
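  • For illustration only, the two synchronization paths described above may be sketched as follows; issue_write_journal( ), issue_write_commit_record( ) and issue_flush( ) are assumed placeholders for the commands shown in FIG. 2, not functions defined by this disclosure.

    /* Two synchronization paths offered on an order-preserving SSD. */
    extern void issue_write_journal(void);       /* transfer the journal       */
    extern void issue_write_commit_record(void); /* transfer the commit record */
    extern void issue_flush(void);               /* standard flush request     */

    /* fdatasync( )-style path of part (2) of FIG. 2: the OPTR SSD already
     * orders the journal before the commit record, so only the trailing flush
     * is kept to make the transaction durable (step S12'). */
    void sync_with_durability(void)
    {
        issue_write_journal();
        issue_write_commit_record();   /* no flush needed in between */
        issue_flush();
    }

    /* fdatafence( )-style path of part (3) of FIG. 2 (steps S13 and S14):
     * defines the required partial order without forcing immediate durability. */
    void sync_barrier_only(void)
    {
        issue_write_journal();
        issue_write_commit_record();   /* no flush at all */
    }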
  • Referring to FIG. 6, performance of the method of data synchronization according to the disclosure is illustrated. In the plot shown in FIG. 6, the vertical axis indicates speed performance, and the horizontal axis indicates extent of system changes. Evidently, compared with the conventional approach which is implemented with the baseline SSD (indicated by circle “1”), the method of data synchronization according to the disclosure (indicated by circles “2”, “3” and “4”) achieves significantly better speed performance at the cost of minor system changes. Moreover, compared with the conventional approach which is implemented with the baseline SSD (indicated by circles “1” and “5”), by utilizing the OPTR SSD, the method of data synchronization according to the disclosure (indicated by circles “2”, “3” and “4”) realizes stronger data integrity by providing stronger request-level crash guarantee.
  • Referring to FIGS. 7 to 11, an embodiment of a method for facilitating recovery from a crash of the aforementioned SSD 1 according to the disclosure is illustrated. The method according to the disclosure includes several mechanisms, e.g., write completion tracking, write coalescing tracking, mapping table checkpointing, garbage collection and order-preserving recovery, that will be described in the following paragraphs.
  • The request-level crash guarantee provided by the OPTR SSD according to the disclosure features request atomicity, prefix semantics and flush semantics.
  • Request atomicity guarantees that each write request received by the SSD 1 is atomic regardless of the request size (i.e., the number of sectors to be written). To ensure request atomicity, the method provides different strategies to determine completion of a write request respectively for cases where no page-coalescing occurs and for cases where page-coalescing occurs.
  • Regarding the write completion tracking, for cases where no page-coalescing occurs, based on the fact that a write request which involves N pages is completed if and only if those N pages exist in the flash chip after a crash, the method includes steps S211 to S215 as shown in FIG. 7 and outlined below.
  • In step S211, the SSD controller 11 assigns, according to an order in which the write request was received, a write request identifier (WID) in the spare area of each written flash page that is written with the data (there would be at least one written flash page). The WID is a unique sequence number for the write request, and increases monotonically with the order in which write requests are received. In this embodiment, the WID is an 8-byte integer.
  • In step S212, the SSD controller 11 assigns a request size in the spare area of each of the at least one written flash page. The request size indicates a total number of the at least one of the flash pages in which the write request is to write the data. In this embodiment, the request size is expressed by a 4-byte integer. It should be noted that the order of executing steps S211 and S212 can be interchanged.
  • In step S213, the SSD controller 11 counts a number of appearances of the WID in the at least one written flash page to result in a WID count. It should be noted that step S213 is executed after occurrence of a crash.
  • In step S214, the SSD controller 11 determines whether the WID count is equal to the request size.
  • When it is determined that the WID count is equal to the request size, in step S215, the SSD controller 11 determines that the write request is completed and is eligible for recovery after a crash of the SSD 1.
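  • For illustration only, the completion check of steps S213 to S215 may be sketched as the following C fragment; the spare_area_t layout and the function name are assumptions of this sketch, and the spare areas are assumed to have been read back from the flash chip during recovery.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t wid;   /* write request identifier (8-byte integer)    */
        uint32_t size;  /* request size in flash pages (4-byte integer) */
    } spare_area_t;

    /* Decide whether the write request with identifier 'wid' is completed and
     * therefore eligible for recovery after a crash. */
    bool write_request_is_complete(const spare_area_t *spares, size_t num_pages,
                                   uint64_t wid)
    {
        uint32_t wid_count = 0;     /* step S213: appearances of the WID */
        uint32_t request_size = 0;

        for (size_t i = 0; i < num_pages; i++) {
            if (spares[i].wid == wid) {
                wid_count++;
                request_size = spares[i].size;  /* every copy carries the size */
            }
        }
        /* steps S214 and S215: complete iff all of its pages exist in flash. */
        return wid_count > 0 && wid_count == request_size;
    }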
  • Regarding the write coalescing tracking, two or more write requests may coalesce in the write cache of an SSD, and the write requests thus involved are referred to as coalesced write requests. This situation reduces a count of appearance(s) of the WID in the written flash page(s). When the SSD controller 11 receives a prior write request and a later write request both of which are to be coalesced in the write cache 13, the method according to the disclosure includes steps S221 to S226 for determining whether the prior write request is incomplete, as shown in FIG. 8 and outlined below.
  • In step S221, for each of cache pages in the write cache 13 used to cache data corresponding to the prior write request, the SSD controller 11 tags the cache page with a dirty flag, a WID tag and a size tag. The dirty flag indicates whether the cache page is a coalesced page which is coalesced with a cache page used to cache data corresponding to the later write request. The WID tag stores a WID of the prior write request. The size tag stores a request size which indicates a total number of flash pages in which the prior write request is to write the data.
  • In step S222, for each of the coalesced pages that is used to cache data corresponding to the prior write request and that is coalesced with the cache pages corresponding to the later write request, the SSD controller 11 records a page-coalescing record which contains the WID of the prior write request, the request size corresponding to the prior write request, and a WID of the later write request. In this embodiment, the page-coalescing record is initially recorded in a DRAM buffer of the SSD 1, and will be eventually transferred to a reserved block of the flash chip when an amount of accumulation of the page-coalescing records reaches a capacity of a flash page of the flash chip.
  • Referring to an example of coalescing records shown in FIG. 9 for explanation, “<3, 7>, 6” denotes that a prior write request whose WID is 3 coalesces with a later write request whose WID is 7, and the request size of the prior write request is 6 pages.
  • In step S223, the SSD controller 11 counts a number of appearances of the WID of the prior write request in all written flash page(s) written with data of the prior write request (there would be at least one written flash page) to result in a WID count for the prior write request. It should be noted that step S223 is executed after occurrence of a crash.
  • In step S224, the SSD controller 11 counts a number of appearances of the WID of the prior write request in the page-coalescing records for the coalesced pages to result in a page-coalescing count corresponding to the prior write request. It should be noted that the order of executing steps S223 and S224 can be interchanged.
  • In step S225, the SSD controller 11 determines whether a sum of the WID count for the prior write request and the page-coalescing count corresponding to the prior write request is smaller than the request size corresponding to the prior write request.
  • When it is determined that the sum of the WID count for the prior write request and the page-coalescing count corresponding to the prior write request is smaller than the request size corresponding to the prior write request, in step S226, the SSD controller 11 determines that the prior write request is incomplete and is ineligible for recovery after a crash of the SSD 1. In mathematical expression, a coalesced write request with WID=i is incomplete if P_i + D_i < Size_i, where P_i represents the number of written flash pages assigned with WID=i, D_i represents the number of recorded <x, y> pairs in the page-coalescing records with x=i, and Size_i represents the request size corresponding to the coalesced write request with WID=i.
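  • For illustration only, the check of steps S223 to S226 (P_i + D_i < Size_i) may be sketched as follows; the coalesce_record_t layout and the function name are assumptions of this sketch.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t prior_wid;   /* x: WID of the prior (overwritten) request   */
        uint64_t later_wid;   /* y: WID of the later (overwriting) request   */
        uint32_t prior_size;  /* request size of the prior request, in pages */
    } coalesce_record_t;

    bool prior_request_is_incomplete(uint64_t wid, uint32_t request_size,
                                     const uint64_t *flash_wids, size_t num_pages,
                                     const coalesce_record_t *recs, size_t num_recs)
    {
        uint32_t p = 0;   /* P_i: flash pages tagged with this WID (step S223) */
        uint32_t d = 0;   /* D_i: page-coalescing records <i, y> (step S224)   */

        for (size_t i = 0; i < num_pages; i++)
            if (flash_wids[i] == wid)
                p++;

        for (size_t i = 0; i < num_recs; i++)
            if (recs[i].prior_wid == wid)
                d++;

        return (p + d) < request_size;   /* steps S225 and S226 */
    }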
  • To satisfy prefix semantics so as to ensure that the order of write requests may be preserved, the SSD 1 does not make a write request durable unless all the write requests received previously by the SSD 1 are durable. Therefore, the method further includes a step in which the SSD controller 11 refrains from making the later write request durable until it is determined that the prior write request is durable. In one embodiment, in response to receipt of a query, the SSD controller 11 transmits an indicator indicating that the SSD controller 11 refrains from making the later write request durable until it is determined that the prior write request is durable.
  • Flush semantics guarantee durability to all write requests that are received prior to a flush request. Therefore, the method further includes a step in which, when the SSD controller 11 receives a flush request from the host after receiving the write request, the SSD controller 11 refrains from acknowledging the flush request until it is determined that the write request is completed.
  • Regarding the mapping table checkpointing, the L2P mapping table is checkpointed to the flash chip to speed up recovery from a crash. The method according to the disclosure keeps a full checkpoint, which snapshots the entirety of the L2P mapping table, and at least one incremental checkpoint, which records only the differences in the L2P mapping table that have occurred since the latest checkpoint (either the full checkpoint or an incremental checkpoint). Referring to FIGS. 10 and 11, the method includes steps S31 to S33 outlined below.
  • In step S31, the SSD controller 11 assigns, for each written flash page that is written with the data (there would be at least one written flash page), an LPN in the spare area of the written flash page in addition to the WID and the request size assigned in the spare area.
  • In step S32, the SSD controller 11 establishes a full checkpoint through storing an entirety of the L2P mapping table in a reserved block of the blocks of the flash chip. The full checkpoint contains a correspondence relationship between the LPN and the PPN for each of the at least one written flash page.
  • In step S33, the SSD controller 11 establishes an incremental checkpoint through storing a revised portion of the L2P mapping table that was revised after the latest checkpoint was established, where the latest checkpoint is whichever of the full checkpoint and the incremental checkpoint(s) was established most recently. As shown in FIG. 11, each of the full checkpoint and the incremental checkpoint(s) contains a seal page that records (i) a WID corresponding to a latest write request at the time the corresponding one of the full checkpoint and the incremental checkpoint(s) was established, and (ii) a PPN corresponding to a next free flash page of the flash pages to serve as a page pointer at the time the corresponding one of the full checkpoint and the incremental checkpoint(s) was established. For an SSD 1 including multiple flash chips each of which may include a block being written, the page pointer recorded in the seal page may be plural in number when multiple blocks are written at the same time.
  • It is worth noting that the method according to the disclosure employs incremental checkpoints by default. When the space for storing incremental checkpoints is full, the method according to the disclosure creates a new full checkpoint and then clears the incremental checkpoints. Moreover, the method according to the disclosure employs a shadow for the full checkpoint to ensure integrity of mapping table checkpointing, and the WID can be used to determine the recency between the full and incremental checkpoints after a crash. When the shadow is employed, an immediately previous one of the full checkpoints is kept until the written data that corresponds to a current one of the full checkpoints is ensured to be free from damage.
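  • For illustration only, the on-flash checkpoint metadata described in steps S31 to S33 may be sketched with the following C types; the field names, sizes, and the MAX_OPEN_BLOCKS bound are assumptions made for this sketch.

    #include <stdint.h>

    #define MAX_OPEN_BLOCKS 4   /* assumed number of concurrently written blocks */

    typedef struct {
        uint32_t lpn;           /* logical page number                          */
        uint32_t ppn;           /* physical page number it now maps to          */
    } l2p_delta_t;              /* one entry of an incremental checkpoint       */

    typedef struct {
        uint64_t latest_wid;    /* WID of the latest write request when sealed  */
        uint32_t page_pointer[MAX_OPEN_BLOCKS]; /* next free PPN(s) to scan from */
        uint32_t num_page_pointers;
    } seal_page_t;

    typedef struct {
        uint8_t     is_full;      /* 1: full checkpoint, 0: incremental         */
        uint32_t    num_entries;  /* whole L2P table, or changed entries only   */
        /* ... checkpointed L2P data pages follow on flash ...                  */
        seal_page_t seal;         /* written last; marks the checkpoint intact  */
    } checkpoint_hdr_t;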
  • Regarding the order-preserving recovery, the crash recovery of the SSD 1 according to the disclosure is related to rebuilding the L2P mapping table, and the method according to the disclosure includes steps S41 to S46 outlined below with reference to FIG. 12.
  • In step S41, the SSD controller 11 reestablishes the entirety of the L2P mapping table by retrieving the full checkpoint stored in the reserved block.
  • In step S42, the SSD controller 11 revises the L2P mapping table thus established by incorporating the revised portion(s) of the L2P mapping table contained in the incremental checkpoint(s) into the L2P mapping table thus reestablished.
  • In step S43, the SSD controller 11 counts, for each of write requests received after establishment of the latest checkpoint, a number of appearances of a WID corresponding to the write request in subsequent flash pages written with the data of the write request based on the PPN recorded in the seal page of the latest checkpoint, so as to result in a post-crash WID count to indicate a total number of appearances of the WID in the subsequent flash pages.
  • In step S44, the SSD controller 11 determines, for each of the write requests received after establishment of the latest checkpoint, whether the write request is completed based on the post-crash WID count and the request size corresponding to the write request.
  • In step S45, the SSD controller 11 recovers a group of the write requests received after establishment of the latest checkpoint by using a recovery determination procedure which is related to completeness of the write requests.
  • In step S46, the SSD controller 11 updates the L2P mapping table thus revised by incorporating changes of correspondence relationships between the LPNs and the PPNs of written flash pages related to the group of the write requests thus recovered.
  • Specifically speaking, referring to FIG. 13, the recovery determination procedure used in step S45 includes sub-steps S451 to S454 outlined below.
  • In sub-step S451, the SSD controller 11 arranges the write requests received after establishment of the latest checkpoint in an order the write requests were received.
  • In sub-step S452, the SSD controller 11 determines, for every consecutive two of the write requests, whether the consecutive two of the write requests are coalesced.
  • In sub-step S453, the SSD controller 11 determines at least one cut, with each cut being between the write requests of a consecutive pair, where there is no coalescing for either of the write requests in the consecutive pair, and the write requests before the cut are all completed. In one embodiment, the write requests before the at least one cut serve as the group of the write requests to be recovered.
  • In sub-step S454, the SSD controller 11 determines an optimum cut from among the at least one cut, where a number of the write requests before the optimum cut is the greatest among the at least one cut, and the write requests before the optimum cut serve as the group of the write requests to be recovered.
  • Referring to an example shown in FIG. 14 for explanation, there are six write requests after the latest checkpoint, and some of the data of the write requests has arrived in the flash chip. Potential cuts may occur at seven places “1” to “7”, but only places “1” and “3” qualify as cuts because the two write requests adjacent to each of places “1” and “3” are not coalesced with each other, and because the write requests before each of places “1” and “3” are all completed. Further, cut “3” is determined to be the optimum cut because the number of write requests before cut “3” is the greater of the two cuts “1” and “3”.
  • Referring to Table 1, an example of pseudocode of the recovery determination procedure is illustrated.
  • TABLE 1
    Find optimal recovery point
    Input: wid_inc, C
    Output: the optimal recovery point
    1: wid_rec ← wid_inc;
    2: Sort C by x in descending order;
    3: for c ∈ C do
    4:   if c.x < wid_rec ∧ c.y > wid_rec then
    5:     wid_rec ← c.x;
    6:   end if
    7: end for
    8: return wid_rec;
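  • For illustration only, the Table 1 procedure may be rendered in C as follows, under the assumption that wid_inc denotes the initial candidate recovery point determined from the write-completion checks (delimiting the longest completed prefix) and C is the set of page-coalescing pairs <x, y>; the pair_t layout and the function names are assumptions of this sketch.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t x;   /* WID of the prior (overwritten) write request */
        uint64_t y;   /* WID of the later (overwriting) write request */
    } pair_t;

    /* qsort comparator: sort the pairs by x in descending order (line 2). */
    static int cmp_x_desc(const void *a, const void *b)
    {
        const pair_t *pa = a, *pb = b;
        if (pa->x < pb->x) return 1;
        if (pa->x > pb->x) return -1;
        return 0;
    }

    uint64_t find_optimal_recovery_point(uint64_t wid_inc, pair_t *c, size_t n)
    {
        uint64_t wid_rec = wid_inc;                  /* line 1 */

        qsort(c, n, sizeof(pair_t), cmp_x_desc);     /* line 2 */

        for (size_t i = 0; i < n; i++) {             /* lines 3-7 */
            /* A coalescing pair that straddles the candidate point pulls the
             * recovery point back to the prior request's WID (lines 4-5). */
            if (c[i].x < wid_rec && c[i].y > wid_rec)
                wid_rec = c[i].x;
        }
        return wid_rec;                              /* line 8 */
    }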
  • It should be noted that to meet requirements of the prefix semantics during recovery from a crash, in a scenario where the SSD controller 11 receives a prior write request and a later write request, the SSD controller 11 refrains from keeping the later write request until it is determined that the prior write request is completed.
  • Regarding the garbage collection, as in-place updates are forbidden in the flash chip of SSDs, overwriting data is done by writing the updated data to a free flash page and leaving the outdated data in the original flash page, which is then called an invalid flash page. The invalid flash pages will be reclaimed by a dedicated routine, called garbage collection (GC), for further reuse. However, some of the invalid flash pages reclaimed by GC may be important to crash recovery; that is, this method may leverage these invalid flash pages to recover the OPTR SSD from a crash to an order-preserved state. Therefore, two constraints are enforced on GC, and the method according to the disclosure further includes the following two steps to respectively implement the two constraints.
  • In one step, while performing garbage collection, the SSD controller 11 refrains from reclaiming one of the flash pages that is written after establishment of the latest checkpoint. It should be noted that all write requests before a flush request should be durable and atomic, so this constraint prevents a violation of the flush semantics where flash pages written prior to a flush request but after the latest checkpoint are reclaimed by the GC, obstructing determination of completion of the write requests after the latest checkpoint.
  • In another step, the SSD controller 11 performs an internal flush on the write cache 13 before performing garbage collection. Performing the internal flush ensures that each of the flash pages reclaimed by the GC has a stable counterpart that can always survive after a crash. Therefore, tasks of GC can be simplified. The performance penalty of the internal flush is mitigated by conducting GC on a batch of blocks (16 blocks in this embodiment).
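  • For illustration only, the two GC constraints may be sketched as follows; the helper functions and the victim-selection policy are assumptions made for this sketch and are not prescribed by the disclosure.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define GC_BATCH_BLOCKS 16   /* batch size used in this embodiment */

    /* Assumed helpers provided elsewhere in the firmware. */
    extern void   write_cache_internal_flush(void);
    extern size_t pick_gc_victims(uint32_t *blocks, size_t max_blocks);
    extern bool   block_written_after_latest_checkpoint(uint32_t block);
    extern void   reclaim_block(uint32_t block);

    void garbage_collect(void)
    {
        uint32_t victims[GC_BATCH_BLOCKS];

        /* Constraint 2: flush the write cache internally first, so that every
         * reclaimed page has a stable counterpart that survives a crash. */
        write_cache_internal_flush();

        size_t n = pick_gc_victims(victims, GC_BATCH_BLOCKS);
        for (size_t i = 0; i < n; i++) {
            /* Constraint 1: never reclaim a block holding pages written after
             * the latest checkpoint; they may still be needed for
             * order-preserving recovery. */
            if (block_written_after_latest_checkpoint(victims[i]))
                continue;
            reclaim_block(victims[i]);
        }
    }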
  • In summary, the method for facilitating recovery from a crash of an SSD according to the disclosure realizes some of the transactional properties (i.e., atomicity and durability) on the SSD with the standard block device interface by modifying firmware (FTL) of the SSD to result in the OPTR SSD according to the disclosure. The OPTR SSD is endowed with strong request-level crash guarantees: a write request is not made durable unless all its prior write requests are durable; each write request is atomic; and all write requests prior to a flush request are guaranteed durable. Consequently, SSD performance may be maintained while achieving an equivalent effect that write requests are completed in order and atomically. As a result, the number of valid post-crash results can be effectively confined and significantly reduced, facilitating tasks of recovery from a crash by applications or filesystems.
  • For the purposes of explanation, a scenario is given as an example where the SSD controller 11 receives a first write request to update a first address range of the nonvolatile memory 12 of the SSD 1 by writing data of the first write request in the first address range, and a second write request to update a second address range of the nonvolatile memory 12 of the SSD 1 by writing data of the second write request in the second address range, wherein the second write request is issued by the host later than the first write request. Additionally, there is no flush request in between the first write request and the second write request, and no barrier request in between the first write request and the second write request. Referring to Table 2 below, in response to a read request to read data in the first address range and the second address range, the SSD controller 11 returns readout data which is guaranteed to belong to one of cases No. 1, 2, 3, 6 and 9 when partial update is allowed (see sub-column A in the last column of Table 2), or belong to one of cases No. 1, 3 and 9 when partial update is not allowed (see sub-column B in the last column of Table 2).
  • TABLE 2
                                  Readout data
    Case No.  1st address range    2nd address range    A   B
    1         Fully updated        Fully updated        V   V
    2         Fully updated        Partially updated    V
    3         Fully updated        Not updated at all   V   V
    4         Partially updated    Fully updated
    5         Partially updated    Partially updated
    6         Partially updated    Not updated at all   V
    7         Not updated at all   Fully updated
    8         Not updated at all   Partially updated
    9         Not updated at all   Not updated at all   V   V
  • To further explain, referring to FIG. 15, a comparison of the number of valid post-crash results for crash recovery between the baseline SSD and the OPTR SSD according to the disclosure is demonstrated. An SSD has four sectors that initially store four version numbers, “0”, “0”, “0” and “0”, respectively. The SSD received four write requests and a flush request before a crash occurred, i.e., “write(0,2)”, “write(1,2)”, “flush”, “write (0, 4)” and “write (2, 2)” in order, where each of the write requests is specified by a logical block address (LBA) and a size in parentheses. Under an assumption that the version number in each of the sectors is increased by one once the sector is written by one of the write requests, the baseline SSD can exhibit 2×2×3×3=36 valid post-crash results because the order of the write requests and the order of sectors being written are not preserved. In contrast, the OPTR SSD according to the disclosure guarantees that the write requests are completed in order and atomically, so the number of valid post-crash results is significantly reduced to three.
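  • For illustration only, the counts in the FIG. 15 example can be reproduced with the short, self-contained program below; it simply multiplies the per-sector possibilities for the baseline SSD and counts the whole-request prefixes for the OPTR SSD.

    #include <stdio.h>

    int main(void)
    {
        /* Sector versions made durable by the flush: write(0,2) and write(1,2). */
        const int after_flush[4] = {1, 2, 1, 0};

        /* Baseline SSD: for each sector, count how many post-flush writes touch
         * it; each such write may or may not have reached that sector, so a
         * sector touched k times can end up in k + 1 different versions. */
        const int touches[4] = {1, 1, 2, 2};  /* write(0,4): sectors 0-3; write(2,2): sectors 2-3 */
        long baseline = 1;
        for (int s = 0; s < 4; s++)
            baseline *= touches[s] + 1;       /* 2 x 2 x 3 x 3 = 36 */

        /* OPTR SSD: only whole-request, in-order prefixes of the two post-flush
         * writes are possible, i.e., none, the first, or both. */
        const int post_flush_writes = 2;
        long optr = post_flush_writes + 1;    /* 3 */

        printf("state persisted by the flush: %d %d %d %d\n",
               after_flush[0], after_flush[1], after_flush[2], after_flush[3]);
        printf("baseline valid post-crash results: %ld\n", baseline);
        printf("OPTR valid post-crash results:     %ld\n", optr);
        return 0;
    }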
  • Since the crash guarantees provided by the SSD 1 according to the disclosure are clear, the chances for developers of future application software or filesystems to make mistakes may be reduced.
  • Moreover, benefiting from such strong request-level crash guarantees, the operational efficiency of a computer system may be improved by removing unnecessary flush requests that would otherwise be issued by the filesystem in response to receipt of the instruction of synchronization issued by the application software.
  • In the description above, for the purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of the embodiment. It will be apparent, however, to one skilled in the art, that one or more other embodiments may be practiced without some of these specific details. It should also be appreciated that reference throughout this specification to “one embodiment,” “an embodiment,” an embodiment with an indication of an ordinal number and so forth means that a particular feature, structure, or characteristic may be included in the practice of the disclosure. It should be further appreciated that in the description, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects, and that one or more features or specific details from one embodiment may be practiced together with one or more features or specific details from another embodiment, where appropriate, in the practice of the disclosure.
  • While the disclosure has been described in connection with what is considered the exemplary embodiment, it is understood that this disclosure is not limited to the disclosed embodiment but is intended to cover various arrangements included within the spirit and scope of the broadest interpretation so as to encompass all such modifications and equivalent arrangements.

Claims (20)

What is claimed is:
1. A method of data synchronization, to be implemented by a computer system that includes a computing apparatus executing an application software and data management software and including a main memory, and that includes a solid-state storage device (SSD) communicable with the computing apparatus, the method comprising:
issuing, by the application software, to the data management software an instruction of synchronization for synchronizing the main memory and the SSD; and
by the data management software in response to receipt of the instruction of synchronization, issuing a command to transfer a journal to the SSD, and issuing a command to transfer a commit record to the SSD immediately subsequent to issuing the command to transfer the journal.
2. The method of data synchronization as claimed in claim 1, further comprising:
by the data management software in response to receipt of the instruction for synchronization, issuing a flush request immediately subsequent to issuing the command to transfer the commit record.
3. The method of data synchronization as claimed in claim 2, the method further comprising:
issuing, by the application software to the data management software, an instruction of barrier-only synchronization for synchronizing the main memory and the SSD; and
by the data management software in response to receipt of the instruction of barrier-only synchronization, issuing a command to transfer a journal to the SSD, and issuing a command to transfer a commit record to the SSD immediately subsequent to issuing the command to transfer the journal.
4. The method of data synchronization as claimed in claim 3, the SSD including a nonvolatile memory and an SSD controller, the nonvolatile memory including a plurality of pages each of which has a spare area, the SSD controller receiving from a host a write request to write data in at least one of the pages, the method further comprising a procedure for facilitating recovery from a crash of the SSD, the procedure includes:
assigning, by the SSD controller according to an order in which the write request was received, a write request identifier (WID) in the spare area of each of at least one written page that is written with the data, where the WID is a unique sequence number for the write request;
assigning, by the SSD controller, a request size in the spare area of each of the at least one written page, where the request size indicates a total number of the at least one of the pages in which the write request is to write the data;
counting, by the SSD controller, a number of appearances of the WID in the at least one written page to result in a WID count;
determining, by the SSD controller, whether the WID count is equal to the request size; and
determining, by the SSD controller, that the write request is completed and is eligible for recovery after a crash of the SSD when it is determined that the WID count is equal to the request size.
5. The method of data synchronization as claimed in claim 1, wherein the data management software is a filesystem.
6. The method of data synchronization as claimed in claim 1, further comprising:
sending, by the data management software, a query to the SSD; and
receiving, by the data management software from the SSD, an indicator responding to the query and indicating that the SSD is in a mode in which the SSD refrains from making the commit record durable until the SSD determines that the journal is durable.
7. The method of data synchronization as claimed in claim 1, wherein in the step of issuing a command to transfer a commit record to the SSD, the command to transfer a commit record is issued without being succeeded by a flush request.
8. A method for facilitating recovery from a crash of a solid-state storage device (SSD), the SSD including a nonvolatile memory and an SSD controller, the nonvolatile memory including a plurality of pages each of which has a spare area, the SSD controller receiving from a host a write request to write data in at least one of the pages, the method comprising:
assigning, by the SSD controller according to an order in which the write request was received, a write request identifier (WID) in the spare area of each of at least one written page that is written with the data, where the WID is a unique sequence number for the write request;
assigning, by the SSD controller, a request size in the spare area of each of the at least one written page, where the request size indicates a total number of the at least one of the pages in which the write request is to write the data;
counting, by the SSD controller, a number of appearances of the WID in the at least one written page to result in a WID count;
determining, by the SSD controller, whether the WID count is equal to the request size; and
determining, by the SSD controller, that the write request is completed and is eligible for recovery after a crash of the SSD when it is determined that the WID count is equal to the request size.
9. The method for facilitating recovery from a crash as claimed in claim 8, the SSD further including a write cache, the SSD controller further receiving a prior write request and a later write request both of which are to be coalesced in the write cache, the method comprising:
by the SSD controller for each of the coalesced pages used to cache data corresponding to the prior write request and coalesced with the cache pages corresponding to the later write request, recording a page-coalescing record which contains the WID of the prior write request, the request size corresponding to the prior write request, and a WID of the later write request;
counting, by the SSD controller, a number of appearances of the WID of the prior write request in at least one written page that is written with data of the prior write request to result in a WID count for the prior write request;
counting, by the SSD controller, a number of appearances of the WID of the prior write request in the page-coalescing records for the coalesced pages to result in a page-coalescing count corresponding to the prior write request;
determining, by the SSD controller, whether a sum of the WID count for the prior write request and the page-coalescing count corresponding to the prior write request is smaller than the request size corresponding to the prior write request; and
determining, by the SSD controller, that the prior write request is incomplete and is ineligible for recovery after a crash of the SSD when it is determined that the sum of the WID count for the prior write request and the page-coalescing count corresponding to the prior write request is smaller than the page size requested by the prior write request.
10. The method for facilitating recovery from a crash as claimed in claim 9, the method further comprising:
by the SSD controller for each of cache pages in the write cache used to cache data corresponding to the prior write request, tagging the cache page with a dirty flag, a WID tag and a size tag, where the dirty flag indicates whether the cache page is a coalesced page which is coalesced with a cache page used to cache data corresponding to the later write request, the WID tag stores a WID of the prior write request, and the size tag stores a request size which indicates a total number of pages in which the prior write request is to write the data.
11. The method for facilitating recovery from a crash as claimed in claim 8, the SSD controller further receiving a prior write request and a later write request, the method further comprising:
refraining, by the SSD controller, from making the later write request durable until it is determined that the prior write request is durable.
12. The method for facilitating recovery from a crash as claimed in claim 11, further comprising:
transmitting, by the SSD controller in response to receipt of a query, an indicator indicating that the SSD controller is in a mode that the SSD controller refrains from making the later write request durable until it is determined that the prior write request is durable.
13. The method for facilitating recovery from a crash as claimed in claim 8, the SSD controller receiving a flush request from the host after receiving the write request, the method further comprising:
refraining, by the SSD controller, from acknowledging the flush request until it is determined that the write request is completed.
14. The method for facilitating recovery from a crash as claimed in claim 8, the SSD controller further receiving a prior write request and a later write request, the method further comprising:
refraining, by the SSD controller, from keeping the later write request until it is determined that the prior write request is completed.
15. The method for facilitating recovery from a crash as claimed in claim 8, the nonvolatile memory further including a plurality of blocks that include the pages, the method comprising:
by the SSD controller for each of the at least one written page that is written with the data, assigning a logical page number (LPN) in the spare area of the written page;
establishing, by the SSD controller, a full checkpoint through storing an entirety of a logical-to-physical (L2P) mapping table in at least one of the blocks of the nonvolatile memory, the full checkpoint containing a correspondence relationship between the LPN and a physical page number (PPN) for each of the at least one written page; and
establishing, by the SSD controller, an incremental checkpoint through storing a revised portion of the L2P mapping table revised after a latest checkpoint, which is one of the full checkpoint and the incremental checkpoint that was established the last, was established,
wherein each of the full checkpoint and the incremental checkpoint contains a seal page that records a WID corresponding to a latest write request at the time the corresponding one of the full checkpoint and the incremental checkpoint was established, and a PPN corresponding to a next free page of the pages at the time the corresponding one of the full checkpoint and the incremental checkpoint was established.
16. The method for facilitating recovery from a crash as claimed in claim 15, further comprising:
reestablishing, by the SSD controller, the entirety of the L2P mapping table by retrieving the full checkpoint stored in the at least one of the blocks;
revising, by the SSD controller, the L2P mapping table thus established by incorporating the revised portion of the L2P mapping table contained in the incremental checkpoint into the L2P mapping table thus reestablished;
by the SSD controller for each of write requests received after establishment of the latest checkpoint, counting a number of appearances of a WID corresponding to the write request in subsequent pages written with the data of the write request based on the PPN recorded in the seal page of the latest checkpoint to result in a post-crash WID count that indicates a total number of appearances the WID in the subsequent pages;
by the SSD controller for each of the write requests received after establishment of the latest checkpoint, determining whether the write request is completed based on the post-crash WID count and the request size corresponding to the write request;
recovering, by the SSD controller, a group of the write requests received after establishment of the latest checkpoint by using a recovery determination procedure which is related to completeness of the write requests; and
updating, by the SSD controller, the L2P mapping table thus revised by incorporating changes of correspondence relationships between the LPNs and the PPNs of written pages related to the group of the write requests thus recovered.
17. The method for facilitating recovery from a crash as claimed in claim 16, wherein the recovery determination procedure includes:
arranging, by the SSD controller, the write requests received after establishment of the latest checkpoint in an order the write requests were received;
by the SSD controller for every consecutive two of the write requests, determining whether the consecutive two of the write requests are coalesced; and
determining, by the SSD controller, at least one cut, with each of the at least one cut being between the write requests of a consecutive pair, where there is no coalescing for either of the write requests in the consecutive pair, and the write requests before the cut are all completed and serve as the group of the write requests to be recovered.
18. The method for facilitating recovery from a crash as claimed in claim 17, wherein the recovery determination procedure further includes:
determining, by the SSD controller, an optimum cut from among the at least one cut, where a number of the write requests before the optimum cut is the greatest among the at least one cut, and the write requests before the optimum cut serve as the group of the write requests to be recovered.
19. The method for facilitating recovery from a crash as claimed in claim 15, further comprising:
refraining, by the SSD controller while performing garbage collection, from reclaiming one of the pages that is written after establishment of the latest checkpoint.
20. The method for facilitating recovery from a crash as claimed in claim 8, the SSD further including a write cache, the method further comprising:
performing, by the SSD controller, internal flush on the write cache before performing garbage collection.
US17/572,077 2019-02-23 2022-01-10 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device Abandoned US20220129420A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/572,077 US20220129420A1 (en) 2019-02-23 2022-01-10 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962809580P 2019-02-23 2019-02-23
US201962873253P 2019-07-12 2019-07-12
US16/797,944 US11263180B2 (en) 2019-02-23 2020-02-21 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device
US17/572,077 US20220129420A1 (en) 2019-02-23 2022-01-10 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/797,944 Continuation US11263180B2 (en) 2019-02-23 2020-02-21 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device

Publications (1)

Publication Number Publication Date
US20220129420A1 true US20220129420A1 (en) 2022-04-28

Family

ID=72141683

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/797,944 Active US11263180B2 (en) 2019-02-23 2020-02-21 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device
US17/572,077 Abandoned US20220129420A1 (en) 2019-02-23 2022-01-10 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/797,944 Active US11263180B2 (en) 2019-02-23 2020-02-21 Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device

Country Status (2)

Country Link
US (2) US11263180B2 (en)
TW (2) TWI774388B (en)

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7966462B2 (en) * 1999-08-04 2011-06-21 Super Talent Electronics, Inc. Multi-channel flash module with plane-interleaved sequential ECC writes and background recycling to restricted-write flash chips
US7415703B2 (en) * 2003-09-25 2008-08-19 International Business Machines Corporation Loading software on a plurality of processors
US8452929B2 (en) * 2005-04-21 2013-05-28 Violin Memory Inc. Method and system for storage of data in non-volatile media
US8799582B2 (en) * 2008-12-30 2014-08-05 Intel Corporation Extending cache coherency protocols to support locally buffered data
US9122579B2 (en) * 2010-01-06 2015-09-01 Intelligent Intellectual Property Holdings 2 Llc Apparatus, system, and method for a storage layer
US8788842B2 (en) * 2010-04-07 2014-07-22 Apple Inc. System and method for content protection based on a combination of a user PIN and a device specific identifier
TW201142621A (en) * 2010-05-17 2011-12-01 Chunghwa Telecom Co Ltd Cross information system data synchronization method and system
WO2012083308A2 (en) * 2010-12-17 2012-06-21 Fusion-Io, Inc. Apparatus, system, and method for persistent data management on a non-volatile storage media
US9274937B2 (en) * 2011-12-22 2016-03-01 Longitude Enterprise Flash S.A.R.L. Systems, methods, and interfaces for vector input/output operations
US10359972B2 (en) * 2012-08-31 2019-07-23 Sandisk Technologies Llc Systems, methods, and interfaces for adaptive persistence
US8950009B2 (en) * 2012-03-30 2015-02-03 Commvault Systems, Inc. Information management of data associated with multiple cloud services
US10025704B2 (en) * 2013-12-27 2018-07-17 Crossbar, Inc. Memory system including PE count circuit and method of operating the same
US10073902B2 (en) * 2014-09-24 2018-09-11 Microsoft Technology Licensing, Llc Snapshot and replication of a multi-stream application on multiple hosts at near-sync frequency
US20160132251A1 (en) * 2014-11-11 2016-05-12 Wisconsin Alumni Research Foundation Operating method of storage device and data writing method for writing data into storage device
US10108503B2 (en) * 2015-08-24 2018-10-23 Western Digital Technologies, Inc. Methods and systems for updating a recovery sequence map
TWI604308B (en) * 2015-11-18 2017-11-01 Silicon Motion, Inc. Data storage device and data maintenance method thereof
TW201727491A (en) * 2016-01-20 2017-08-01 后旺科技股份有限公司 Accessing method of composite hard-disk drive
CN107295080B (en) * 2017-06-19 2020-12-18 北京百度网讯科技有限公司 Data storage method and server applied to a distributed server cluster
CN107870744A (en) 2017-10-27 2018-04-03 上海新储集成电路有限公司 Hybrid hard disk array storage system and method for asynchronous mirroring
US20190243578A1 (en) * 2018-02-08 2019-08-08 Toshiba Memory Corporation Memory buffer management for solid state drives
US10909030B2 (en) * 2018-09-11 2021-02-02 Toshiba Memory Corporation Enhanced trim command support for solid state drives
US10860483B2 (en) * 2019-04-30 2020-12-08 EMC IP Holding Company LLC Handling metadata corruption to avoid data unavailability
US11113188B2 (en) * 2019-08-21 2021-09-07 Microsoft Technology Licensing, Llc Data preservation using memory aperture flush order

Also Published As

Publication number Publication date
TWI737189B (en) 2021-08-21
TW202103155A (en) 2021-01-16
US20200272604A1 (en) 2020-08-27
US11263180B2 (en) 2022-03-01
TWI774388B (en) 2022-08-11
TW202134882A (en) 2021-09-16

Similar Documents

Publication Title
US10176190B2 (en) Data integrity and loss resistance in high performance and high capacity storage deduplication
JP6556911B2 (en) Method and apparatus for performing an annotated atomic write operation
US20220129420A1 (en) Method for facilitating recovery from crash of solid-state storage device, method of data synchronization, computer system, and solid-state storage device
WO2017190604A1 (en) Transaction recovery method in database system and database management system
US8738850B2 (en) Flash-aware storage optimized for mobile and embedded DBMS on NAND flash memory
US8874515B2 (en) Low level object version tracking using non-volatile memory write generations
US9009428B2 (en) Data store page recovery
CN108431783B (en) Access request processing method and device and computer system
EP2590078B1 (en) Shadow paging based log segment directory
US9778860B2 (en) Re-TRIM of free space within VHDX
WO2013174305A1 (en) SSD-based key-value type local storage method and system
US9146928B1 (en) Techniques for storing metadata of a filesystem in persistent memory
US20170124104A1 (en) Durable file system for sequentially written zoned storage
US20170123928A1 (en) Storage space reclamation for zoned storage
CN103617097A (en) File recovery method and file recovery device
US10977143B2 (en) Mirrored write ahead logs for data storage system
KR20200060220A (en) NVM-based file system and method for data update using the same
US11132145B2 (en) Techniques for reducing write amplification on solid state storage devices (SSDs)
KR20170054767A (en) Database management system and method for modifying and recovering data the same
US20170123714A1 (en) Sequential write based durable file system
US20230142948A1 (en) Techniques for managing context information for a storage device
WO2020088499A1 (en) Journaling overhead reduction with remapping interface
US10452496B2 (en) System and method for managing storage transaction requests
KR101966399B1 (en) Device and method on file system journaling using atomic operation
US20230409608A1 (en) Management device, database system, management method, and computer program product

Legal Events

Code Title Description
STPP Information on status: patent application and granting procedure in general
     Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general
     Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general
     Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general
     Free format text: FINAL REJECTION MAILED
STCB Information on status: application discontinuation
     Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION