CN112711492A - Firmware-based solid state drive block failure prediction and avoidance scheme - Google Patents

Firmware-based solid state drive block failure prediction and avoidance scheme

Info

Publication number
CN112711492A
Authority
CN
China
Prior art keywords
block
data
ssd
suspect
log data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011144198.2A
Other languages
Chinese (zh)
Inventor
N.埃亚西
崔昌皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/701,133 (US11567670B2)
Application filed by Samsung Electronics Co Ltd
Publication of CN112711492A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1068 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices in sector programmable memories, e.g. flash disk

Abstract

A Solid State Drive (SSD) is disclosed. The SSD may include flash memory for data, the flash memory organized into a plurality of blocks. A controller may manage reading data from and writing data to the flash memory. A metadata store may store device-based log data for errors in the SSD. Identification firmware may identify a suspect block in response to the device-based log data. In some embodiments of the inventive concept, verification firmware may determine whether the suspect block is predicted to fail in response to both precise block-based data and the device-based log data.

Description

Firmware-based solid state drive block failure prediction and avoidance scheme
Cross Reference to Related Applications
This application claims the benefit of U.S. Provisional Patent Application Serial No. 62/926,420, filed on October 25, 2019, which is incorporated herein by reference for all purposes.
Technical Field
The present inventive concept relates generally to storage devices and, more particularly, to providing fine-grained block failure prediction.
Background
A NAND flash Solid State Drive (SSD) failure in the field may cause a server to shut down, thereby compromising the performance and availability of data center level applications. To prevent such unexpected failures, systems employing SSDs typically use a simple threshold-based model and replace the drive before the failure occurs. Such protection mechanisms may result in a high false alarm rate or an inability to predict and avoid all SSD failures. Furthermore, in the event of a physical error, the SSD cannot recover from the error and thereby avoid a device failure.
There is still a need to provide fine-grained block failure prediction.
Disclosure of Invention
Drawings
FIG. 1 illustrates a system including a Solid State Drive (SSD) that may perform fine-grained block failure prediction according to an embodiment of the inventive concept.
FIG. 2 shows details of the machine of FIG. 1.
FIG. 3 shows details of the SSD of FIG. 1.
FIG. 4 illustrates example block-based data that may be used by the SSD of FIG. 1.
FIG. 5 illustrates device-based log data that may be used by the SSD of FIG. 1.
FIG. 6 illustrates the identification firmware and verification firmware of FIG. 3 operating to determine whether a particular block is expected to fail.
FIGS. 7A-7B illustrate a flowchart of an example process of determining whether a block is expected to fail according to an embodiment of the inventive concept.
Detailed Description
Reference will now be made in detail to embodiments of the present inventive concept, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present inventive concept. It will be appreciated, however, by one skilled in the art that the inventive concept may be practiced without such specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first module may be referred to as a second module, and similarly, a second module may be referred to as a first module, without departing from the scope of the inventive concept.
The terminology used in the description of the inventive concepts herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concepts. As used in the description of the inventive concept and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It is also to be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The components and features of the drawings are not necessarily to scale.
A firmware-based Solid State Drive (SSD) fault protection mechanism is proposed for early detection and isolation of errors. The fault protection mechanism may prevent drive failure, or at least avoid replacing the drive prematurely.
An SSD contains multiple flash memory chips, each containing many blocks. A block may contain any number of pages. The size of a page is typically several kilobytes, and is typically the smallest unit for reading and writing data to an SSD. The SSD controller (firmware) may include all the logic needed to service read and write requests, run wear leveling algorithms, and run error recovery processes.
Each SSD page may include Error Correction Code (ECC) metadata that the SSD controller may use to recover and repair a limited number of bit errors (typically 1-2 bit errors). However, if the number of bit errors due to hardware failures exceeds a certain number, the SSD controller may not be able to correct the errors and thus may provide the corrupted data to the host. If such failures occur multiple times, the entire device may need to be selected for replacement, which can incur high costs for the device manufacturer and compromise the performance and availability of the application as the server is subsequently shut down.
On the other hand, if an error occurs when data is written to a flash page (a program operation), the page will be marked as "failed" and will not be used anymore. Once a certain number of pages in a block have failed, the entire block is retired. SSDs typically reserve some spare blocks to replace such retired blocks. If the SSD runs short of available spare blocks (e.g., more than 90% of the spare blocks have been used), the device may need to be replaced.
In some cases, most of the blocks in the drive may be functioning properly (normal blocks), while only a small portion is faulty (bad blocks). If the read operation is for a bad block and often fails (reading corrupted data or a read failure due to a hardware failure), the entire drive may need to be replaced to prevent future failures and avoid data loss. However, if fine-grained block errors/failures can be predicted early and then avoided/recovered, bad blocks can be eliminated/retired, which will prevent the SSD from storing data on these blocks, thereby avoiding further failures and data corruption/loss.
Predicting fine-grained (block-level) errors/failures in SSDs (which have thousands of blocks) is not straightforward, and requires (i) storing a large amount of historical (time-series) data corresponding to each block, and (ii) processing/analyzing a very large data set to predict and avoid failures. With respect to the amount of history data required, whether such metadata information is stored in DRAM space on the SSD or in the flash memory itself, the amount of data to be stored increases as the failure history grows. Storing this information may result in high storage costs, and may even sacrifice a large portion of the capacity of the drive. Since storage devices contain only a limited amount of DRAM and are highly sensitive to their price per GB, the data storage requirements are not trivial, and sacrificing a large part of the device's storage capacity to store such failure time series is neither simple nor efficient.
With respect to the processing required to make the prediction, SSDs typically have limited processing power, which is used primarily for their internal operations (e.g., flash translation layer, wear leveling, and scheduling). It is simply not feasible to process large amounts of data inside an SSD to predict block-level failures/errors.
To address the above-described challenges with block-level failure prediction, embodiments of the present inventive concept take advantage of the temporal and spatial locality of physical errors in each block and/or in the pages within each block. Temporal locality refers to errors occurring frequently in the same physical page and/or block; spatial locality refers to errors occurring in adjacent physical portions (e.g., pages or blocks). By exploiting the locality of errors, only very limited data associated with a few past errors (rather than the entire error history of the device) needs to be used to predict block failures. The intuition behind this idea is that a page/block that has generated erroneous data is likely to generate errors in the future. Also, when a page in one block fails, adjacent pages in the same block are likely to generate errors because they are all in the same physical component.
Predicting block level failures
As mentioned above, predicting block-level failures is not easy due to its capacity and processing requirements. A simple approach is to use fine-grained historical log data corresponding to thousands of blocks to make accurate predictions, but this data set grows over time and may leave little space for user data. In contrast, embodiments of the inventive concept use a two-step identification and verification mechanism: first locate suspect blocks, then use a learning-based model to verify whether those blocks will fail.
First, to identify a suspect block using the locality of physical errors, only the most recent error information needs to be stored. For example, throughout the operation of the drive, only the last k entries of the error history (i.e., the k most recent events) are retained rather than the entire error history. Errors older than the last k entries may be discarded. Although only information about the most recent errors is stored, this limited information may help identify the suspect block because of the locality of errors. For example, if 10 of the past 100 errors occurred in a particular page within an identified block, this fact indicates that an error may occur in the same page or in its neighboring pages in the same block in the future. Thus, given information about the past k errors, a suspect block may be identified using data that is potentially orders of magnitude smaller than the entire error history of the device.
In the second step, although suspect blocks are likely to generate errors in the near future, simply designating them as defective blocks and retiring them may be inefficient. Such threshold-based identification mechanisms may not accurately capture failure information and may generate many false alarms, resulting in the retirement of properly functioning blocks and wasted drive capacity. To avoid such inaccurate, threshold-based prediction, after a suspect block is identified, a previously trained prediction model may be used to predict block failure more accurately.
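As a minimal sketch of the bounded error history described above, assuming Python and hypothetical names (ErrorEvent, RecentErrorLog) that do not appear in the disclosure, the last k events could be kept in a fixed-size queue so that older errors are dropped automatically:

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One drive-level error record (field names are illustrative)."""
    block_id: int
    page_id: int
    timestamp: float


class RecentErrorLog:
    """Keeps only the k most recent error events."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen discards its oldest entry once k entries are held
        self._events = deque(maxlen=k)

    def record(self, event: ErrorEvent) -> None:
        self._events.append(event)

    def errors_per_block(self) -> Counter:
        """Count how many of the retained events fall in each block."""
        return Counter(e.block_id for e in self._events)

    def __len__(self) -> int:
        return len(self._events)


log = RecentErrorLog(k=100)
for i in range(250):
    # Hypothetical workload: every 10th error lands in block 7
    block = 7 if i % 10 == 0 else 1000 + i
    log.record(ErrorEvent(block_id=block, page_id=i % 256, timestamp=float(i)))
```

In this made-up workload only the last 100 of the 250 events are retained, and block 7 accounts for 10 of them, which mirrors the "10 of the past 100 errors" example in the text.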
Obtaining block level parameters
Running a previously trained predictive model requires time series information about the suspect block to verify its failure. Tracking such fine-grained information may result in high capacity demands that may far exceed the capacity of the storage device. Rather, embodiments of the inventive concept extract and obtain some block-based log data (directly or with some modifications) from the available drive-based log data, according to the definition and interpretation of each parameter. In particular, to construct a parameter set for a suspect block, i.e., set S = {param1, param2, …}, and input it into the prediction module, log data can be divided into two categories:
(i) precise block-based log data: S_Block_Precise = {p1, p2, …}; and
(ii) approximate block-based log data: S_Block_Approx = {a1, a2, …}.
The set S may then be derived as S = S_Block_Precise ∪ S_Block_Approx, which is equivalent to S_Block_Precise + S_Block_Approx since the two sets are disjoint. For parameters directly related to the error/failure information (e.g., the number of read errors, write errors, and erase errors), the precise information may be stored per block. The amount of block-based data required is negligible (only a few megabytes are required for a 1TB SSD), and can be managed by SSDs that already contain several GB of DRAM space. Moreover, such data does not involve time series information and is only one counter per parameter per block.
To extract the time series log data, such information may be derived from the global drive-level error information maintained for the past k errors. Since the suspect block was selected based on the past k error events, its most recent error information is already present in the global drive-level error data. The most recent k error entries for the drive may contain accumulated error information for a block, obtained by adding the block's error counters to the new error data. Note that the counter for each block contains only accumulated error information, whereas the global error information contains complete data about the last k errors, which may include errors generated by the suspect block.
The approximate parameters of the block (i.e., S_Block_Approx) may be extracted from the drive-level error information. The intuition behind this idea is that some log information of the suspect block can be derived approximately from the drive-level parameters, since those parameters describe the state of the drive/block rather than error information. In other words, these parameters may be averaged across all blocks, and the average may then stand in for a single block. For example, certain parameters (e.g., "number of reads" and "number of writes") are based on the total number of reads and writes to the drive, are an indication of drive age, and may be averaged over all blocks to approximate the corresponding parameters of the suspect block.
By combining history-based drive information with counter-based block-level log data, a set of parameters for the suspect block may be generated and input into the prediction module. Then, if the prediction module raises a failure alarm for a suspect block, that block may be retired early to avoid further errors associated with that block and the consequent drive replacement. Thus, instead of maintaining time series data for each block, which may grow over time, only lightweight counters for each block need to be maintained. Furthermore, for time series drive information, only the last k error events need to be maintained, which account for only kilobytes of data. Through such optimization, the data set size and computational/processing requirements of fine-grained block-level failure prediction can be addressed. The amount of data required for the proposed mechanism is far less than the original block-level time-series log data, and the subsequent processing of such small amounts of data can be very fast and can be performed in real time.
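The construction of the parameter set S can be sketched as follows. This is an illustration in Python under stated assumptions: the function name build_parameter_set, the dictionary keys, and the "avg_" prefix are hypothetical, and the drive-level totals are simply divided by the number of blocks as the text suggests.

```python
def build_parameter_set(block_counters, drive_totals, num_blocks):
    """Combine precise per-block counters with drive-level averages.

    block_counters: precise error counters for the suspect block
    drive_totals:   drive-wide totals (e.g., reads and writes)
    num_blocks:     number of blocks in the drive
    """
    # S_Block_Precise: taken directly from the block's own counters
    precise = dict(block_counters)
    # S_Block_Approx: drive-level totals averaged over all blocks
    approx = {f"avg_{name}": total / num_blocks
              for name, total in drive_totals.items()}
    # The key sets are disjoint, so merging the dictionaries is the union
    return {**precise, **approx}


S = build_parameter_set(
    block_counters={"read_errors": 3, "write_errors": 1, "erase_errors": 0},
    drive_totals={"reads": 2_000_000, "writes": 1_000_000},
    num_blocks=1_000_000,
)
```

The values above are invented for illustration; the resulting dictionary holds three precise entries and two approximate ones.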
Required metadata and data structures
As previously described, only lightweight error information/counters for each block need be maintained. Assuming that the SSD contains n blocks, only n entries are needed. On the other hand, for drive level information, we retain only information of the past k error events. For each of the k error events, information may be stored about the physical location of the error (page, block), the time at which the error occurred (timestamp), the error counter for the block at that time, and the SMART log data on the SSD.
As previously mentioned, the overhead required by embodiments of the inventive concept is very low. Assuming that the storage capacity of an SSD is 1TB, each block is 256 pages, and the page size is 4 KB:
Number of pages = 1 TB / 4 KB ≈ 256,000,000
Number of blocks = 256,000,000 / 256 = 1,000,000
If there are three error attributes per block (counters for the number of read, write, and erase errors per block, each possibly a 4-byte integer), then the total memory space required for the block-level error data will be
1,000,000 blocks × 3 counters × 4 bytes = 12,000,000 bytes = 12 MB
For drive level information, assuming k is 100 (i.e., information about the last 100 error events is stored), each error event requires 1KB of memory space. Thus, the total capacity required for drive-level metadata is 100 KB. Therefore, the total memory overhead will be 12.1MB, which is negligible for SSDs containing several GB of DRAM space.
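The overhead estimate above can be reproduced with a few lines of arithmetic. The sketch below is in Python, uses the figures given in the text (decimal units, so 1 KB is taken as 1,000 bytes and 1 MB as 1,000,000 bytes), and the function name metadata_overhead_bytes is hypothetical.

```python
def metadata_overhead_bytes(num_blocks, counters_per_block=3,
                            bytes_per_counter=4, k=100,
                            bytes_per_event=1_000):
    """Return (block-level bytes, drive-level bytes, total bytes)."""
    block_level = num_blocks * counters_per_block * bytes_per_counter
    drive_level = k * bytes_per_event
    return block_level, drive_level, block_level + drive_level


num_pages = 256_000_000          # figure used in the text for 1 TB / 4 KB
num_blocks = num_pages // 256    # 256 pages per block
block_level, drive_level, total = metadata_overhead_bytes(num_blocks)
total_mb = total / 1_000_000
```

With these inputs the block-level counters take 12 MB, the drive-level log takes 100 KB, and the total is 12.1 MB.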
Note that the error log data of an SSD is typically specific to the firmware and device model. To illustrate, some parameters of log data that may be stored include critical warning, available spare, data units read, data units written, power cycles, power-on hours, unsafe shutdowns, media errors, warning composite temperature time, and critical composite temperature time. Embodiments of the inventive concept may also store other parameters.
Execution flow
When an error occurs in block i, the error counter for that block in the block-level metadata may be read and updated. The drive-level metadata may then be updated to reflect the new error event. The information stored in the drive-level metadata may include the parameters discussed above, such as the location of the error (page ID/block ID), the timestamp, etc.
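A minimal sketch of this update path, assuming Python, in-memory dictionaries in place of the metadata store, and hypothetical names (on_error, block_counters, recent_errors), might look like this:

```python
from collections import deque

K = 100                           # number of error events retained
block_counters = {}               # block_id -> per-type error counters
recent_errors = deque(maxlen=K)   # drive-level log of the last K events


def on_error(block_id, page_id, kind, timestamp, smart_snapshot=None):
    """Handle one error: update block counters, then the drive-level log."""
    counters = block_counters.setdefault(
        block_id, {"read": 0, "write": 0, "erase": 0})
    counters[kind] += 1
    recent_errors.append({
        "block_id": block_id,
        "page_id": page_id,
        "timestamp": timestamp,
        # Copy of the block's counters at the time of the error
        "counters": dict(counters),
        # Copy of drive-level log data, if the caller supplies any
        "smart": dict(smart_snapshot or {}),
    })


on_error(block_id=5, page_id=3, kind="read", timestamp=1.0)
on_error(block_id=5, page_id=4, kind="read", timestamp=2.0,
         smart_snapshot={"media_errors": 2})
```

Each logged event carries a snapshot of the block's counters, so the drive-level log alone is enough to reconstruct the recent error history of a suspect block.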
Identifying suspicious blocks
To identify a suspect block, the drive-level metadata table may be periodically scanned to see whether any block has produced repeated errors (by checking the block ID field in this table). This scanning may be performed at fixed time intervals (e.g., every minute), or may be performed after a certain number of errors are logged (e.g., after each error, after every five errors, etc.). If multiple past errors have occurred in the same block, the block may be added to the suspect block pool: the SSD may then temporarily avoid using it to store data (although it may still be read, since it may contain valid data). More specifically, a particular block is marked as "suspect" if the number of events among the most recent k errors that correspond to that block is above a threshold.
There are two different ways to set the threshold:
(1) A static threshold α is defined. When the number of error events corresponding to a particular block exceeds α% of the last k errors, the block is marked as suspect. The threshold parameter α may be adjusted based on protection/reliability level requirements. For example, setting α to 10 means that if more than 10% of the last k error events relate to a particular block ID, the block is marked as suspect. Alternatively, α may be a fixed number instead of a percentage: that is, setting α to 10 means that if 10 or more of the last k error events involve a particular block ID, the block is marked as suspect.
(2) A threshold based on the average is defined. Such a threshold may be obtained by averaging the total number of errors (in the device log) over all blocks in the drive. A suspect block identification decision can be made based on this threshold (either directly or implicitly): a particular block may be marked as suspect if it has experienced more than its share of errors in the last k error events. For example, assume that a device with a total of 256,000 blocks experiences a total of 100 errors. The ratio of the number of errors to the number of blocks is 100/256,000 = 1/2,560. If a block encounters more than this number of errors, the block may be marked as suspect.
Note that until the number of errors is roughly comparable to the number of blocks, even a single error may cause a block to be marked as suspect. To prevent every error from marking a block as suspect, the average-based threshold may be scaled up (or down) by any desired factor. Thus, for example, the average-based threshold may be multiplied by a number (e.g., 10,000) to produce a threshold that is effectively greater than one. The scaling value may also vary over time or in response to the number of errors to prevent the average-based threshold from becoming too large.
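The two threshold styles can be compared with a short sketch. It is written in Python, the function names suspects_static and suspects_average are hypothetical, and the input is simply the list of block IDs found in the last k error events.

```python
from collections import Counter


def suspects_static(block_ids, alpha_percent):
    """Blocks whose share of the last k errors exceeds alpha percent."""
    k = len(block_ids)
    counts = Counter(block_ids)
    # Integer comparison avoids floating-point rounding at the boundary
    return {b for b, n in counts.items() if n * 100 > alpha_percent * k}


def suspects_average(block_ids, total_errors, num_blocks, scale=1.0):
    """Blocks with more errors than the (scaled) per-block average."""
    threshold = scale * total_errors / num_blocks
    counts = Counter(block_ids)
    return {b for b, n in counts.items() if n > threshold}


# Hypothetical last k = 100 events: 12 in block 7, the rest one per block
last_k = [7] * 12 + list(range(100, 188))

static_hits = suspects_static(last_k, alpha_percent=10)
unscaled_hits = suspects_average(last_k, total_errors=100, num_blocks=256_000)
scaled_hits = suspects_average(last_k, total_errors=100, num_blocks=256_000,
                               scale=10_000)
```

With the unscaled average (1/2,560) every block that logged a single error is flagged, whereas scaling by 10,000 raises the threshold to about 3.9 and leaves only block 7, which matches the motivation for scaling given above.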
Prediction
Once a suspect block is identified, a set of parameters (set S) corresponding to the suspect block may be generated and input into the failure prediction module. As described above, a portion of S may be based on block-level error information, and a portion of S may be derived from drive-level log information, which may be extracted from the drive-level metadata stored for the past k errors (which reflects an average over all blocks, serving as an estimate of the block-level data). Any algorithm may then be used to process the data and determine whether the block is actually predicted to fail. Example algorithms that may be used include logistic regression and random forest algorithms. If the prediction indicates that the block may fail in the future, the block may be retired by first copying its valid data into other blocks and then removing the suspect block from the list of available blocks. To minimize the processing power required by the prediction module, the prediction module need not execute for all blocks, nor need it execute continuously. Instead, the prediction module may be triggered only for a suspect block, and only when the block is identified as suspect.
As described above, any desired prediction module that uses some time series data to predict an event may be selected. Examples of prediction modules include machine learning based failure prediction models (such as random forests, logistic regression, outlier detection, anomaly detection, etc.) that have been previously trained and whose prediction information (e.g., optimized weights) has been embedded in the drive firmware. Thus, after receiving past error information, the model can predict the failure probability of a particular block by running a lightweight computation.
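To illustrate how lightweight such a computation can be, the following Python sketch evaluates a logistic regression model with weights fixed in advance and retires a block when the predicted probability crosses a cutoff. The weights, bias, cutoff, and all names here are hypothetical placeholders, not values from the disclosure; a real model would be trained offline.

```python
import math

# Hypothetical pre-trained parameters embedded as constants
WEIGHTS = {"read_errors": 0.8, "write_errors": 1.2, "erase_errors": 1.5}
BIAS = -6.0


def failure_probability(features):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def verify_and_retire(block_id, features, available_blocks, relocate,
                      cutoff=0.5):
    """Retire the block if it is predicted to fail; return True if retired."""
    if failure_probability(features) < cutoff:
        return False
    relocate(block_id)                  # copy valid data elsewhere first
    available_blocks.discard(block_id)  # then stop allocating the block
    return True


available = {1, 2, 3}
moved = []
retired_2 = verify_and_retire(
    2, {"read_errors": 5, "write_errors": 2, "erase_errors": 0},
    available, relocate=moved.append)
retired_3 = verify_and_retire(
    3, {"read_errors": 1}, available, relocate=moved.append)
```

The function is only called for blocks already flagged as suspect, so the cost of the prediction step stays proportional to the number of suspect blocks rather than the number of blocks in the drive.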
FIG. 1 illustrates a system including a Solid State Drive (SSD) that may perform fine-grained block failure prediction according to an embodiment of the inventive concept. In FIG. 1, machine 120 may include processor 105, memory 110, and Solid State Drive (SSD) 115. The processor 105 may be any kind of processor: for example, an Intel Xeon, Celeron, Itanium, or Atom processor, an AMD Opteron processor, an ARM processor, etc. While FIG. 1 shows a single processor 105, machine 120 may include any number of processors, each of which may be a single-core or multi-core processor, and which may be mixed in any desired combination.
The processor 105 may be coupled to a memory 110. The memory 110 may be any kind of memory, such as flash memory, Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Persistent Random Access Memory, Ferroelectric Random Access Memory (FRAM), or Non-Volatile Random Access Memory (NVRAM) such as Magnetoresistive Random Access Memory (MRAM), etc. The memory 110 may also be any desired combination of different memory types and may be managed by the memory controller 125. The memory 110 may be used to store data that may be referred to as "short-term": that is, data that is not expected to be stored for a long time. Examples of short-term data may include temporary files, data used locally by applications (which may have been copied from other storage locations), and so forth.
The processor 105 and memory 110 may also support an operating system under which various applications may run. These applications may issue requests to read data from or write data to memory 110 or SSD 115. SSD 115 may be used, for example, to store initial parameters (or a range of values for the initial parameters, and the type of behavior that the range of values represents) used to initialize the simulation. The SSD 115 may be accessed using a device driver 130. Although FIG. 1 illustrates SSD 115, embodiments of the inventive concept may include other storage device formats that may benefit from fine-grained block failure prediction: any reference below to "SSD" should be understood to encompass such other embodiments of the inventive concept.
FIG. 2 shows details of the machine of FIG. 1. In FIG. 2, machine 120 typically includes one or more processors 105, which may include a memory controller 125 and a clock 205, which may be used to coordinate the operation of the components of the machine. Processor 105 may also be coupled to memory 110, which may include Random Access Memory (RAM), Read-Only Memory (ROM), or other state-preserving media, to name a few examples. The processor 105 may also be coupled to a storage device 115 and to a network connector 210, which may be, for example, an Ethernet connector or a wireless connector. The processor 105 may also be connected to a bus 215, to which may be attached, among other components, a user interface 220 and input/output interface ports that may be managed using an input/output engine 225.
FIG. 3 shows details of the SSD of FIG. 1. In FIG. 3, SSD 115 may include host interface logic 305, which may provide an interface between SSD 115 and a host computer (e.g., machine 120 of FIG. 1). The SSD 115 may also include an SSD controller 310 and various channels 315-1, 315-2, 315-3, and 315-4, along which various flash memory chips 320-1, 320-2, 320-3, 320-4, 320-5, 320-6, 320-7, and 320-8 may be arranged. Although FIG. 3 shows four channels and eight flash chips, one skilled in the art will recognize that there may be any number of channels including any number of flash chips.
Within each flash chip, the space may be organized into blocks, which may be further subdivided into pages. For example, flash chip 320-7 is shown to include blocks 1 through n (identified as blocks 325 and 330), each of which may contain pages numbered 1 through m. Although there may be multiple pages assigned the same number (e.g., page 1) in multiple blocks, the combination of the page Identifier (ID) and the block ID may uniquely identify a particular page in the flash chip 320-7. (alternatively, a combination of page ID, block ID, and flash chip ID may uniquely identify a page in SSD 115.)
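The addressing scheme just described can be sketched in Python as a tuple of identifiers that maps to a single drive-wide page number. The names PageAddress and flat_index are hypothetical, and the identifiers are zero-based here for simplicity, whereas the text numbers blocks and pages from 1.

```python
from typing import NamedTuple


class PageAddress(NamedTuple):
    chip_id: int
    block_id: int
    page_id: int


def flat_index(addr, blocks_per_chip, pages_per_block):
    """Map (chip, block, page) to a single drive-wide page number."""
    block_number = addr.chip_id * blocks_per_chip + addr.block_id
    return block_number * pages_per_block + addr.page_id


# Same page ID in two different blocks of the same chip
first = PageAddress(chip_id=6, block_id=0, page_id=1)
second = PageAddress(chip_id=6, block_id=1, page_id=1)
index_first = flat_index(first, blocks_per_chip=1000, pages_per_block=256)
index_second = flat_index(second, blocks_per_chip=1000, pages_per_block=256)
```

Although both addresses share page ID 1, the block ID makes the resulting indices distinct, which is the property the paragraph relies on.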
The reason for the distinction between blocks and pages is how the SSD handles read, write, and erase operations. The page is typically the smallest unit of data that can be read or written on the SSD. The page size may vary as desired: for example, one page may be 4KB of data. If less than the entire page of content is to be written, the excess space is "unused".
However, while pages can be written and read, SSDs typically do not allow data to be overwritten: that is, existing data may not be "replaced" with new data. Instead, when data is to be updated, the new data will be written to a new page on the SSD, and the original page will be invalid (marked erasable). Accordingly, an SSD page typically has one of three states: idle (ready to write), valid (containing valid data) and invalid (no longer containing valid data, but not available until erased) (the exact names of these states may differ).
However, the block is an elementary data unit that can be erased, although pages can be written and read separately. That is, pages are not erased individually: all pages in a block will be erased at the same time. For example, if a block contains 256 pages, all 256 pages in the block will be erased at the same time. This arrangement may cause some management problems for the SSD: if the block selected for erasure still contains some valid data, the valid data may need to be copied to a free page elsewhere on the SSD before erasing the block. (in some embodiments of the inventive concept, the unit of erase may be other than a block: e.g., it may be a super block: a collection of blocks.)
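The page states and the copy-before-erase step can be modeled with a small Python sketch. The class and function names (PageState, Block, reclaim) are hypothetical, and the model ignores wear, timing, and address mapping.

```python
from enum import Enum


class PageState(Enum):
    FREE = 0      # erased and ready to be written
    VALID = 1     # holds current data
    INVALID = 2   # holds stale data; unusable until the block is erased


class Block:
    def __init__(self, num_pages):
        self.states = [PageState.FREE] * num_pages
        self.data = [None] * num_pages

    def write(self, payload):
        """Program the first free page; pages are never overwritten."""
        index = self.states.index(PageState.FREE)  # ValueError if full
        self.states[index] = PageState.VALID
        self.data[index] = payload
        return index

    def invalidate(self, index):
        self.states[index] = PageState.INVALID

    def erase(self):
        """Erase every page in the block at once."""
        self.states = [PageState.FREE] * len(self.states)
        self.data = [None] * len(self.data)


def reclaim(victim, destination):
    """Copy valid pages out of the victim block, then erase it."""
    for state, payload in zip(victim.states, victim.data):
        if state is PageState.VALID:
            destination.write(payload)
    victim.erase()


a, b = Block(4), Block(4)
old = a.write("v1")
a.write("other")
a.invalidate(old)   # updating "v1" invalidates the old page ...
a.write("v2")       # ... and writes the new version to a new page
reclaim(a, b)
```

After reclaim, block a is entirely free again and the two pages that were still valid have been preserved in block b.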
SSD controller 310 can include flash translation layer 335, metadata storage 340, identification firmware 345, and verification firmware 350. The flash translation layer 335 may handle translation between logical block addresses (as used by the processor 105 of FIG. 1) and the physical block addresses where data is stored in the flash chips 320-1 through 320-8. Metadata storage 340 may store metadata information used by SSD115 in performing fine-grained block failure prediction. Identification firmware 345 may use the metadata information stored in metadata storage 340 to identify blocks suspected of being likely to fail soon: verification firmware 350 may then again use the metadata information stored in metadata storage 340 to determine whether the suspect block is in fact predicted to fail. The identification firmware 345 and verification firmware 350 may be executed using a processor (not shown in FIG. 3) that may be part of the SSD115: for example, using processing power inherent to SSD controller 310.
FIG. 4 illustrates example block-based data that may be used by SSD115 of FIG. 1. In FIG. 4, block-based data 405 may include data for each block, which may be stored in metadata storage 340 of FIG. 3. For example, FIG. 4 shows data for blocks 1 through n, although data for any number of blocks (up to data for every block in SSD115 of FIG. 1) may be included. The data for each block may include counters 410-1, 410-2, and 410-3, which may store the number of read errors, the number of write errors, and the number of erase errors that have occurred for the respective block. Note that counters 410-1, 410-2, and 410-3 may be cumulative since the SSD115 of FIG. 1 was manufactured: the block-based data 405 may therefore also be referred to as precise block-based data, in contrast to other data discussed below with reference to FIG. 5.
Each of counters 410-1, 410-2, and 410-3 may require 4 bytes per counter. Since each of the counters 410-1, 410-2, and 410-3 includes three counters (each for the number of read errors, write errors, and erase errors), a total of 12 bytes may be used to store each of the counters 410-1, 410-2, and 410-3. Multiplying 12 bytes by the number of blocks on SSD115 of fig. 1 may calculate the overhead imposed by block-based data 405.
For example, consider an SSD that provides a total of 1TB of storage capacity, where each block contains 256 pages and each page contains 4KB of data. 268,435,456 pages are required to store 1TB of data in 4KB pages. With 256 pages per block, the SSD contains 1,048,576 blocks in total. With three counters totaling 12 bytes per block, the block-based data 405 requires a total of about 12MB of storage space, only a little more than one-thousandth of one percent of the total storage space provided by the SSD.
Note that counters 410-1, 410-2, and 410-3 indicate the number of errors that have occurred in each block. These errors may be concentrated in one page or a few pages in the block, or they may be spread among many pages in the block. In this manner, block-based data 405 captures some spatial locality among the errors, since a page that has had one error is more likely to have further errors, as are other nearby pages (as compared to pages in other blocks).
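The arithmetic of this example can be checked with a short calculation. This is only a worked sketch of the figures given above, assuming binary units (1TB taken as 2^40 bytes and 4KB as 4,096 bytes); the variable names are illustrative.

```python
CAPACITY_BYTES = 1024 ** 4   # 1TB, assuming binary units
PAGE_BYTES = 4 * 1024        # 4KB per page
PAGES_PER_BLOCK = 256
BYTES_PER_COUNTER = 4
COUNTERS_PER_BLOCK = 3       # read, write, and erase error counters

pages = CAPACITY_BYTES // PAGE_BYTES                          # 268,435,456
blocks = pages // PAGES_PER_BLOCK                             # 1,048,576
overhead = blocks * BYTES_PER_COUNTER * COUNTERS_PER_BLOCK    # 12,582,912 bytes (12MB)
overhead_percent = 100 * overhead / CAPACITY_BYTES            # roughly 0.0011 percent

print(pages, blocks, overhead, overhead_percent)
```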
FIG. 5 illustrates device-based log data that may be used by SSD115 of FIG. 1. In FIG. 5, device-based log data 505 is shown. The device-based log data 505 may include data regarding a particular error that has occurred on the SSD115 of FIG. 1, and the device-based log data 505 may be stored in the metadata storage 340 of FIG. 3. However, rather than storing data for all errors that have occurred on SSD115 of FIG. 1, device-based log data 505 may store data for the most recent k errors that have occurred on SSD115 of FIG. 1. Any older errors may be discarded. Thus, errors 1 through k may not be the first k errors that occurred on SSD115 of fig. 1, but may be the most recently occurring k errors (earlier errors have been previously discarded). k can be any desired value: larger values provide more information that can be used to determine whether a particular block is predicted to fail, but at the cost of requiring more data to be stored (thereby increasing overhead).
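One simple way to keep only the most recent k errors is a fixed-size buffer that drops its oldest entry when a new one arrives. The following is a minimal sketch under that assumption; the function names and the dictionary fields are hypothetical and are not taken from FIG. 5.

```python
from collections import deque


def make_error_log(k):
    """Create a log that retains only the k most recent entries."""
    return deque(maxlen=k)


def record_error(log, block_id, page_id, timestamp):
    """Append a new error; once the log is full, the oldest entry is discarded."""
    log.append({"block_id": block_id, "page_id": page_id, "timestamp": timestamp})


log = make_error_log(3)
for t in range(1, 6):
    record_error(log, block_id=t, page_id=0, timestamp=t)
# Only the entries with timestamps 3, 4, and 5 remain in the log.
```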
Various data may be stored for each error currently being tracked. For example, as shown in errors 510-1, 510-2, and 510-3, the IDs of the page and block, the time at which the error occurred, the error counter for the block (i.e., the value of the exact block-based data 405 of FIG. 4 for the block in which the error occurred at the time the error occurred), the timestamp at the time the error occurred, and other log data (e.g., as shown in SMART log data 515) may be stored together. The data shown in FIG. 5 of the device-based log data 505 represents one embodiment of the inventive concept: other embodiments may include more, less, or other data than shown in fig. 5, without limitation.
In contrast to the precise block-based data 405 of FIG. 4, the device-based log data 505 may be used to derive approximate block-based data. Since device-based log data 505 stores only information about the most recent k errors on SSD115 of FIG. 1, device-based log data 505 (and the approximate block-based data derived therefrom) captures some temporal locality in the errors, thereby allowing identification of blocks that have experienced more recent errors than other blocks.
Because data is stored for only the most recent k errors, the overhead required to store the device-based log data 505 can be calculated by multiplying the size of the data stored for one error by the number of errors for which data is stored. For example, if data for the 100 most recent errors is stored, and the amount of storage per error is 1KB, the total amount of storage required to store the device-based log data is 100KB. Again, this storage overhead is only a small fraction of one percent of the overall size of the SSD115 of FIG. 1 (approximately one hundred-thousandth of one percent of a 1TB SSD).
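The log overhead in this example can be worked through in the same way. The sketch below assumes binary units (1KB as 1,024 bytes, 1TB as 2^40 bytes); the variable names are illustrative.

```python
K = 100                       # number of most recent errors retained
BYTES_PER_ERROR = 1024        # 1KB of log data per error
SSD_CAPACITY = 1024 ** 4      # 1TB, assuming binary units

log_overhead = K * BYTES_PER_ERROR                     # 102,400 bytes (100KB)
log_percent = 100 * log_overhead / SSD_CAPACITY        # roughly 0.00001 percent

print(log_overhead, log_percent)
```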
The value of k may be preset. Then, the value of k may remain constant over the lifetime of SSD115 of fig. 1. Alternatively, k may be configurable and may vary over time according to user preferences. For example, as SSD115 of fig. 1 grows in usage time, more error information may be desired.
FIG. 6 illustrates the operation of the identification firmware 345 and verification firmware 350 of FIG. 3 to determine whether a particular block is expected to fail. In FIG. 6, the identification firmware 345 may receive device-based log data 505 (which may include SMART log data 515, not shown in detail in FIG. 6). Identification firmware 345 may then identify the block in which each logged error occurred. If the number of errors in a particular block among the most recent k errors exceeds a certain threshold, the block in question may be suspected of being about to fail. Thus, the identification firmware 345 may generate approximate block-based data 605 from the device-based log data 505.
Any desired threshold may be used to determine whether a particular block is suspected of imminent failure. For example, a predetermined, user-specified threshold may be set, wherein a particular block is suspected of failing soon if the number of errors in that block among the most recent k errors is greater than the threshold. The threshold may be a number (e.g., 10 of the last k errors) or a percentage (e.g., 10% of the last k errors). The threshold may also be adjusted according to the number of errors that have actually occurred. For example, if the threshold is set to a percentage of the total number of errors, the block that encounters the very first error would automatically be suspected, as 100% of the errors would be associated with that block. To avoid this result, the identification firmware 345 may not run until the number of errors that have occurred in the SSD115 of FIG. 1 exceeds some other value: this prevents the identification firmware 345 from prematurely identifying blocks as suspected of failing.
Another threshold that may be used is the average number of errors per block across the SSD. That is, the total number of errors that have occurred (since the SSD was manufactured) may be calculated and divided by the total number of blocks in the SSD. Any block that has too many errors relative to that average may be suspected of failing soon. A percentage of this average may also be used. Also, the average (or its use) may be adjustable. For example, until the number of errors experienced by the SSD approaches the number of blocks in the SSD, any block that experiences a single error will have more errors than the average and would automatically be considered suspect by the identification firmware 345. To avoid this, the identification firmware 345 may not begin to consider whether blocks are suspected of failing soon until the number of errors exceeds some predetermined value. Alternatively, the identification firmware 345 may calculate the relative percentage of errors (relative to k) that occurred in a particular block and compare this value to the average number of errors per block across the SSD: if the block's share of the most recent k errors exceeds the average, then the identification firmware 345 may suspect that the block will soon fail.
In other embodiments of the inventive concept, identification firmware 345 may suspect that a block is about to fail not because it has recently experienced a larger share of errors, but rather based on an overall error count. For example, assume that k is chosen to be 100 (i.e., the device-based log data stores only the 100 most recent errors). If every 50th error occurs in a particular block, the block may not be considered suspect based on its share of the most recent k errors, since it accounts for only 2 of the 100 logged errors. But over the history of the device, the block has had one of every 50 errors, which may mean that the block has experienced more errors in total than any other block. Thus, the identification firmware 345 may examine the precise block-based data 405 of FIG. 4 for blocks suspected of failing soon, and a block whose total error count exceeds a certain threshold may be identified as a suspect block even if the block does not exceed the threshold applied to the approximate block-based data 605.
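As a rough illustration of deriving approximate block-based data from the log, the recent errors can be tallied per block and each tally compared against a threshold. The function names and the "block_id" field are assumptions for this sketch, not part of the described firmware.

```python
from collections import Counter


def approximate_block_data(error_log):
    """Count how many of the logged (most recent k) errors fall in each block."""
    return Counter(entry["block_id"] for entry in error_log)


def suspect_blocks(error_log, threshold):
    """Return the blocks whose recent error count exceeds the threshold."""
    counts = approximate_block_data(error_log)
    return sorted(block for block, n in counts.items() if n > threshold)
```

For example, with three logged errors in block 7 and one in block 2, a threshold of 2 would flag only block 7.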
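The count-based and percentage-based thresholds, together with the guard against running too early, might be expressed as below. This is a sketch only: the function name, its keyword parameters, and the min_errors guard value are hypothetical.

```python
def exceeds_threshold(block_errors, logged_errors, *, count=None, percent=None,
                      min_errors=0):
    """Decide whether a block's share of the logged errors makes it suspect.

    block_errors  -- errors in this block among the logged errors
    logged_errors -- number of errors currently in the log (at most k)
    count         -- absolute threshold, e.g. 10 of the last k errors
    percent       -- relative threshold, e.g. 10 for 10% of the last k errors
    min_errors    -- do not flag anything until this many errors are logged
    """
    if logged_errors < min_errors:
        return False  # too few errors for the comparison to be meaningful
    if count is not None:
        return block_errors > count
    if percent is not None:
        # Integer comparison avoids floating-point rounding at the boundary.
        return block_errors * 100 > percent * logged_errors
    raise ValueError("either count or percent must be given")
```

Without the guard, the first error ever logged would make its block suspect under any percentage threshold; with min_errors set to, say, 20, that block is not flagged.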
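The average-based threshold can be sketched in a similar way. The function name, the factor parameter (a multiple of the average), and the min_total_errors guard are assumptions for illustration.

```python
def suspects_by_average(total_errors_per_block, factor=1.0, min_total_errors=0):
    """Flag blocks whose lifetime error count exceeds factor * (average per block).

    total_errors_per_block maps every block ID to its cumulative error count.
    Nothing is flagged until at least min_total_errors errors have occurred.
    """
    if not total_errors_per_block:
        return []
    total = sum(total_errors_per_block.values())
    if total < min_total_errors:
        return []
    average = total / len(total_errors_per_block)
    return sorted(block for block, n in total_errors_per_block.items()
                  if n > factor * average)
```

With four blocks and a single error in one of them, the average is 0.25, so that block is flagged unless the guard is set high enough to suppress the check.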
In some embodiments of the inventive concept, the identification firmware 345 may examine the accurate block-based data 405 of FIG. 4 without regard to the device-based log data 505: in such embodiments of the inventive concept, identification firmware 345 may check the total error count for each block in SSD115 of FIG. 1. In other embodiments of the inventive concept, identification firmware 345 may examine the accurate block-based data 405 of fig. 4, only for blocks that have experienced one (or more) of the most recent k errors: in such embodiments of the inventive concept, the identification firmware 345 may consider the exact block-based data 405 of FIG. 4 in conjunction with the device-based log data 505.
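The second variant, in which cumulative counters are consulted only for blocks that appear among the most recent k errors, might look like the following sketch. The function name and the tuple layout (read, write, erase) are hypothetical.

```python
def suspects_from_totals(precise_counters, recent_block_ids, total_threshold):
    """Check lifetime totals, but only for blocks seen in the recent error log.

    precise_counters maps block ID -> (read_errors, write_errors, erase_errors).
    recent_block_ids lists the block of each logged error (duplicates allowed).
    """
    return sorted(block for block in set(recent_block_ids)
                  if sum(precise_counters[block]) > total_threshold)
```

Passing every block ID as recent_block_ids gives the first variant, where the total error count of each block in the SSD is checked regardless of the log.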
Regardless of the particular method used, the identification firmware 345 may operate simply by comparing two values to determine whether a particular block is considered suspect. This process simplifies the identification of suspect blocks.
The identification firmware 345 may operate according to any desired schedule. The identification firmware 345 may operate at regular intervals: for example, every minute, every 10 minutes, or every day (smaller or larger intervals are also possible). Alternatively, the identification firmware 345 may operate after a certain number of errors have occurred: for example, after every error or after every fifth error (other numbers of errors are possible).
Note that although identification firmware 345 is described as using device-based log data 505 in determining whether a block is suspect, embodiments of the inventive concept may use the precise block-based data 405 of FIG. 4 instead of, or in addition to, device-based log data 505. For example, identification firmware 345 may simply identify a block as a suspect block based on the sum of counters 410-1 of FIG. 4 exceeding a predetermined threshold number.
Verification firmware 350 may be invoked once identification firmware 345 has identified a block suspected of failing soon. Verification firmware 350 may use the precise block-based data 405 of FIG. 4 (in particular, the counters 410-1 that apply to the block suspected of failing soon) and the approximate block-based data 605 to determine whether a block identified as suspect by identification firmware 345 is actually predicted to fail soon. Verification firmware 350 may use any desired method to make this determination. For example, verification firmware 350 may implement a machine learning based failure prediction model, such as random forest, logistic regression, outlier detection, anomaly detection, etc., which may be trained in advance and whose information for prediction (e.g., optimized weights) is already embedded in verification firmware 350. Verification firmware 350 may then produce a result 610 indicating whether the block suspected of imminent failure by identification firmware 345 is actually predicted to fail soon.
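To illustrate what a pre-trained model with embedded weights could look like, the following sketch evaluates a logistic regression on four features of a suspect block. The weights, bias, feature choice, and cutoff are invented for this example and do not come from any trained model or from the embodiments.

```python
import math

# Hypothetical weights, as if trained offline and embedded in the firmware.
WEIGHTS = (0.02, 0.05, 0.10, 0.30)  # read, write, erase, recent-error features
BIAS = -4.0


def failure_probability(read_errs, write_errs, erase_errs, recent_errs):
    """Logistic regression over precise counters plus the recent error count."""
    features = (read_errs, write_errs, erase_errs, recent_errs)
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def predicted_to_fail(read_errs, write_errs, erase_errs, recent_errs, cutoff=0.5):
    """Return True when the modeled failure probability reaches the cutoff."""
    return failure_probability(read_errs, write_errs, erase_errs, recent_errs) >= cutoff
```

Inference of this kind is only a handful of multiplications and one exponential per block, which is consistent with running it inside the drive for one suspect block at a time.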
Although fig. 6 shows the identification firmware 345 as generating approximate block-based data 605 from the device-based log data 505, embodiments of the inventive concept may have other components to generate the approximate block-based data 605. For example, verification firmware 350 may take device-based log data 505 and generate the approximate block-based data 605 itself.
Note that the identification firmware 345 and the verification firmware 350 have different functions. The identification firmware 345 only identifies blocks suspected of failing soon. The identification firmware 345 may be used alone (i.e., each block suspected of imminent failure may simply be assumed to be about to fail). However, this approach may result in many blocks being retired that could still have operated normally for a long time. The identification firmware 345 may be considered similar to the police arresting a suspect for a crime: the fact that a suspect has been arrested does not automatically imply that the suspect is guilty.
On the other hand, the verification firmware 350 may be considered similar to a criminal trial, returning a verdict of guilty or innocent. The verification firmware 350 makes the final determination as to whether a particular block should actually be retired from use. Taking the additional step of verifying that the block is actually ready to be retired may avoid premature retirement of the block.
It is also worth noting what computations are actually needed both to identify a block as a suspect block and to verify that the block is ready to be retired. A block may be identified as suspect simply by comparing the number of errors that have occurred in that block against a threshold. This calculation is typically very fast and easy to perform, so no complex up-front analysis is needed to decide whether a block deserves a closer look as a candidate for retirement.
Verification firmware 350 may involve more computation than identification firmware 345. However, verification firmware 350 may be executed only after a block is identified as suspect. This condition prevents verification firmware 350 from being executed repeatedly for many blocks, which, as described above, may exceed the available computing resources of SSD115 of FIG. 1. Rather than continually checking each block to determine whether any should be retired, verification firmware 350 is preferably invoked as needed for a single suspect block. Thus, the use of both the identification firmware 345 and the verification firmware 350 achieves the goal of providing fine-grained block failure prediction without imposing the heavy computational requirements otherwise associated with fine-grained block failure prediction.
FIGS. 7A-7B illustrate a flowchart of an example process of determining whether a block is expected to fail according to an embodiment of the inventive concept. In FIG. 7A, at block 705, SSD115 of FIG. 1 may track errors that have occurred in blocks 325 and 330 of FIG. 3. At block 710, the SSD115 of FIG. 1 may store the device-based log data 505 of FIG. 5 in the metadata storage 340 of FIG. 3. At block 715, the SSD115 of FIG. 1 may discard the device-based log data 505 of FIG. 5 for the oldest error. If there is no device-based log data 505 of FIG. 5 for an oldest error to discard, block 715 may be omitted, as indicated by dashed line 720. At block 725, SSD115 of FIG. 1 may store the precise block-based data 405 of FIG. 4 in the metadata storage 340 of FIG. 3.
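The division of labor between the two stages can be sketched as a cheap filter followed by a costly check applied only to the survivors. The function name is hypothetical, and the verify callable stands in for whatever prediction model is used.

```python
def predict_failures(recent_counts, threshold, verify):
    """Two-stage prediction: inexpensive identification, then costly verification.

    recent_counts maps block ID -> errors among the most recent k errors.
    verify is a callable invoked only for blocks that pass the threshold test.
    """
    # Stage 1: a single comparison per block that appears in the log.
    suspects = sorted(block for block, n in recent_counts.items() if n > threshold)
    # Stage 2: the expensive model runs only for the suspect blocks.
    return [block for block in suspects if verify(block)]
```

In this arrangement the number of calls to verify equals the number of suspect blocks, not the number of blocks in the drive.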
At block 730, SSD115 of fig. 1 may derive the approximate block-based data 605 of fig. 6. As discussed above with reference to fig. 6, the approximate block-based data 605 of fig. 6 may be derived by the identification firmware 345 of fig. 3, the verification firmware 350 of fig. 3, or some other component of the SSD115 of fig. 1 (e.g., by the SSD controller 310 of fig. 3).
At block 735, the identification firmware 345 of FIG. 3 may identify a block suspected of imminent failure. As discussed above with reference to FIG. 6, the identification firmware 345 may identify blocks using the approximate block-based data 605 of FIG. 6, the device-based log data 505 of FIG. 5, or other data. At block 740, verification firmware 350 of FIG. 3 may verify whether the suspect block is in fact predicted to fail. As discussed above with reference to FIG. 6, the verification firmware 350 of FIG. 3 may make this determination using the approximate block-based data 605 of FIG. 6, the precise block-based data 405 of FIG. 4, the device-based log data 505 of FIG. 5, or other data, and may perform this determination using any desired algorithm (e.g., a machine learning-based failure prediction model, which may use a random forest algorithm, a logistic regression algorithm, an outlier detection algorithm, an anomaly detection algorithm, or any other desired algorithm).
At block 745, the verification firmware 350 of FIG. 3 may determine whether the suspect block is in fact predicted to fail soon. If so, then at block 750, the verification firmware 350 of FIG. 3 may retire the suspect block. Retiring a suspect block may include copying any valid data currently stored in the block to other blocks (and updating any tables identifying where the data is stored) and marking the block so that SSD115 of FIG. 1 does not write any new data to the block. For example, the verification firmware 350 of FIG. 3 may mark the block as containing invalid data, but in some way prevent any garbage collection logic from selecting that block for garbage collection.
At this point, control may return to any of several points, regardless of whether the verification firmware 350 of FIG. 3 has retired the suspect block. Control may return to block 705 to track new errors occurring in SSD115 of FIG. 1, as indicated by dashed line 755. Alternatively, control may return to block 730 to scan SSD115 of FIG. 1 for a new block suspected of imminent failure, as shown by dashed line 760. The former approach may be used in a system that scans for suspect blocks after a predetermined number of errors have occurred. The latter approach may be used in systems that scan for suspect blocks after a predetermined time interval has elapsed. Control may also end completely.
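A minimal sketch of retiring a block follows: valid data is copied elsewhere, the address table is updated, and the block is excluded from future writes. The data layout (a dictionary of page lists with None for a free page, a logical-to-physical mapping, and a set of retired block IDs) and the function names are assumptions for this illustration.

```python
def find_free_page(blocks, retired):
    """Return (block_id, page_index) of the first free page in a usable block."""
    for block_id, pages in blocks.items():
        if block_id in retired:
            continue
        for index, data in enumerate(pages):
            if data is None:
                return block_id, index
    raise RuntimeError("no free page available for relocation")


def retire_block(block_id, blocks, mapping, retired):
    """Relocate valid data out of block_id and mark it as never to be written."""
    retired.add(block_id)  # marked first, so relocation never targets this block
    for logical, (blk, page) in list(mapping.items()):
        if blk != block_id:
            continue
        dest_blk, dest_page = find_free_page(blocks, retired)
        blocks[dest_blk][dest_page] = blocks[blk][page]   # copy the valid data
        mapping[logical] = (dest_blk, dest_page)          # update the address table
```

A real drive would also need its garbage collection logic to skip retired blocks; here that is modeled only by the retired set.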
In fig. 7A-7B, some embodiments of the inventive concept are shown. However, those skilled in the art will recognize that other embodiments of the inventive concept are possible by changing the order of the blocks, by omitting blocks, or by including links not shown in the figures. All such variations of the flow diagrams, whether explicitly described or not, are considered embodiments of the inventive concept.
Embodiments of the inventive concept provide technical advantages over the prior art. First, embodiments of the inventive concept allow for fine-grained block failure prediction, which conventional systems cannot provide. Second, embodiments of the inventive concept reduce the likelihood of falsely identifying a block as predicted to fail by distinguishing between identification of a suspect block and verification that the suspect block is actually predicted to fail. Third, embodiments of the inventive concept enable verification that a suspect block is predicted to fail without requiring the large amount of computing resources associated with such predictions in conventional systems. Fourth, embodiments of the inventive concept allow verifying whether a particular block is predicted to fail without having to determine whether other blocks are also predicted to fail, thereby minimizing the computational resources used.
The following discussion is intended to provide a general description of a suitable machine or machines in which certain aspects of the present inventive concepts may be implemented. One or more machines may be controlled, at least in part, by input from conventional input devices (e.g., keyboard, mouse, etc.) and by instructions received from another machine, interaction with a Virtual Reality (VR) environment, biometric feedback, or other input signals. As used herein, the term "machine" is intended to broadly encompass a single machine, a virtual machine, or a system of communicatively coupled machines, virtual machines, or devices operating together. Exemplary machines include computing devices such as personal computers, workstations, servers, portable computers, handheld devices, telephones, tablets, and the like, as well as transportation devices such as private or public transportation (e.g., automobiles, trains, taxis, and the like).
One or more machines may include embedded controllers, such as programmable or non-programmable logic devices or arrays, Application Specific Integrated Circuits (ASICs), embedded computers, smart cards, and the like. One or more machines may utilize one or more connections to one or more remote machines, such as through a network interface, modem, or other communicative coupling. The machines may be interconnected by a physical and/or logical network, such as an intranet, the internet, a local area network, a wide area network, etc. Those skilled in the art will appreciate that network communications may utilize a variety of wired and/or wireless short-range or long-range carriers and protocols, including Radio Frequency (RF), satellite, microwave, Institute of Electrical and Electronics Engineers (IEEE) 802.11, optical, infrared, cable, laser, etc.
Embodiments of the inventive concepts may be described with reference to or in conjunction with associated data including functions, procedures, data structures, application programs, and the like, which when accessed by a machine results in the machine performing tasks or defining abstract data types or low-level hardware contexts. The associated data may be stored in, for example, volatile and/or non-volatile memory, such as RAM, ROM, etc., or other storage devices and their associated storage media, including hard disks, floppy disks, optical disk storage, tapes, flash memory, memory sticks, digital video disks, biological storage, etc. The associated data may be communicated through the transmission environment (including physical and/or logical networks) in the form of data packets, serial data, parallel data, propagated signals, etc., and may be used in a compressed or encrypted format. The relevant data may be used in a distributed environment and may be stored locally and/or remotely for access by the machine.
Embodiments of the inventive concepts may include a tangible, non-transitory, machine-readable medium comprising instructions executable by one or more processors, the instructions including instructions for performing elements of the inventive concepts as described herein.
The various operations of the methods described above may be performed by any suitable means capable of performing the operations, such as various hardware and/or software components, circuits, and/or modules. The software may comprise an ordered listing of executable instructions for implementing logical functions, and may be embodied in any "processor-readable medium" for use by or in connection with an instruction execution system, apparatus, or device, such as a single-core or multi-core processor or processor-containing system.
The blocks or steps of the methods, algorithms and functions described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a tangible, non-transitory computer-readable medium. A software module may reside in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), electrically programmable ROM (eprom), electrically erasable programmable ROM (eeprom), registers, hard disk, a removable disk, a CD ROM, or any other form of storage medium known in the art.
Having described and illustrated the principles of the inventive concept with reference to illustrated embodiments, it will be recognized that the illustrated embodiments can be modified in arrangement and detail without departing from such principles, and can be combined in any desired manner. Also, while the foregoing discussion has focused on particular embodiments, other configurations are contemplated. In particular, even though expressions such as "embodiments according to the inventive concept" are used herein, these phrases are intended to generally reference embodiment possibilities and are not intended to limit the inventive concept to particular embodiment configurations. As used herein, these terms may refer to the same or different embodiments that are combinable into other embodiments.
The foregoing illustrative embodiments should not be construed as limiting the inventive concepts thereof. Although a few embodiments have been described, those skilled in the art will readily appreciate that many modifications are possible in those embodiments without materially departing from the novel teachings and advantages of the present disclosure. Accordingly, all such modifications are intended to be included within the scope of the present inventive concept as defined in the claims.
Embodiments of the inventive concept can be extended to the following statements without limitation:
Statement 1. embodiments of the inventive concept include a Solid State Drive (SSD) comprising:
a flash memory for data, the flash memory organized into a plurality of blocks;
a controller for managing reading and writing of data from and to the flash memory;
a metadata store to store device-based log data for errors in the SSD; and
an identification firmware, executable on the processor, to identify a suspect block of the plurality of blocks in response to the device-based log data.
Statement 2. embodiments of the inventive concept include an SSD according to statement 1, wherein the metadata memory stores device-based log data for only a most recent set of errors in the SSD.
Statement 3. embodiments of the inventive concept include an SSD according to statement 2, wherein the oldest entry in the device-based log data is discarded when a new error occurs.
Statement 4. embodiments of the inventive concept include an SSD according to statement 2, wherein:
the metadata memory is further operable to store precise block-based data regarding errors in the SSD; and
the SSD also includes verification firmware executable on the processor, the verification firmware operable to determine whether a suspect block is predicted to fail in response to the accurate block-based data and the device-based log data.
Statement 5. embodiments of the inventive concept include an SSD according to statement 4, wherein the verification firmware is executed only for suspect blocks.
Statement 6. embodiments of the inventive concept include an SSD according to statement 4, wherein the verification firmware is not performed for any of the plurality of blocks other than the suspect block.
Statement 7. embodiments of the inventive concept include an SSD according to statement 4, wherein the verification firmware is operable to cull suspect blocks in response to the accurate block-based data and the device-based log data.
Statement 8. embodiments of the inventive concept include an SSD according to statement 4, wherein the precise block-based data includes a counter for a number of errors per block in the plurality of blocks.
Statement 9. embodiments of the inventive concept include an SSD according to statement 8, wherein the counters for the number of errors in each of the plurality of blocks include a read error counter, a write error counter, and an erase error counter for each of the plurality of blocks.
Statement 10. embodiments of the inventive concept include an SSD according to statement 8, wherein the precise block-based data includes a counter of a number of errors per block in the plurality of blocks since the SSD was manufactured.
Statement 11. embodiments of the inventive concept include an SSD according to statement 4, wherein the verification firmware applies one of random forest, logistic regression, outlier detection analysis, and anomaly detection analysis to the accurate block-based data and the device-based log data.
Statement 12. embodiments of the inventive concept include an SSD according to statement 4, wherein the identification firmware is operable to identify the suspect block of the plurality of blocks in response to both the device-based log data and the accurate block-based data.
Statement 13. embodiments of the inventive concept include an SSD according to statement 2, wherein the identification firmware is operable to derive approximate block-based data from the device-based log data.
Statement 14. embodiments of the inventive concept include an SSD according to statement 13, wherein the identification firmware is operable to determine the approximate block-based data from the device-based log data as average block-based data.
Statement 15. embodiments of the inventive concept include an SSD according to statement 2, wherein the SSD is operable to periodically execute the identification firmware.
Statement 16. embodiments of the inventive concept include an SSD according to statement 15, wherein the SSD is operable to execute the identification firmware at regular intervals.
Statement 17. embodiments of the inventive concept include an SSD according to statement 15, wherein the SSD is operable to execute the identification firmware after a regular number of errors have occurred.
Statement 18. Embodiments of the inventive concept include a Solid State Drive (SSD), comprising:
a flash memory for data, the flash memory organized into a plurality of blocks;
a controller for managing reading and writing of data from and to the flash memory;
a metadata memory for storing precise block-based data regarding errors in the SSD; and
an identification firmware, executable on the processor, for identifying a suspect block of the plurality of blocks in response to the precise block-based data.
Statement 19. Embodiments of the inventive concept include an SSD according to statement 18, wherein the identification firmware is operable to identify the suspect block in response to a total error count of the suspect block in the precise block-based data.
Statement 20. Embodiments of the inventive concept include an SSD according to statement 18, wherein the precise block-based data includes a counter for a number of errors for each of the plurality of blocks.
Statement 21. Embodiments of the inventive concept include an SSD according to statement 20, wherein the counters for the number of errors for each of the plurality of blocks include a read error counter, a write error counter, and an erase error counter for each of the plurality of blocks.
Statement 22. Embodiments of the inventive concept include an SSD according to statement 21, wherein the identification firmware is operable to calculate a total error count from the read error counter, the write error counter, and the erase error counter for the suspect block, and to compare the total error count to a threshold.
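The counter-based identification of statements 20 to 22 can be sketched as below. This is a minimal illustration, not the firmware itself: the class and function names are hypothetical, and flagging a block only when its total strictly exceeds the threshold is an assumption, since the statements say only that the total is compared to a threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockCounters:
    """Per-block error counters (precise block-based data)."""
    read_errors: int = 0
    write_errors: int = 0
    erase_errors: int = 0

    def total(self) -> int:
        # Total error count is the sum of the three counters.
        return self.read_errors + self.write_errors + self.erase_errors


def find_suspect_blocks(counters: Dict[int, BlockCounters], threshold: int) -> List[int]:
    """Return the IDs of blocks whose total error count exceeds the threshold."""
    return [
        block_id
        for block_id, c in sorted(counters.items())
        if c.total() > threshold  # assumption: strict comparison
    ]
```

For example, with a threshold of 5, a block with counters (3, 2, 1) is flagged while a block with counters (0, 5, 0) is not.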
Statement 23. Embodiments of the inventive concept include an SSD according to statement 18, wherein the SSD is operable to periodically execute the identification firmware.
Statement 24. Embodiments of the inventive concept include an SSD according to statement 23, wherein the SSD is operable to execute the identification firmware at regular intervals.
Statement 25. Embodiments of the inventive concept include an SSD according to statement 23, wherein the SSD is operable to execute the identification firmware after a regular number of errors have occurred.
Statement 26. Embodiments of the inventive concept include a method, comprising:
tracking errors in a Solid State Drive (SSD), the SSD comprising a plurality of blocks;
storing device-based log data regarding errors in the SSD; and
identifying a suspect block of the plurality of blocks in response to the device-based log data.
Statement 27. Embodiments of the inventive concept include the method according to statement 26, wherein storing device-based log data regarding errors in the SSD includes storing device-based log data for only a most recent set of errors in the SSD.
Statement 28. Embodiments of the inventive concept include the method according to statement 27, wherein storing device-based log data regarding errors in the SSD further comprises discarding the oldest entry in the device-based log data when a new error occurs in the SSD.
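Keeping only the most recent set of errors and discarding the oldest entry when a new error arrives behaves like a fixed-capacity ring buffer. The sketch below assumes a hypothetical log entry of the form (block ID, error kind); the source does not specify the entry format or the capacity.

```python
from collections import deque
from typing import Deque, List, Tuple


class DeviceErrorLog:
    """Fixed-capacity log holding only the most recent errors."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen drops its oldest entry automatically when full.
        self._entries: Deque[Tuple[int, str]] = deque(maxlen=capacity)

    def record(self, block_id: int, kind: str) -> None:
        """Append a new error; the oldest entry is discarded if the log is full."""
        self._entries.append((block_id, kind))

    def entries(self) -> List[Tuple[int, str]]:
        """Return the retained entries, oldest first."""
        return list(self._entries)
```

Bounding the log this way keeps the metadata footprint constant regardless of how many errors the drive accumulates over its lifetime.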
Statement 29. Embodiments of the inventive concept include a method according to statement 27, further comprising:
storing precise block-based data regarding errors in the SSD; and
once the suspect block is identified, determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
Statement 30. Embodiments of the inventive concept include a method according to statement 29, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data only for the suspect block.
Statement 31. Embodiments of the inventive concept include a method according to statement 29, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises not determining whether any other block is predicted to fail.
Statement 32. Embodiments of the inventive concept include a method according to statement 29, further comprising retiring the suspect block based at least in part on the precise block-based data and the device-based log data.
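The two-stage flow of statements 29 to 32 (verify only the blocks already flagged as suspect, then retire those predicted to fail) might be sketched as follows. The predictor is passed in as a callable because the source leaves its form open; `verify_and_retire` and its parameters are hypothetical names.

```python
from typing import Callable, Iterable, List, Set


def verify_and_retire(
    suspects: Iterable[int],
    predict_failure: Callable[[int], bool],
    retired: Set[int],
) -> List[int]:
    """Run the predictor on suspect blocks only and retire predicted failures."""
    newly_retired: List[int] = []
    for block_id in suspects:
        if block_id in retired:
            continue  # already out of service, nothing to verify
        if predict_failure(block_id):
            retired.add(block_id)
            newly_retired.append(block_id)
    return newly_retired
```

Because the predictor is never called for non-suspect blocks, its cost scales with the number of suspects rather than with the total number of blocks in the drive.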
Statement 33. Embodiments of the inventive concept include a method according to statement 29, wherein storing precise block-based data regarding errors in the SSD includes storing a counter for a number of errors for each block in the plurality of blocks.
Statement 34. Embodiments of the inventive concept include the method according to statement 33, wherein storing a counter for a number of errors for each block in the plurality of blocks comprises storing a read error counter, a write error counter, and an erase error counter for each block in the plurality of blocks.
Statement 35. Embodiments of the inventive concept include the method according to statement 33, wherein storing a counter for a number of errors for each block in the plurality of blocks comprises storing a counter for a number of errors for each block in the plurality of blocks since the SSD was manufactured.
Statement 36. Embodiments of the inventive concept include a method according to statement 29, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises applying one of random forest, logistic regression, outlier detection analysis, and anomaly detection analysis to the precise block-based data and the device-based log data.
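Of the techniques listed in statement 36, outlier detection is the simplest to illustrate. The sketch below uses a z-score test on total error counts; the choice of feature, the z-score statistic, and the threshold parameter are all assumptions, since the source names the family of techniques without fixing any of them.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_outlier(suspect_total: float, all_totals: Sequence[float], z_threshold: float) -> bool:
    """Flag the suspect if its error count lies far above the population mean."""
    mu = mean(all_totals)
    sigma = pstdev(all_totals)  # population standard deviation
    if sigma == 0:
        return False  # all blocks identical, so nothing stands out
    return (suspect_total - mu) / sigma > z_threshold
```

With a population of size n that includes the suspect, the z-score computed this way can never exceed the square root of n - 1, so the threshold has to be chosen with the population size in mind.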
Statement 37. Embodiments of the inventive concept include a method according to statement 29, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to spatial locality information of the suspect block.
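The source does not say how spatial locality information is represented. One possible reading, assumed here purely for illustration, is that errors in physically nearby blocks are taken into account. The sketch treats block IDs as a linear address space and sums the error totals of the suspect's neighbors within a given radius; both the addressing model and the function name are hypothetical.

```python
from typing import Dict


def neighbor_error_total(totals: Dict[int, int], block_id: int, radius: int = 1) -> int:
    """Sum the error totals of blocks within `radius` of the suspect, excluding itself."""
    return sum(
        totals.get(b, 0)  # blocks with no recorded errors count as zero
        for b in range(block_id - radius, block_id + radius + 1)
        if b != block_id
    )
```

A value of this kind could serve as one additional input feature to whichever prediction technique is used.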
Statement 38. Embodiments of the inventive concept include the method according to statement 27, wherein identifying the suspect block of the plurality of blocks in response to the device-based log data comprises deriving approximate block-based data from the device-based log data.
Statement 39. Embodiments of the inventive concept include the method according to statement 38, wherein deriving the approximate block-based data from the device-based log data comprises determining average block-based data from the device-based log data.
Statement 40. Embodiments of the inventive concept include a method according to statement 27, further comprising periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data.
Statement 41. Embodiments of the inventive concept include a method according to statement 40, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data includes checking for suspect blocks in the plurality of blocks at regular time intervals.
Statement 42. Embodiments of the inventive concept include a method according to statement 40, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data includes checking for suspect blocks in the plurality of blocks after a regular number of errors have occurred.
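The two periodic triggers of statements 41 and 42 can be sketched as two small helper classes. The class names, the use of seconds as the time unit, and the caller-supplied clock value are assumptions made for the sake of a self-contained example.

```python
class IntervalTrigger:
    """Fires when at least `interval_s` seconds have passed since the last firing."""

    def __init__(self, interval_s: float, start: float = 0.0) -> None:
        self.interval_s = interval_s
        self.next_due = start + interval_s

    def poll(self, now: float) -> bool:
        if now >= self.next_due:
            self.next_due = now + self.interval_s
            return True
        return False


class ErrorCountTrigger:
    """Fires once every `every_n_errors` recorded errors."""

    def __init__(self, every_n_errors: int) -> None:
        self.every_n_errors = every_n_errors
        self.count = 0

    def on_error(self) -> bool:
        self.count += 1
        if self.count >= self.every_n_errors:
            self.count = 0  # start counting toward the next check
            return True
        return False
```

Whenever either trigger returns `True`, the caller would run the suspect-block identification pass.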
Statement 43. Embodiments of the inventive concept include a method, comprising:
tracking errors in a Solid State Drive (SSD), the SSD comprising a plurality of blocks;
storing precise block-based data regarding errors in the SSD; and
identifying a suspect block of the plurality of blocks in response to the precise block-based data.
Statement 44. Embodiments of the inventive concept include the method according to statement 43, wherein identifying the suspect block of the plurality of blocks in response to the precise block-based data comprises:
calculating a total error count for the suspect block from the precise block-based data; and
comparing the total error count to a threshold error count.
Statement 45. Embodiments of the inventive concept include a method according to statement 44, wherein calculating the total error count for the suspect block from the precise block-based data comprises:
determining a read error counter, a write error counter, and an erase error counter for the suspect block from the precise block-based data; and
summing the read error counter, the write error counter, and the erase error counter to calculate the total error count for the suspect block.
Statement 46. Embodiments of the inventive concept include a method according to statement 43, further comprising periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data.
Statement 47. Embodiments of the inventive concept include a method according to statement 46, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data includes checking for suspect blocks in the plurality of blocks at regular intervals.
Statement 48. Embodiments of the inventive concept include a method according to statement 46, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data includes checking for suspect blocks in the plurality of blocks after a regular number of errors have occurred.
Statement 49. Embodiments of the inventive concept include an article comprising a non-transitory storage medium having stored thereon instructions that, when executed by a machine, result in:
tracking errors in a Solid State Drive (SSD), the SSD comprising a plurality of blocks;
storing device-based log data regarding errors in the SSD; and
identifying a suspect block of the plurality of blocks in response to the device-based log data.
Statement 50. Embodiments of the inventive concept include an article according to statement 49, wherein storing device-based log data regarding errors in the SSD includes storing device-based log data for only a most recent set of errors in the SSD.
Statement 51. Embodiments of the inventive concept include an article according to statement 50, wherein storing device-based log data regarding errors in the SSD further comprises discarding the oldest entry in the device-based log data when a new error occurs in the SSD.
Statement 52. Embodiments of the inventive concept include an article according to statement 50, wherein the non-transitory storage medium has stored thereon further instructions that, when executed by the machine, result in:
storing precise block-based data regarding errors in the SSD; and
once the suspect block is identified, determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
Statement 53. Embodiments of the inventive concept include an article according to statement 52, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data only for the suspect block.
Statement 54. Embodiments of the inventive concept include an article according to statement 52, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises not determining whether any other block is predicted to fail.
Statement 55. Embodiments of the inventive concept include an article according to statement 52, wherein the non-transitory storage medium has stored thereon further instructions that, when executed by the machine, result in retiring the suspect block based at least in part on the precise block-based data and the device-based log data.
Statement 56. Embodiments of the inventive concept include an article according to statement 52, wherein storing precise block-based data regarding errors in the SSD includes storing a counter for a number of errors for each block in the plurality of blocks.
Statement 57. Embodiments of the inventive concept include an article according to statement 56, wherein storing a counter for a number of errors for each block in the plurality of blocks comprises storing a read error counter, a write error counter, and an erase error counter for each block in the plurality of blocks.
Statement 58. Embodiments of the inventive concept include an article according to statement 56, wherein storing a counter for a number of errors for each block in the plurality of blocks comprises storing a counter for a number of errors for each block in the plurality of blocks since the SSD was manufactured.
Statement 59. Embodiments of the inventive concept include an article according to statement 52, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises applying one of random forest, logistic regression, outlier detection analysis, and anomaly detection analysis to the precise block-based data and the device-based log data.
Statement 60. Embodiments of the inventive concept include an article according to statement 52, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to spatial locality information of the suspect block.
Statement 61. Embodiments of the inventive concept include an article according to statement 50, wherein identifying the suspect block of the plurality of blocks in response to the device-based log data comprises deriving approximate block-based data from the device-based log data.
Statement 62. Embodiments of the inventive concept include an article according to statement 61, wherein deriving the approximate block-based data from the device-based log data comprises determining average block-based data from the device-based log data.
Statement 63. Embodiments of the inventive concept include an article according to statement 50, wherein the non-transitory storage medium has stored thereon further instructions that, when executed by the machine, result in periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data.
Statement 64. Embodiments of the inventive concept include an article according to statement 63, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data comprises checking for suspect blocks in the plurality of blocks at regular time intervals.
Statement 65. Embodiments of the inventive concept include an article according to statement 63, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data comprises checking for suspect blocks in the plurality of blocks after a regular number of errors have occurred.
Statement 66. Embodiments of the inventive concept include an article comprising a non-transitory storage medium having stored thereon instructions that, when executed by a machine, result in:
tracking errors in a Solid State Drive (SSD), the SSD comprising a plurality of blocks;
storing precise block-based data regarding errors in the SSD; and
identifying a suspect block of the plurality of blocks in response to the precise block-based data.
Statement 67. Embodiments of the inventive concept include an article according to statement 66, wherein identifying the suspect block of the plurality of blocks in response to the precise block-based data comprises:
calculating a total error count for the suspect block from the precise block-based data; and
comparing the total error count to a threshold error count.
Statement 68. Embodiments of the inventive concept include an article according to statement 67, wherein calculating the total error count for the suspect block from the precise block-based data comprises:
determining a read error counter, a write error counter, and an erase error counter for the suspect block from the precise block-based data; and
summing the read error counter, the write error counter, and the erase error counter to calculate the total error count for the suspect block.
Statement 69. Embodiments of the inventive concept include an article according to statement 66, wherein the non-transitory storage medium has stored thereon further instructions that, when executed by the machine, result in periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data.
Statement 70. Embodiments of the inventive concept include an article according to statement 69, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data comprises checking for suspect blocks in the plurality of blocks at regular intervals.
Statement 71. Embodiments of the inventive concept include an article according to statement 69, wherein periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data comprises checking for suspect blocks in the plurality of blocks after a regular number of errors have occurred.
Consequently, in view of the wide variety of permutations of the embodiments described herein, this detailed description and the accompanying material are intended to be illustrative only and should not be taken as limiting the scope of the inventive concept. What is claimed as the inventive concept, therefore, is all such modifications as may come within the scope and spirit of the following claims and their equivalents.

Claims (20)

1. A Solid State Drive (SSD), comprising:
a flash memory for data, the flash memory organized into a plurality of blocks;
a controller for managing reading and writing of data from and to the flash memory;
a metadata storage for storing device-based log data regarding errors in the SSD; and
an identification firmware, executing on the processor, operable to identify a suspect block of the plurality of blocks in response to the device-based log data.
2. The SSD of claim 1, wherein the metadata storage stores device-based log data only for a most recent set of errors in the SSD.
3. The SSD of claim 2, wherein:
the metadata storage is further operable to store precise block-based data regarding errors in the SSD; and
the SSD further includes verification firmware executing on the processor, the verification firmware operable to determine whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
4. The SSD of claim 3, wherein the verification firmware is executed only on the suspect block.
5. The SSD of claim 3, wherein the verification firmware is operable to retire the suspect block in response to both the precise block-based data and the device-based log data.
6. The SSD of claim 3, wherein the verification firmware implements one of random forest, logistic regression, outlier detection analysis, and anomaly detection analysis on the precise block-based data and the device-based log data.
7. The SSD of claim 2, wherein the identification firmware is operable to derive approximate block-based data from the device-based log data.
8. The SSD of claim 2, wherein the SSD is operable to periodically execute the identification firmware.
9. A method, comprising:
tracking errors in a Solid State Drive (SSD), the SSD comprising a plurality of blocks;
storing device-based log data regarding errors in the SSD; and
identifying a suspect block of the plurality of blocks in response to the device-based log data.
10. The method of claim 9, wherein storing device-based log data regarding errors in the SSD comprises storing device-based log data only for a most recent set of errors in the SSD.
11. The method of claim 10, further comprising:
storing precise block-based data regarding errors in the SSD; and
once the suspect block is identified, determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
12. The method of claim 11, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data only for the suspect block.
13. The method of claim 11, further comprising retiring the suspect block based at least in part on the precise block-based data and the device-based log data.
14. The method of claim 11, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises performing one of random forest, logistic regression, outlier detection analysis, and anomaly detection analysis on the precise block-based data and the device-based log data.
15. The method of claim 10, wherein identifying the suspect block of the plurality of blocks in response to the device-based log data comprises deriving approximate block-based data from the device-based log data.
16. The method of claim 10, further comprising periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data.
17. An article comprising a non-transitory storage medium having stored thereon instructions that when executed by a machine result in:
tracking errors in a Solid State Drive (SSD), the SSD comprising a plurality of blocks;
storing device-based log data regarding errors in the SSD; and
identifying a suspect block of the plurality of blocks in response to the device-based log data.
18. The article of claim 17, wherein storing device-based log data regarding errors in the SSD comprises storing device-based log data only for a most recent set of errors in the SSD.
19. The article of claim 18, wherein the non-transitory storage medium has stored thereon further instructions that, when executed by the machine, result in:
storing precise block-based data regarding errors in the SSD; and
once the suspect block is identified, determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
20. The article of claim 19, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data only for the suspect block.
CN202011144198.2A 2019-10-25 2020-10-23 Firmware-based solid state drive block failure prediction and avoidance scheme Pending CN112711492A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962926420P 2019-10-25 2019-10-25
US62/926,420 2019-10-25
US16/701,133 US11567670B2 (en) 2019-10-25 2019-12-02 Firmware-based SSD block failure prediction and avoidance scheme
US16/701,133 2019-12-02

Publications (1)

Publication Number Publication Date
CN112711492A true CN112711492A (en) 2021-04-27

Family

ID=75542971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011144198.2A Pending CN112711492A (en) 2019-10-25 2020-10-23 Firmware-based solid state drive block failure prediction and avoidance scheme

Country Status (1)

Country Link
CN (1) CN112711492A (en)

Similar Documents

Publication Publication Date Title
US11567670B2 (en) Firmware-based SSD block failure prediction and avoidance scheme
EP3125120B1 (en) System and method for consistency verification of replicated data in a recovery system
EP3098715B1 (en) System and method for object-based continuous data protection
Mahdisoltani et al. Proactive error prediction to improve storage system reliability
US11500752B2 (en) Multi-non-volatile memory solid state drive block-level failure prediction with separate log per non-volatile memory
US8122185B2 (en) Systems and methods for measuring the useful life of solid-state storage devices
CN103890724B (en) Information processing apparatus, method for controlling information processing apparatus, host device, and performance evaluation method used for external storage device
US20130339569A1 (en) Storage System and Method for Operating Thereof
US9063662B1 (en) Method and system for monitoring disk reliability with global disk scrubbing
US10481988B2 (en) System and method for consistency verification of replicated data in a recovery system
US9535779B1 (en) Method and system for predicting redundant array of independent disks (RAID) vulnerability
US11734103B2 (en) Behavior-driven die management on solid-state drives
KR102031606B1 (en) Versioned memory implementation
CN110321067B (en) System and method for estimating and managing storage device degradation
US11768701B2 (en) Exception analysis for data storage devices
CN112650446A (en) Intelligent storage method, device and equipment of NVMe full flash memory system
CN101752008B (en) Method for testing reliability of solid-state storage media
CN112711492A (en) Firmware-based solid state drive block failure prediction and avoidance scheme
US10310948B2 (en) Evaluation of risk of data loss and backup procedures
Vishwakarma et al. Enterprise Disk Drive Scrubbing Based on Mondrian Conformal Predictors
Missimer Exploiting solid state drive parallelism for real-time flash storage
EP3547139A1 (en) System and method of assessing and managing storage device degradation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination