CN116893985A - System and method for pre-populating an address translation cache


Info

Publication number
CN116893985A
Authority
CN
China
Prior art keywords
storage device
command
timer
address
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310375755.9A
Other languages
Chinese (zh)
Inventor
D. L. Helmick
V. K. Agrawal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/879,713 external-priority patent/US20230325321A1/en
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN116893985A publication Critical patent/CN116893985A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/10 Address translation
    • G06F 12/1009 Address translation using page tables, e.g. page table structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/10 Address translation
    • G06F 12/1027 Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
    • G06F 12/1045 Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB], associated with a data cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1668 Details of memory controller
    • G06F 13/1694 Configuration of memory controller to different memory types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026 PCI express

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Systems and methods for processing commands from a host computing device to a storage device are disclosed. The method includes: identifying, by the storage device, a command from the host computing device, the command including a logical address; detecting a condition; requesting, by the storage device, translation of the logical address to a physical address based on the detected condition; storing, by the storage device, the physical address in a cache; and transferring data according to the command based on the physical address.

Description

System and method for pre-populating an address translation cache
Cross Reference to Related Applications
This application claims priority to and the benefit of U.S. Provisional Application No. 63/329,755, entitled "JUST IN TIME ATC PRE-POPULATION FOR READS," filed on April 11, 2022, the entire content of which is incorporated herein by reference.
Technical Field
One or more aspects in accordance with embodiments of the present disclosure relate to managing memory, and more particularly to managing utilization of a cache memory storing physical memory addresses associated with virtual memory addresses.
Background
The host device may interact with the storage device in one or more virtual memory address spaces. Memory shared for various purposes, such as submitting and completing commands or delivering and receiving data, may require the device to translate a host virtual address to a physical address so that the data can be properly located. Read and write requests may include virtual memory addresses. A virtual memory address is translated to a physical memory address by a translation agent, and the translated physical address may be stored in an Address Translation Cache (ATC) of the storage device. Cache space is typically limited. Thus, it may be desirable to populate the ATC with physical memory addresses in a manner that uses the cache efficiently and effectively.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and, therefore, it may contain information that does not form the prior art.
Disclosure of Invention
Embodiments of the present disclosure relate to a method for processing commands from a host computing device to a storage device. The method includes: identifying, by the storage device, a command from the host computing device, the command including a logical address; detecting a condition; requesting, by the storage device, translation of the logical address to a physical address based on the detected condition; storing, by the storage device, the physical address in a cache; and transferring data according to the command based on the physical address.
According to one embodiment, the method comprises: a first timer is started by the storage device in response to identifying the command, wherein the condition includes detecting expiration of the first timer.
According to one embodiment, the length of the first timer is shorter than the predicted latency of processing the command. The length of the first timer may be dynamically computed based on the number of active commands to be processed.
According to one embodiment, the method further includes: setting, by the storage device, a second timer; detecting expiration of the second timer; and requesting, by the storage device, translation of the logical address to the physical address based on expiration of the second timer, wherein a length of the second timer is based on an expected length of time to complete an event during processing of the command by the storage device.
According to one embodiment, the event includes invoking an error recovery action.
According to one embodiment, the command includes a command to read data from the storage device, wherein the logical address identifies a memory location of the host computing device where the data is to be stored.
According to one embodiment, the method further includes monitoring progress of execution of the command, wherein detecting the condition includes determining that a milestone has been reached.
According to one embodiment, monitoring includes monitoring a plurality of steps performed for a command.
According to one embodiment, the milestone includes completing sensing of signals from one or more memory cells of the storage device.
Embodiments of the present disclosure also relate to a storage device including a cache, and a processor coupled to the cache, the processor configured to execute logic that causes the processor to: identify a command from the host computing device, the command including a logical address; detect a condition; request translation of the logical address to a physical address based on the detected condition; store the physical address in the cache; and transfer data according to the command based on the physical address.
Embodiments of the present disclosure also relate to a method for processing commands from a host computing device to a storage device, the method including: identifying, by the storage device, a plurality of write commands from the host computing device; storing, by the storage device, the plurality of write commands in a queue of the storage device; selecting, by the storage device, a write command from the queue; identifying, by the storage device, a logical address of the write command selected from the queue; requesting, by the storage device, translation of the logical address to a physical address; storing, by the storage device, the physical address in a cache; and transferring data according to the write command selected from the queue based on the physical address.
According to one embodiment, the method further comprises determining that the write buffer is full, wherein storing the write command in the queue is in response to determining that the write buffer is full.
Embodiments of the present disclosure also relate to a method for processing commands from a host computing device to a storage device, the method including: identifying, by the storage device, a write command from the host computing device, wherein the write command includes a first logical address; and processing a first data transfer based on the first logical address, wherein processing the first data transfer includes: identifying a second logical address; requesting, by the storage device, translation of the second logical address to a physical address; storing, by the storage device, the physical address in a cache; and processing a second data transfer based on the second logical address.
According to one embodiment, processing the second data transfer includes: retrieving the physical address from the cache; and transferring the data from the physical address to a location in the storage device.
According to one embodiment, the first logical address is stored in a first data structure and the second logical address is stored in a second data structure.
As will be appreciated by those skilled in the art, embodiments of the present disclosure improve the efficiency of processing commands from a host computing device.
These and other features, aspects, and advantages of the embodiments of the present disclosure will become more fully understood when considered in connection with the following detailed description, appended claims, and accompanying drawings. The actual scope of the invention is, of course, defined in the appended claims.
Drawings
Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified.
FIG. 1 is a block diagram of a system for pre-populating an address cache, according to one embodiment;
FIG. 2 is a block diagram of a storage device according to one embodiment;
FIGS. 3A-3B are flowcharts of a read flow according to one embodiment;
FIG. 4 is a flowchart of a process for pre-populating an Address Translation Cache (ATC) based on a timer, according to one embodiment;
FIG. 5 is a flow diagram of a process for populating address translations in ATCs, according to one embodiment;
FIG. 6 is a flow diagram of a process for pre-populating an ATC based on at least two timers, according to one embodiment;
FIG. 7 is a flowchart of a process for pre-populating an ATC based on monitoring the progress of a read process flow, in accordance with one embodiment;
FIG. 8 is a flowchart of a process for pre-populating ATCs in a relatively high queue depth environment in which multiple write commands can be processed at a given time, according to one embodiment; and
FIG. 9 is a flowchart of a process for pre-populating an ATC for write commands that invoke a relatively long list of memory locations, according to one embodiment.
Detailed Description
Hereinafter, example embodiments will be described in more detail with reference to the drawings, wherein like reference numerals denote like elements throughout. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided as examples so that this disclosure will be thorough and complete, and will fully convey the aspects and features of the disclosure to those skilled in the art. Thus, processes, elements, and techniques not necessary for a complete understanding of aspects and features of the present disclosure by those of ordinary skill in the art may not be described. Unless otherwise indicated, like reference numerals designate like elements throughout the drawings and written description, and thus, the description thereof may not be repeated. In addition, in the drawings, the relative sizes of elements, layers and regions may be exaggerated and/or simplified for clarity.
The host device may interact with the storage device in one or more virtual memory address spaces. For example, shared memory may be used to submit and complete I/O commands such as read or write commands. Communication between the host device and the storage device may be via an interface (e.g., a connector and its protocols), such as the Non-Volatile Memory Express (NVMe) protocol, the Serial Attached SCSI (SAS) protocol, the Serial Advanced Technology Attachment (SATA) protocol, and so on.
In one embodiment, an I/O command from the host identifies a virtual address. For example, the virtual address of a read command identifies the location in host memory where data read from the storage device is to be stored, and the virtual address of a write command identifies the location in host memory from which data to be written to the storage device is to be retrieved. Memory accesses of host memory performed by the storage device while executing read and write commands may be direct memory accesses (DMAs) that bypass the central processing unit (CPU) of the host.
In one embodiment, the virtual address in the read/write command is translated by a Translation Agent (TA) to a physical memory address of host memory. Once translated, the physical memory address is stored in an Address Translation Cache (ATC) of the storage device for use by DMA read and write operations. The controller of the storage device may check whether the ATC has a physical memory address corresponding to a virtual address in the read/write command. If there is a physical memory address in the ATC that corresponds to a virtual address in the read/write command, the controller may use the translated physical memory address to access host memory.
In one embodiment, the virtual address translated to the physical address is stored in the ATC of the storage device. In order for the controller of the storage device to access host memory in response to a read or write request, the controller checks the ATC to determine if there is a virtual to physical address translation. If translation is present, the controller uses the translated physical address for memory reads and no separate request for address translation needs to be issued to the host.
Address translation of virtual memory addresses to physical memory addresses may add latency to read and/or write operations. Memory reads by the storage device in the virtualized addressing environment must wait until address translation is completed before the memory location is accessed. The storage device must request Address Translation (AT) from the TA. The TA must identify the correct translation from the Address Translation and Protection Table (ATPT) and the TA must provide a response to the address translation request. Since the read or write data will be maintained in the virtual address space by the host, the process of translating virtual addresses may block the flow of read/write commands.
In the case where the virtual address does not exist in the ATPT, the delay is longer. As described above, the storage device must request the AT from the TA. The TA will not find the correct translation in the ATPT and will respond to the original AT request with a failure. On failure, the device sends a Page Request Interface (PRI) request to the TA. The TA then communicates with the rest of the host system in a vendor-specific manner to populate the virtual-to-physical translation into the ATPT, and eventually sends a completion response to the device's PRI request. The device sends a second AT request to the TA, which is now able to find the virtual address in the ATPT. This makes a missing translation an event that further blocks access to the memory location.
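The AT and PRI exchange described above can be sketched in pseudocode form. This is a hypothetical illustration, not the disclosed implementation; the class and function names (`TranslationAgent`, `resolve_address`) and the toy ATPT population logic are assumptions for the sake of the example.

```python
# Hypothetical sketch of the device-side translation flow: check the ATC,
# issue an Address Translation (AT) request on a miss, and fall back to a
# Page Request Interface (PRI) round trip when the mapping is absent from
# the host's ATPT. All names are illustrative.

class TranslationAgent:
    """Host-side TA backed by a toy ATPT (virtual -> physical dict)."""
    def __init__(self, atpt):
        self.atpt = atpt

    def translate(self, virt):
        # Returns the physical address, or None to signal an AT failure.
        return self.atpt.get(virt)

    def handle_pri(self, virt):
        # The host populates the ATPT in a vendor-specific way; modeled
        # here by simply inventing a mapping.
        self.atpt[virt] = 0x1000 + len(self.atpt) * 0x1000


def resolve_address(atc, ta, virt):
    """Return a physical address, filling the ATC as a side effect."""
    if virt in atc:                      # ATC hit: no host round trip
        return atc[virt]
    phys = ta.translate(virt)            # first AT request
    if phys is None:                     # AT failed: mapping not in ATPT
        ta.handle_pri(virt)              # PRI request; TA fills the ATPT
        phys = ta.translate(virt)        # second AT request now succeeds
    atc[virt] = phys                     # cache the translation
    return phys
```

The blocking cost the disclosure describes corresponds to the miss path here: two AT round trips plus a PRI round trip before the DMA can proceed.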
Once the translated address is stored in the ATC, the length of time it remains in the ATC prior to use during DMA may depend on the size of the ATC and/or the number of active read/write commands that the storage device may handle at a given time. For example, in a relatively high queue depth (QD) environment with a relatively high number of active commands submitted by the host to a submission queue (each entry hereinafter referred to as a submission queue entry (SQE)), the ATC may be full while requests are being processed, such that older translated addresses of older requests are evicted from the ATC to make room for newer translated addresses of newer requests. Another common use of the ATC is to speculatively retain previously used address translations in case they are needed again. Both the higher rate of ATC fill requests and the increased desire to retain recently useful address translations may result in an eviction rate from the ATC that is higher than optimal. A higher eviction rate lowers the probability that a required address translation is present in the ATC when the SSD needs to access host memory. A larger ATC may help mitigate this problem, but an increased ATC size occupies a larger area of the static random access memory (SRAM) storing the ATC, resulting in increased cost and power usage.
Embodiments of the present disclosure relate to systems and methods for coordinating early population (also referred to as pre-population) of the ATC so that address translation requests (and PRI requests) do not become blocking activity during direct memory access. In one embodiment, a timer is employed to trigger address translation requests. The timer may be set to expire after an SQE containing an input/output (I/O) command (e.g., a read command) has been received and parsed, but before execution of the command terminates. Expiration of the timer may trigger sending an address translation request for the virtual address included in the I/O command. The timer may be set such that the translated address is inserted into the ATC substantially close in time to its use by the controller in performing memory accesses according to the I/O command.
In one embodiment, the address translation request is triggered based on the detected progress of processing an I/O command (such as a read command) in the parsed SQE. Monitoring the progress of the I/O command flow may allow the timing of the address translation request to be dynamic and matched to the command's actual progress. For example, if the I/O command flow encounters blocking activity early in the flow, the translation of the memory address may not be triggered until the flow has progressed to a later milestone.
For a read command, the steps of the command flow that may be monitored by the controller to determine whether an address translation request should be submitted may generally include: 1) receiving, by a controller of the storage device, the SQE; 2) parsing the command to identify one or more physical locations in the storage device from which data is to be retrieved; 3) reading the data from the medium (e.g., read sensing of data stored in a NAND die); 4) transferring the data from the medium to the controller; 5) performing error correction code (ECC) decoding to check for and/or remove any accumulated errors; and 6) performing a direct memory access to host memory to provide the read data. In one embodiment, the controller monitors the progress of these read command steps and initiates an address translation request to the host after detecting that the NAND has completed the read sense and before the sensed data is transferred from the NAND to the controller. This may allow the translated address to be present in the ATC by the time the direct memory access of step 6 is performed, so that the read command is not delayed.
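The milestone-based trigger described above can be sketched as follows. This is an illustrative model, not the disclosed firmware: the step names, the `run_read` function, and the choice of "read_sense" as the milestone are assumptions drawn from the step list in the text.

```python
# Illustrative sketch of milestone-based pre-population: the controller
# walks the read-command steps and fires the address translation request
# once the "sense complete" milestone is reached, so the translation is
# already in the ATC before the DMA step. Names are hypothetical.

READ_STEPS = [
    "receive_sqe",
    "parse_command",
    "read_sense",            # milestone: NAND sensing completes here
    "transfer_to_controller",
    "ecc_decode",
    "dma_to_host",
]

def run_read(virt, atc, request_translation, milestone="read_sense"):
    """Execute the read flow; pre-populate the ATC at the milestone."""
    log = []
    for step in READ_STEPS:
        log.append(step)
        if step == milestone and virt not in atc:
            # Fire the translation request early, overlapping it with
            # the remaining media-side steps.
            atc[virt] = request_translation(virt)
        if step == "dma_to_host":
            # By now the translation should already be cached.
            assert virt in atc, "translation missing at DMA time"
    return log
```

Because the trigger is tied to progress rather than wall-clock time, a stall earlier in the flow (e.g., a busy die) naturally delays the translation request as well.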
In one embodiment, the address translation request is initiated in response to expiration of a plurality of timers. For example, the first timer may be set as described above to initiate address translation before the read command is completed. In addition to the first timer, a second timer may be set (e.g., simultaneously with the setting of the first timer). The second timer may be set to account for additional latency that may be introduced during execution of the read command.
In one embodiment, additional latency is introduced by a detected event. The additional latency may be predictable, and the second timer may be set based on the predicted/expected latency. For example, the length of the second timer may be set shorter than the predicted latency, and an address translation request may be triggered when the second timer expires. In this way, even if the translated address obtained after expiration of the first timer is evicted from the ATC (e.g., due to the additional latency introduced by the detected event), use of the second timer may allow the address to be refilled into the ATC in time for the memory access performed according to the read command.
In one embodiment, the values of the first timer and/or the second timer are fixed preset values, and the timers start when the SQE is received and parsed. In one embodiment, the values of the first timer and/or the second timer are dynamically determined before being set. In one embodiment, the value of a timer is a function of the read QD and/or the write QD, where the read QD represents the number of active read commands in the submission queue that can be processed by the storage device at a given time, and the write QD represents the number of active write commands in the submission queue. In one embodiment, the length of the second timer is greater than the length of the first timer. In some embodiments, the storage device may monitor the average time to complete read sensing and re-evaluate the first timer and/or the second timer. In some embodiments, the average time may be associated with a particular condition detected by the storage device, and the first timer and/or the second timer may be set based on that average time.
Write commands may also benefit from ATC pre-population. For example, pre-population of the ATC for write commands may be beneficial in a fairly high QD environment (e.g., QD > 50). In such an environment, the controller may fill an internal write command queue with active write commands fetched from the submission queue while waiting for certain controller resources (e.g., a write buffer that holds the write data before it is programmed to NAND) to become available. Write commands from the internal write command queue may be processed one at a time in a first-in, first-out manner. In this case, address translation requests may be started for X of the N total write commands in the internal write queue, and the translated addresses placed in the ATC. By the time a write command progresses through the controller, its address translation may already have completed, and the translated address will be present in the ATC.
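A minimal sketch of this queue-ahead pre-population follows. It is hypothetical: the function names, the dict-based command representation, and the `depth` parameter (the "X" of the text) are assumptions, not the disclosed design.

```python
# Hypothetical sketch of ATC pre-population for a high-QD write path:
# translations are requested for the first `depth` commands in the
# internal FIFO write queue, so each command's address is already cached
# by the time it reaches the head of the queue.

from collections import deque

def prepopulate_write_queue(queue, atc, request_translation, depth=4):
    """Request translations for up to `depth` queued write commands."""
    for cmd in list(queue)[:depth]:
        if cmd["virt"] not in atc:
            atc[cmd["virt"]] = request_translation(cmd["virt"])

def pop_and_write(queue, atc):
    """FIFO pop; the translation is expected to be an ATC hit."""
    cmd = queue.popleft()
    return atc.get(cmd["virt"])  # None would mean a (costly) ATC miss
```

In practice `depth` would be tuned against ATC capacity: pre-populating too many queued commands recreates the eviction pressure the scheme is meant to avoid.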
According to one embodiment, pre-population of the ATC may also enable more efficient processing of write commands that invoke a relatively long list of memory locations. For example, the SQE for a large write command may use a linked list of data structures identified via Physical Region Page (PRP) entries or Scatter Gather List (SGL) segments. Each of these PRP or SGL command structures may point to a very diverse set of host memory locations, all of which require address translation. In one embodiment, address translations for one or more of the upcoming memory addresses may be requested while the data transfer for the current memory location is being processed. In this way, the translations may be filled into the ATC before the data transfers for those memory addresses begin.
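The overlap of translation with transfer for a long PRP/SGL list can be sketched as below. This is an assumption-laden illustration: segments are modeled as a flat list of virtual addresses rather than a real linked PRP/SGL structure, and the one-ahead prefetch distance is a choice for the example.

```python
# Illustrative sketch of overlapping translation with transfer for a
# long PRP/SGL list: while segment i is being transferred, the
# translation for segment i+1 is requested so it is already in the ATC
# when its own transfer begins. All names are hypothetical.

def transfer_list(segments, atc, request_translation):
    """Transfer each segment, prefetching the next segment's translation."""
    transferred = []
    for i, virt in enumerate(segments):
        if virt not in atc:                      # first segment: cold miss
            atc[virt] = request_translation(virt)
        # Kick off translation for the upcoming segment before the
        # current data transfer starts.
        if i + 1 < len(segments):
            nxt = segments[i + 1]
            if nxt not in atc:
                atc[nxt] = request_translation(nxt)
        transferred.append(atc[virt])            # DMA using cached address
    return transferred
```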
FIG. 1 is a block diagram of a system for pre-populating an address cache, according to one embodiment. The system includes a host computing device 100 coupled to one or more endpoints, such as, for example, one or more storage devices 102a-102c (collectively 102). Hereinafter, the terms endpoint and storage device 102 will be used interchangeably.
Endpoints may communicate with the host device 100 through a network fabric such as, for example, a Peripheral Component Interconnect (PCI) or PCI Express (PCIe) bus. Thus, the endpoints in the illustrated example may also be referred to as PCIe devices. In some cases, the endpoints may also include an endpoint 101 integrated into the host device 100. NVMe is a protocol that is typically carried over PCIe. In some embodiments, endpoints may communicate over communication links other than PCIe, including, for example, Compute Express Link (CXL), NVMe over Fabrics, Serial Attached SCSI (SAS), Serial Advanced Technology Attachment (SATA), Cache Coherent Interconnect for Accelerators (CCIX), and the like.
The host device 100 may write data to and read data from the storage device 102 over a PCIe fabric. The storage device 102 may perform direct access to the local memory 104 of the host device 100 while processing read and write commands. Direct memory access may allow data to be transferred to and from host memory 104 without involving software of processor 105 of host device 100.
In one embodiment, the host device 100 further includes a memory management unit 106, which may include a Translation Agent (TA) 108 and an Address Translation and Protection Table (ATPT) 110. The TA 108 may provide address translation services to endpoints (e.g., the storage device 102) to translate virtual addresses to real physical addresses of the host memory 104. In one embodiment, the TA 108 may retrieve a physical address corresponding to the virtual address from the ATPT 110.
The host device 100 may also include an interface 112, such as a PCIe interface. The PCIe interface may implement a Root Complex (RC) for connecting the processor 105 and the host memory 104 to the PCIe fabric. The interface 112 may include one or more ports 114 to connect one or more endpoints (e.g., the storage device 102) to the RC 112. In some cases, an endpoint may be coupled to a switch 116, which is in turn connected to the RC 112 via one of the ports 114. In some embodiments, the TA 108 may be integrated into the RC 112.
In one embodiment, a message sent from the storage device 102 to the TA 108, such as, for example, a request for address translation, is delivered to the RC 112 over the PCIe fabric, and the RC 112 in turn delivers the request to the TA 108. Messages sent from the TA 108 to the storage device 102, such as, for example, a response to a request from the storage device 102, are delivered from the TA 108 to the RC 112, and the RC 112 sends the message to the storage device 102 over the PCIe fabric.
In one embodiment, at least one storage device 102 includes an Address Translation Cache (ATC) 118a, 118b, or 118c (collectively 118) for storing mappings between virtual (untranslated) addresses and physical (translated) addresses. When the storage device 102 receives a read/write command with a virtual address from the host 100, the storage device 102 may check the local ATC 118 to determine whether the cache already contains the translated address. If the ATC 118 already contains the translated address, the storage device 102 may efficiently access the host memory 104 at the physical address without involving the RC 112 and the TA 108. If the ATC 118 does not contain the translated address, the storage device 102 may send an address translation request to the TA 108 with the virtual address (or address range) to be translated.
FIG. 2 is a block diagram of a storage device 102 according to one embodiment. The storage device 102 may include a communication interface 200, a device controller 202, an internal memory 204, and a non-volatile memory (NVM) medium 206. The communication interface 200 may include PCIe ports and endpoints that implement a communication portal from the host 100 to the storage device 102, and a communication portal from the storage device 102 to the host 100. In one embodiment, the communication interface 200 stores the ATC 118 with one or more virtual address to physical address mappings.
In one embodiment, the device controller 202 executes commands requested by the host 100, such as read and write commands, for example. The device controller 202 may include, but is not limited to, one or more processors 208 and a media interface 210. The one or more processors 208 may be configured to execute computer-readable instructions for processing commands from the host 100 and for managing the operation of the storage device 102. The computer readable instructions executed by the one or more processors 208 may be, for example, firmware code.
In one example, the one or more processors 208 may be configured to process write or read commands to or from the NVM media 206. The one or more processors 208 may interact with the NVM media 206 via the media interface 210 to implement write or read actions. The NVM media 206 may include one or more types of non-volatile memory, such as, for example, flash memory (e.g., NAND), ReRAM, PCM, or MRAM. The storage device may also be an HDD with a different internal controller architecture.
In one embodiment, the internal memory 204 is configured as short-term storage or temporary memory during operation of the storage device 102. The internal memory 204 may include DRAM (dynamic random access memory), SRAM (static random access memory), and/or DTCM (data tightly coupled memory). The internal memory 204 may also include buffers, such as a read buffer 212 and a write buffer 214, for temporarily storing data transferred to and from the host memory 104 while processing read and write commands. The internal memory 204 may also include an internal write queue 216 for storing write commands received from the host 100 that cannot be processed immediately due to, for example, lack of space in the write buffer 214. The internal write queue 216 may be, for example, a first-in, first-out (FIFO) queue.
In one embodiment, the controller 202 is configured to identify/receive (e.g., via a fetch action) from the host 100 an SQE with a read/write command. The number of read/write commands (commands in flight) provided by the host to the controller 202 at a time may be referred to as the queue depth (QD) of the commands. In a low QD environment (e.g., QD = 1, where the host provides one outstanding command to the drive at a time), the latency of processing a command (e.g., a read command) may be predictable. For a low read QD environment, the controller 202 may set a timer to trigger transmission of a request for address translation via the communication interface 200 when the timer expires. In one embodiment (e.g., with a sufficiently large ATC 118), the controller 202 may set the length of the timer to 0. A timer with a value of 0 causes the request for address translation to be sent without delay when the read command is received by the controller 202. This implementation takes advantage of low QD by assuming low ATC utilization, which reduces the probability that a cache entry will be evicted by the time a translation is needed.
In one embodiment, the controller 202 sets the length of the timer to be shorter than the expected read latency. For example, if the predicted read latency is 50 microseconds, the timer may be set to expire at 45 microseconds. The prediction may be the result of early characterization or modeling work, or it may be the result of ongoing measurements by the drive. The address translation request may be sent from the device to the host upon expiration of the 45-microsecond timer, and the host may complete the address translation just in time, e.g., at approximately 50 microseconds, for performing a direct memory access at the translated address.
In one embodiment, the length of the timer varies as a function of the QD of read commands and/or write commands. The controller 202 may calculate the length of the timer prior to setting the timer, based on the environment in which the timer is deployed. In one exemplary function, the higher the QD, the longer the length of the timer. In one embodiment, the length of the timer varies as a function of one or more of: the number of dies in the NVM media 206, operation time on the dies, operation types on the dies, time spent in the controller 202, resource conflicts in the controller 202, and the like.
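By way of illustration only, such a function might derive the timer length from the QD and a predicted per-command latency. The sketch below is a hypothetical example; the function name, constants, and the linear scaling with QD are assumptions for illustration and are not part of the disclosed embodiments.

```python
# Hypothetical sketch: derive the pre-translation timer length from the queue
# depth (QD) and a predicted read latency. Constants are illustrative only.

def timer_length_us(queue_depth, predicted_read_latency_us=50, margin_us=5,
                    low_atc_utilization=False):
    """Return the delay, in microseconds, before requesting a translation."""
    if low_atc_utilization:
        # Large or lightly used ATC: request the translation immediately.
        return 0
    # Expire slightly before the command is predicted to complete; a higher
    # QD stretches the effective latency, so the timer scales with QD.
    return max(0, queue_depth * predicted_read_latency_us - margin_us)
```

With QD = 1 and a 50-microsecond prediction, this yields the 45-microsecond timer of the example above; higher QDs yield proportionally longer timers.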
In one embodiment, the controller 202 may set a second timer, in addition to the first timer, for triggering a second address translation. For example, the second timer may be set based on an event (hereinafter referred to as an expected event) that can occur during the read process. The event may be, for example, the invocation of an action outside the normal read flow, such as, for example, a second stage error recovery action. The recovery action may increase the latency of the read flow, which may cause the translated address to be evicted from the ATC 118. In this case, the controller 202 may send an address translation request via the communication interface 200, to put the translated address back into the ATC 118, when the second timer expires.
In one embodiment, the risk of address translations being evicted from the ATC may be high, and the cost of checking for the presence of an address translation in the ATC may be low. Thus, the first timer proceeds as described above, and a second timer is set based on the predicted/expected delay of the secondary expected event. For example, the length of the second timer may be equal to the length of the read (50 us, as described in the previous example) plus the length of time needed to reconstruct the data of a failed read from other locations. For example, some SSDs may read several other locations and perform RAID recovery on the lost data. The length of time for RAID recovery and the length of time for a normal read are added together to set the value of the second timer. This means that the first timer may have triggered an address translation, but the address translation may have been evicted from the ATC. In the event that it is evicted, the second timer triggers a new request for the address in the ATC. In the case where the address translation is still in the ATC, the second request for address translation immediately returns successfully because the address translation is present in the cache.
In one embodiment, the timing of address translation requests for read commands is based on reaching one or more milestones in the read process flow. In this regard, the controller 202 monitors the progress of the read process flow and determines whether a progress milestone/step has been reached. The progress milestone may include, for example, completion of read sensing (e.g., a DONE status) in response to a polling event sent from the controller to a NAND die in the storage device 102.
Pre-filling of the ATC may also be desirable for handling write commands in a fairly high QD environment (e.g., QD > 50). For example, with an unobstructed stream of write commands, there may be a long queue of commands on the host's submission queue. A command may be brought into the controller and parsed. Commands found to be write commands during the parse phase are routed to the write processing pipeline and started as early as possible, and the write data provided with the command is transferred into an internal buffer. The write command is then acknowledged as complete. The write data continues to reside in the volatile buffer and is programmed to the NAND for non-volatile storage at a pace independent of the write completion. In the event of a power loss, capacitors hold enough energy for the SSD to finish programming the volatile cache contents to the NAND.
However, a write command may be very long, with a very large amount of data, or a very large number of write commands may be sent to the drive at one time. Either condition can cause the controller to run out of buffer space. In this case, the write completion is held back by the controller, and the write is managed on an internal queue while it waits for sufficient space in the write buffer 214. Once all the write data is in the drive, in either case, the write completion is sent to the host. Thus, a large write command may program its initial data out of the buffer while a later portion of the command waits for buffer space to be freed.
More specifically, in processing write commands, the controller 202 can store data transferred from the host memory 104 in the write buffer 214 before it is transferred into the NVM media 206. In a relatively high QD environment in which multiple write commands are handled at a given time, one or more of the write commands may need to wait for the write buffer 214 to become available before processing can proceed. In one embodiment, multiple write command SQEs, or possibly one or more large write command SQEs, provided by the host to the device may require more of the write buffer 214 than is available. One, more than one, or the remainder of the write commands may be placed in the internal write queue 216. In one embodiment, the controller 202 selects a subset of the write commands waiting in the internal write queue 216 and requests address translation for the selected write commands. In this way, the ATC may be pre-filled with the translated addresses of the selected write commands.
Pre-population of the ATC may also be desirable for write commands that reference a relatively long list of memory locations within the host memory 104. The memory locations may be identified in one or more data structures, such as a linked list of SGL segments and/or SGL descriptors. In one embodiment, one or more address translation requests are sent for translating virtual addresses in a first one of the SGL segments and/or SGL descriptors to corresponding physical addresses. The controller 202 may perform a direct memory access of the host memory 104 to retrieve the data stored at the translated physical address. The retrieved data may be stored in the write buffer 214, and the data in the write buffer 214 may be written to the NVM media 206 (e.g., based on logical block address (LBA) information in the received write command).
In the case where a second SGL segment and/or SGL descriptor stores a virtual address of the write command, the controller 202 may send an address translation request for the virtual address in the second SGL segment, pre-populating the ATC with the translated address while the first direct memory access occurs based on the translated address of the first SGL segment. The early address translation may allow a second direct memory access (based on the translated address of the second SGL segment) to follow the first direct memory access without waiting for the address translation of the virtual address in the second SGL segment to complete. In one embodiment, the ATC continues to be pre-filled for the next SGL segment while the write command is processed based on the current SGL segment, until the last SGL segment has been processed.
In another embodiment, PRP segments may be processed in a manner similar to SGL segments and/or SGL descriptors. Each PRP segment pointing to another PRP segment has an address that needs to be translated, and the controller 202 may request the address translation before the translation is needed. Alternatively, one or more PRP segments may be read into the controller 202 early to begin parsing the PRPs; this early reading also enables the controller 202 to make potential early address translation requests.
Figs. 3A-3B are flowcharts of a read flow without ATC pre-filling. The flow begins, and in act 280, the controller 202 receives and parses an SQE containing a read command.
In act 282, the controller 202 identifies the location in the NVM media 206 from which to retrieve the read data. For example, the controller 202 may perform a lookup in a logical-to-physical translation table in the internal memory 204 of the storage device to identify the physical locations of the LBAs from which the read data is to be retrieved.
In act 284, a read sense command is issued to NVM medium 206.
In act 286, read sensing is performed and the read data is loaded into latches of the NVM media.
In act 288, the data in the latches is transferred to the controller 202.
In act 290, the controller 202 invokes an error recovery action, such as, for example, ECC decoding, to correct errors in the read data.
In act 292, the controller 202 invokes an optional second stage error recovery action in the event that the error correction in act 290 was unsuccessful in correcting all errors. The second stage error recovery action may be, for example, RAID-like outer ECC decoding.
In act 294, the read data may be decrypted and any scrambling of the data may be reversed in act 296.
In act 297, the controller 202 sends an address translation request to the translation agent 108 in the host device 100 and stores the translated address in the ATC 118.
In act 298, the controller 202 transmits the data to the host device 100. In this regard, the controller 202 retrieves from the ATC 118 the translated address at which the data is to be stored, and stores the data in the location of the host memory 104 identified by the translated address.
In act 299, the controller 202 posts a completion queue entry (CQE) to a completion queue stored in the host memory 104 and alerts the host processor 105 that the read request is complete. This may be accomplished, for example, with an interrupt to the host device 100.
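By way of illustration only, the acts above can be summarized as a single sequence. The sketch below is a hypothetical simulation (the function and step names are illustrative, not taken from the disclosure); it highlights that in this baseline flow the address translation request of act 297 is issued only after the media read and error recovery complete.

```python
def baseline_read_flow(nvm_read, translate, dma_to_host, send_cqe,
                       needs_second_stage=False):
    """Simulate the read flow of Figs. 3A-3B; returns the ordered step log."""
    log = []
    data = nvm_read(); log.append("media_read")        # acts 282-288
    log.append("ecc_decode")                           # act 290
    if needs_second_stage:
        log.append("second_stage_recovery")            # act 292 (optional)
    log.append("decrypt_descramble")                   # acts 294-296
    paddr = translate(); log.append("translate")       # act 297
    dma_to_host(paddr, data); log.append("dma")        # act 298
    send_cqe(); log.append("cqe")                      # act 299
    return log
```

Because the translation of act 297 sits at the end of the sequence, its latency adds directly to the command completion time, which is the cost the timer-based pre-filling of the following figures avoids.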
Fig. 4 is a flow diagram of a process for pre-populating the ATC 118 based on a timer, according to one embodiment. The process begins, and in act 300, the controller 202 receives and parses an SQE containing a read command from the host 100. The SQE may be stored, for example, in a submission queue in the host device 100. The read command may follow a communication protocol such as, for example, the NVMe protocol. The read command may identify, for example, an address (e.g., a starting LBA) of the NVM media 206 from which the data is to be read, an amount of data to be read, and a virtual address in the host memory 104 to which the data is to be transferred via DMA.
The controller 202 starts a timer in act 302. The length of the timer may be shorter than the typical/expected delay of a read command.
In one embodiment, while the timer is started in act 302, controller 202 also initiates and executes a read flow in act 312 in accordance with the received read command. The read flow may be similar to a portion of the read flow of fig. 3A. For example, the read flow of act 312 may implement acts 282-292 of FIG. 3A.
While the read flow is executing, the controller 202 checks in act 304 whether the timer has expired. In one embodiment, the timer is set to expire before the read flow of act 312 is completed.
If the timer has expired, the controller 202 checks the ATC 118 in act 306 to determine if it contains a translation of the virtual address identified in the read command. If the address translation is already in the cache, then the request for address translation has been satisfied and actions 308 and 310 may be skipped.
Referring again to act 306, if the ATC 118 does not already contain a translation of the virtual address, the controller 202 submits an address translation request for the virtual address to the TA 108 in act 308. The TA 108 may search the ATPT 110 for the virtual address and output a corresponding physical address.
In act 310, the translated physical address is stored in the ATC 118 in association with the corresponding virtual address.
In act 313, the controller 202 waits for the read flow of act 312 to complete. If the read flow of act 312 takes longer than expected, the translated address may have been evicted from the ATC 118. Accordingly, in act 314, the controller 202 checks to determine whether the ATC 118 still contains the translation of the virtual address.
If the answer is no, the controller 202 submits an address translation request for the virtual address to the TA 108 in act 316.
In act 318, controller 202 stores the translated address in ATC 118 in association with the corresponding virtual address.
In act 320, the controller marks the ATC entry containing the virtual address as busy. Marking the entry busy may prevent the translation from being invalidated while the device is using the AT, avoiding a race condition. In this regard, the host may elect to move a host memory page for one or more reasons. When the host moves the page, the host communicates with the TA, and the TA may update the ATPT. The TA may broadcast an invalidation packet to one or more endpoints with an ATC. The packet tells the endpoint device to invalidate the address or address range, and the device removes the address from its ATC. If the device subsequently needs the translation, a new request to the TA is needed.
When the ATC entry is marked busy, the ATC retains the old, invalidated AT, allowing the device to use the stale value for the ongoing data transfer. The host is responsible for resolving the race condition in a vendor-specific manner. Upon completion of the AT usage, the ATC component on the device evicts the AT from the cache and completes the invalidation request from the TA.
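The busy-marking rule can be sketched with a toy ATC model. The dictionary entry layout and function names below are assumptions for illustration only; they are not the disclosed implementation.

```python
def invalidate(atc, vaddr):
    """Handle an invalidation from the TA: defer it if the entry is in use."""
    entry = atc.get(vaddr)
    if entry is None:
        return "done"
    if entry["busy"]:
        entry["pending_invalidate"] = True   # stale AT stays usable for the
        return "deferred"                    # ongoing data transfer
    del atc[vaddr]
    return "done"

def dma_complete(atc, vaddr):
    """On completion of AT usage, evict and finish a deferred invalidation."""
    entry = atc[vaddr]
    entry["busy"] = False
    if entry.pop("pending_invalidate", False):
        del atc[vaddr]   # evict the AT and complete the invalidation request
```

The key design point is that the invalidation is acknowledged to the TA only after the in-flight DMA that uses the stale translation has drained.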
In act 322, the read data is transferred to the translated physical address in host memory 104.
In act 324, the controller 202 submits a CQE to the host to indicate completion of the read command.
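By way of illustration only, the timer-driven flow of acts 300-324 can be approximated in a few lines. The sketch below is a simplified simulation with hypothetical parameter names; it counts how many translation requests the flow issues, including the re-check of acts 313-318 after the read flow completes.

```python
def timer_prefill_read(timer_us, read_flow_us, atc, vaddr, translate,
                       evicted_during_read=False):
    """Count translation requests in a Fig. 4-style timer-based flow."""
    requests = 0
    # Acts 304-310: the timer expires before the read flow finishes,
    # triggering an early translation request if the ATC misses.
    if timer_us <= read_flow_us and vaddr not in atc:
        atc[vaddr] = translate(vaddr)
        requests += 1
    if evicted_during_read:            # e.g., a slow read outlived the entry
        atc.pop(vaddr, None)
    # Acts 313-318: re-check once the read flow completes.
    if vaddr not in atc:
        atc[vaddr] = translate(vaddr)
        requests += 1
    return requests
```

In the common case the early request is the only one; a second request is needed only if the entry was evicted while the read flow was still running.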
Fig. 5 is a flow diagram of a process for populating ATC 118 with translated addresses according to one embodiment. The process of fig. 5 may be performed, for example, as part of the execution of acts 306-308 or 314-316 of fig. 4.
In act 400, the controller 202 determines whether the virtual address of the read command is in the ATC 118. If the answer is no, the controller 202 submits an address translation request for the virtual address to the TA 108 in act 402.
In act 404, the TA 108 responds to the controller 202 with the translated address or a translation failure.
In act 406, it is checked whether the translation was successful. If the answer is no (meaning a translation failure response), the TA did not find the address in the ATPT, and the page with the requested memory address does not exist in the host memory 104. In this case, a PRI request is sent to the TA 108 in act 408.
The TA 108 performs a host-specific operation to populate the ATPT with the host memory address, and the TA 108 completes the PRI in act 410.
In act 412, the controller 202 submits another address translation request for the virtual address to the TA 108.
In act 414, the translated address is received by the controller for storage in the ATC 118.
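Acts 400-414 amount to a translate-with-retry loop around a PRI fallback. A minimal sketch, with hypothetical callback names (a `None` return from the TA stands in for the translation-failure response):

```python
def fill_atc(vaddr, atc, ta_translate, pri_request):
    """Request a translation; on failure, issue a PRI request and retry."""
    if vaddr in atc:                     # act 400: already cached
        return atc[vaddr]
    paddr = ta_translate(vaddr)          # acts 402-404
    if paddr is None:                    # act 406: translation failure
        pri_request(vaddr)               # acts 408-410: host populates ATPT
        paddr = ta_translate(vaddr)      # act 412: retry the translation
    atc[vaddr] = paddr                   # act 414: store in the ATC
    return paddr
```

The design choice here is that the PRI round trip is paid at pre-fill time, before the data is ready, rather than on the critical path of the DMA.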
Fig. 6 is a flowchart of a process for pre-populating the ATC 118 based on at least two timers, according to one embodiment. The process begins, and in act 450, the controller 202 receives and parses an SQE containing a read command.
In act 452, the controller 202 starts a first timer having a first expiration value/length and a second timer having a second expiration value/length. The first timer may be set as described with reference to act 302 of fig. 4.
In one embodiment, the length of the second timer is longer than the length of the first timer. The second timer may be set based on a potential event that may occur during execution of the read flow and delay its completion. The potential event may be, for example, a second stage error recovery action (e.g., a second stage of ECC decoding) that may be initiated if the first stage ECC decoding fails to correct errors in data retrieved from the NVM media 206. In one embodiment, such an event has a predicted/expected delay, and in act 452, the controller 202 may set the second timer based on the predicted delay.
Simultaneously with the start of the first timer and the second timer, a read flow may be initiated and executed in act 470 in accordance with the received read command. The read flow may be similar to a portion of the read flow of fig. 3A. For example, the read flow of act 470 may implement acts 282-292 of FIG. 3A.
While the read flow is executing, the controller 202 checks in actions 454 and 462 whether the first timer and the second timer, respectively, have expired. In one embodiment, the first timer expires first before the read flow of act 470 is completed.
If the first timer has expired, the controller 202 checks the ATC 118 in act 456 to determine if it contains a translation of a virtual address.
If the ATC 118 does not contain a translation of the virtual address, then in act 458 the controller 202 submits an address translation request for the virtual address to the TA 108. The TA 108 may search the ATPT 110 for the virtual address and output a corresponding physical address.
In act 460, the translated physical address is stored in the ATC 118 in association with the corresponding virtual address.
In one embodiment, an event is detected during execution of the read flow in act 470. The event may be, for example, the execution of a second stage error recovery action (e.g., an outer ECC decoding action), which may delay the execution of the read flow of act 470. The delay in execution may cause the translated address to be evicted from the ATC 118.
In one embodiment, the second timer set in act 452 allows the ATC to be pre-filled again if the translated address has been evicted from the ATC 118. The second timer may be set to expire substantially near the end of the second stage error recovery action. In this regard, in act 462, a determination is made as to whether the second timer has expired.
If the answer is yes, then a determination is made in act 464 whether the translated address is still in ATC 118.
If the answer is no, the controller 202 submits an address translation request for the virtual address to the TA 108 in act 466. The TA 108 may search the ATPT 110 for the virtual address and output a corresponding physical address.
In act 468, the translated physical address is stored in ATC 118 in association with the corresponding virtual address.
When the read flow of act 470 completes execution, another check is made in act 472 as to whether the translated physical address is still in the ATC 118. The read flow of act 470 may have varied in duration; for example, it may have included second stage ECC recovery, which can take a very long time.
If the answer is no, the controller 202 submits an address translation request for the virtual address to the TA 108 in act 474.
In act 476, the controller 202 stores the translated address in association with the corresponding virtual address in the ATC 118.
In act 478, controller 202 marks the ATC entry containing the virtual address as busy.
In act 480, the read data (e.g., temporarily stored in read buffer 212) is transferred to the translated physical address in host memory 104.
In act 482, the controller 202 submits a CQE to the host to indicate completion of the read command.
It should be appreciated that when the read command completes and the DMA of the data uses the translated address in the ATC 118, the address in the ATC may have come from the successful use of the first timer. In this case, the second timer may not be needed. In one embodiment, the second timer is stopped/deleted so that the additional address translation request of act 466 is not invoked. In other embodiments, the second timer is allowed to continue running, and the redundant second request is allowed to proceed.
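By way of illustration only, the two-timer behavior reduces to two checkpoint times at which the ATC is re-checked before a request is issued. The sketch below is an illustrative simulation with hypothetical parameter names, not the disclosed implementation.

```python
def two_timer_prefill(first_timer_us, second_timer_us, atc, vaddr, translate,
                      eviction_times_us=()):
    """Count translation requests issued at the two timer expirations."""
    requests = 0
    for now in sorted((first_timer_us, second_timer_us)):
        # Apply any evictions (e.g., from ATC pressure) up to this point.
        if any(t <= now for t in eviction_times_us):
            atc.pop(vaddr, None)
        if vaddr not in atc:             # re-check before requesting again
            atc[vaddr] = translate(vaddr)
            requests += 1
    return requests
```

When no eviction intervenes, only the first timer produces a request; if the entry is evicted between the two expirations (e.g., during a slow second stage recovery), the second timer re-populates the ATC.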
Fig. 7 is a flowchart of a process for pre-populating ATC 118 based on monitoring the progress of a read process flow, in accordance with one embodiment. The process begins and in act 500, the controller 202 identifies a read command from the host 100 for the storage device 102.
In act 502, the controller 202 monitors the execution of the read command. For example, the controller 202 may monitor the progress of milestones/steps implemented during the read flow of act 503. The read flow may be similar to a portion of the read flow of fig. 3A. For example, the read flow of act 503 may implement acts 282-292 of FIG. 3A.
While the read flow is executing, controller 202 checks in act 504 whether a progress milestone has been reached. A milestone may be, for example, the detection of a successful completion of a sense signal by a sense amplifier. In some embodiments, the milestone may be another step prior to detecting the successful completion of the sensing signal.
If the answer is yes, then a determination is made in act 505 as to whether the translated address is in the ATC 118. If the answer is no, the controller 202 sends a request to the TA 108 for address translation of the virtual address contained in the read command in act 506. The TA 108 may search the ATPT 110 for the virtual address and output a corresponding physical address.
In act 508, the translated physical address is stored in ATC 118.
In act 510, the controller 202 waits for the read flow of act 503 to complete. If the read flow of act 503 takes longer than expected, the translated address may have been evicted from the ATC 118. Accordingly, in act 512, the controller 202 checks to determine whether the ATC 118 still contains the translation of the virtual address.
If the answer is no, the controller 202 submits an address translation request for the virtual address to the TA 108 in act 514.
In act 516, the controller 202 stores the translated address in association with the corresponding virtual address in the ATC 118.
In act 518, controller 202 marks the ATC entry containing the virtual address as busy.
In act 520, the read data is transferred to the translated physical address in host memory 104.
In act 522, controller 202 submits a CQE to the host to indicate completion of the read command.
It should be appreciated that one or more timers may be set with respect to a progress milestone. For example, the controller 202 may start a timer (e.g., 30 microseconds) after the step of commanding read sensing to the NAND die has been reached.
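The milestone-driven flow of acts 500-522 can be sketched as follows. The step names, milestone label, and callback signatures below are hypothetical, chosen only to illustrate requesting the translation when the milestone is observed rather than on a fixed timer.

```python
def milestone_prefill(read_steps, milestone, atc, vaddr, translate):
    """Request the translation once a progress milestone is observed."""
    for step in read_steps:              # acts 502-504: monitor the flow
        if step == milestone and vaddr not in atc:
            atc[vaddr] = translate(vaddr)    # acts 505-508: pre-fill the ATC
    if vaddr not in atc:                 # acts 512-516: final check before DMA
        atc[vaddr] = translate(vaddr)
    return atc[vaddr]
```

Anchoring the request to a milestone such as read-sense completion ties the pre-fill to actual progress, so a slow media read naturally delays the request instead of letting a fixed timer fire too early.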
Fig. 8 is a flow diagram of a process for pre-populating the ATC 118 in a relatively high QD environment in which multiple write commands can be processed at a given time, according to one embodiment. The process begins, and in act 600, the controller 202 identifies a plurality of write commands from the host 100 to the storage device 102. The write commands may need to wait for the write buffer 214 to become available before their processing can continue. While the write commands are waiting, the controller 202 may place them in the internal write queue 216 in the internal memory 204 of the storage device.
In act 602, the controller 202 selects a subset of write commands waiting in the internal write queue 216. The selected subset may be less than the total number of write commands stored in the queue 216. The selected write commands may be in any order.
In act 604, the controller 202 identifies a virtual address in the subset of the selected write commands.
In act 606, the controller 202 sends one or more address translation requests for the identified virtual address to the TA 108.
In act 608, the controller 202 receives the translated physical addresses from the TA 108 and stores the physical addresses in the ATC 118 in association with the corresponding virtual addresses. Because of the pre-filling of the ATC 118, address translation for one of the write commands in the internal write queue 216 can be bypassed when the command is retrieved from the queue and processed.
In act 610, the controller 202 transfers data according to one or more write commands in the internal write queue 216 based on the physical address in the ATC 118.
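By way of illustration only, acts 600-610 can be sketched with a toy FIFO of queued write commands. The dictionary layout, subset size, and function name below are illustrative assumptions.

```python
from collections import deque

def prefill_queued_writes(write_queue, atc, translate, subset_size=2):
    """Pre-translate the addresses of a subset of queued write commands."""
    for cmd in list(write_queue)[:subset_size]:   # act 602: select a subset
        vaddr = cmd["vaddr"]                      # act 604: identify address
        if vaddr not in atc:
            atc[vaddr] = translate(vaddr)         # acts 606-608: pre-fill
    return atc
```

The commands remain queued waiting for buffer space; only their translations are fetched ahead of time, so the later DMA of act 610 can proceed without a translation round trip.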
FIG. 9 is a flowchart of a process for pre-populating the ATC for write commands that invoke a relatively long list of memory locations, according to one embodiment. The process begins, and in act 700, the controller 202 identifies a write command from the host 100. The write command may include, for example, an address (e.g., a starting LBA) of the NVM media 206 at which the data is to be written, an amount of data to be written, and a first SGL segment containing a first virtual address in the host memory 104 at which the data to be written is stored, along with a pointer to a second SGL segment.
In act 702, controller 202 identifies a first virtual address from a first SGL segment.
In act 704, the controller 202 sends an address translation request for the first virtual address and stores the translated first physical address in association with the first virtual address in the ATC 118.
In act 706, the controller 202 participates in the data transfer based on the translated first address. The data transfer may include accessing the host memory 104 to read data from the translated first address, storing the read data in the write buffer 214, and writing the data in the write buffer 214 to a location of the NVM media 206 based on a starting LBA in the write command.
In act 708, concurrently with the data transfer from the translated first address, controller 202 identifies a second SGL segment pointed to by the first SGL segment and also identifies a second virtual address in the second SGL segment.
In act 710, the controller 202 sends an address translation request for the second virtual address and stores the translated second physical address in association with the second virtual address in the ATC 118.
In act 712, the controller 202 participates in the data transfer based on the translated second address. Although not explicitly shown in the flow of FIG. 9, those skilled in the art will appreciate that the ATC continues to be pre-filled for the next SGL segment while the write command is processed based on the current SGL segment, until the last SGL segment has been processed.
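The pipelined SGL walk of acts 700-712 can be sketched as follows. The segment layout and callback names are hypothetical; the point illustrated is that the next segment's translation is requested before the current segment's DMA begins.

```python
def write_via_sgl(segments, atc, translate, dma_read):
    """Transfer each SGL segment while pre-translating the next segment."""
    for i, seg in enumerate(segments):
        vaddr = seg["vaddr"]
        if vaddr not in atc:                    # acts 702-704: translate first
            atc[vaddr] = translate(vaddr)
        if i + 1 < len(segments):               # act 708-710: pre-fill for the
            nxt = segments[i + 1]["vaddr"]      # next segment ahead of time
            if nxt not in atc:
                atc[nxt] = translate(nxt)
        dma_read(atc[vaddr], seg["length"])     # acts 706 / 712: DMA transfer
    return atc
```

Overlapping the translation of segment N+1 with the DMA of segment N keeps the DMAs back to back, which is the latency benefit described above.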
It should be appreciated that each time the controller 202 requests an address translation for the ATC, there is an option to update, or not update, the ATC cache eviction scheme. For example, the first request in each flowchart may update the cache eviction scheme, while a second or later request for address translation may not be considered by the eviction scheme. This may enable, for example, only one cache eviction scheme update per command.
In some embodiments, the methods discussed above are implemented in one or more processors. The term processor may refer to one or more processors and/or one or more processing cores. The one or more processors may be hosted in a single device or distributed across multiple devices (e.g., on a cloud system). A processor may include, for example, an Application Specific Integrated Circuit (ASIC), a general purpose or special purpose Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Graphics Processing Unit (GPU), or a programmable logic device such as a Field Programmable Gate Array (FPGA). In a processor, as used herein, each function is performed either by hardware configured, i.e., hardwired, to perform that function, or by more general purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium (e.g., memory). A processor may be fabricated on a single Printed Circuit Board (PCB) or distributed over several interconnected PCBs. A processor may contain other processing circuitry; for example, the processing circuitry may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.
It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section without departing from the spirit and scope of the present inventive concept.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concepts. Moreover, unless explicitly stated, the embodiments described herein are not mutually exclusive. Aspects of the embodiments described herein may be combined in some implementations.
With respect to the flowcharts of figs. 3A to 9, it should be appreciated that the order of the steps of the processes in these flowcharts is not fixed, but can be modified, changed in order, performed differently, performed sequentially, concurrently, or simultaneously, or changed to any desired order, as would be recognized by a person of skill in the art.
As used herein, the terms "equivalent," "about," and similar terms are used as approximate terms, rather than degree terms, and are intended to take into account the inherent deviations in measured or calculated values that one of ordinary skill in the art would recognize.
As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. An expression such as "at least one of," when preceding a list of elements, modifies the entire list of elements without modifying the individual elements of the list. Furthermore, the use of "may" in describing embodiments of the inventive concepts refers to "one or more embodiments of the present disclosure." Furthermore, the term "exemplary" is intended to refer to an example or illustration. As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilize," "utilizing," and "utilized," respectively.
It will be understood that when an element or layer is referred to as being "on," "connected to," "coupled to" or "adjacent to" another element or layer, it can be directly on, connected to, coupled to or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," "directly connected to," "directly coupled to," or "directly adjacent to" another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges subsumed with the same numerical precision within that range. For example, a range of "1.0 to 10.0" is intended to include all subranges between (and inclusive of) the minimum value of 1.0 and the maximum value of 10.0, i.e., having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, e.g., 2.4 to 7.6. Any maximum numerical limitation described herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation described herein is intended to include all higher numerical limitations subsumed therein.
While exemplary embodiments of systems and methods for pre-populating an address cache have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that systems and methods for pre-populating an address cache constructed according to principles of this disclosure may be embodied other than as specifically described herein. The disclosure is defined in the following claims, and equivalents thereof.

Claims (20)

1. A method for processing commands from a host computing device to a storage device, the method comprising:
identifying, by the storage device, a command from the host computing device, the command comprising a logical address;
detecting a condition;
requesting, by the storage device, a translation of the logical address to a physical address based on the detected condition;
storing, by the storage device, the physical address in a cache; and
transferring data according to the command based on the physical address.
2. The method of claim 1, further comprising:
starting, by the storage device, a first timer in response to identifying the command, wherein the condition includes detecting expiration of the first timer.
3. The method of claim 2, wherein a length of the first timer is shorter than a predicted delay of processing the command.
4. The method of claim 2, wherein a length of the first timer is dynamically calculated based on a number of active commands to be processed.
5. The method of claim 2, further comprising:
setting, by the storage device, a second timer;
detecting expiration of the second timer; and
requesting, by the storage device, the translation of the logical address to the physical address based on expiration of the second timer, wherein a length of the second timer is based on an expected length of an event during processing of the command by the storage device.
6. The method of claim 5, wherein the event comprises invoking an error recovery action.
7. The method of claim 1, wherein the command comprises a command to read data from the storage device, and wherein the logical address is for storing the data in a memory location of the host computing device.
8. The method of claim 1, further comprising:
the progress of execution of the command is monitored, wherein detecting the condition includes determining that a milestone has been reached.
9. The method of claim 8, wherein the monitoring comprises monitoring a plurality of steps performed for the command.
10. The method of claim 8, wherein the milestone comprises completing sensing of signals from one or more memory cells of a storage device.
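The timer-driven flow of method claims 1 through 5 can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the names (`TranslationCache`, `translate`, `process_read`), the fixed-offset translation, and the latency figures are hypothetical stand-ins, not the claimed implementation.

```python
import threading
import time

class TranslationCache:
    """Toy stand-in for the device's address translation cache."""
    def __init__(self):
        self._entries = {}

    def store(self, logical, physical):
        self._entries[logical] = physical

    def lookup(self, logical):
        return self._entries.get(logical)

def translate(logical):
    # Hypothetical stand-in for the requested logical-to-physical
    # translation; a fixed offset keeps the example deterministic.
    return 0x1000 + logical

def process_read(logical, cache, predicted_delay=0.2):
    # Claims 2-3: on identifying the command, start a first timer whose
    # length is shorter than the predicted delay of processing the
    # command, so the translation is cached before the transfer needs it.
    timer = threading.Timer(
        predicted_delay / 4,
        lambda: cache.store(logical, translate(logical)))
    timer.start()

    time.sleep(predicted_delay)      # simulate the media access itself

    # Claim 1: the transfer then uses the pre-populated physical address.
    physical = cache.lookup(logical)
    timer.cancel()
    return physical
```

Because the timer length is a fraction of the predicted delay, the translation request completes while the simulated media access is still in flight, which is the point of pre-population.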
11. A storage device, comprising:
a cache; and
a processor coupled to the cache, the processor configured to execute logic that causes the processor to:
identify a command from a host computing device, the command comprising a logical address;
detect a condition;
request a translation of the logical address to a physical address based on the detected condition;
store the physical address in the cache; and
transfer data according to the command based on the physical address.
12. The storage device of claim 11, wherein the logic further causes the processor to:
start a first timer in response to identifying the command, wherein the logic that causes the processor to detect the condition includes logic that causes the processor to detect expiration of the first timer.
13. The storage device of claim 12, wherein a length of the first timer is shorter than a predicted delay of processing the command.
14. The storage device of claim 12, wherein the logic causes the processor to dynamically calculate the length of the first timer based on a number of active commands to be processed.
15. The storage device of claim 12, wherein the logic further causes the processor to:
set a second timer;
detect expiration of the second timer; and
request, based on expiration of the second timer, the translation of the logical address to the physical address, wherein a length of the second timer is based on an expected length of an event during processing of the command by the storage device.
16. The storage device of claim 15, wherein the event comprises invoking an error recovery action.
17. The storage device of claim 11, wherein the command comprises a command to read data from the storage device, and wherein the logical address is for storing the data in a memory location of the host computing device.
18. The storage device of claim 11, wherein the logic further causes the processor to:
the progress of execution of the command is monitored, wherein detecting the condition includes determining that a milestone has been reached.
19. The storage device of claim 18, wherein the logic that causes the processor to monitor comprises logic that causes the processor to monitor a plurality of steps performed for the command.
20. The storage device of claim 18, wherein the milestone comprises completing sensing of signals from one or more memory cells of the storage device.
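The milestone-driven variant of claims 18 through 20 can be sketched the same way. The `Milestone` names, the dict-as-cache, and the injected `translate` callable below are illustrative assumptions, not the claimed design:

```python
from enum import Enum, auto

class Milestone(Enum):
    QUEUED = auto()
    SENSE_COMPLETE = auto()   # claim 20: memory-cell sensing finished
    TRANSFER_READY = auto()

def execute_read(steps, cache, logical, translate):
    # Claim 19: walk the plurality of steps performed for the command.
    # Claims 18/20: when the sense-complete milestone is reached, the
    # condition is detected, so request the translation immediately and
    # store it; the later transfer step finds the address already cached.
    for milestone in steps:
        if milestone is Milestone.SENSE_COMPLETE:
            cache[logical] = translate(logical)
        elif milestone is Milestone.TRANSFER_READY:
            return cache[logical]   # transfer uses the cached address
    return None
```

Anchoring the translation request to a mid-command milestone (rather than a timer) ties pre-population to actual progress, so it adapts automatically when media access runs faster or slower than predicted.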

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/329,755 2022-04-11
US17/879,713 2022-08-02
US17/879,713 US20230325321A1 (en) 2022-04-11 2022-08-02 Systems and methods for pre-populating address translation cache

Publications (1)

Publication Number Publication Date
CN116893985A 2023-10-17

Family

ID=88313973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310375755.9A Pending CN116893985A (en) 2022-04-11 2023-04-10 System and method for pre-populating an address translation cache



Legal Events

Date Code Title Description
PB01 Publication