CN112559381B - NVMe-oriented IO deterministic optimization strategy method - Google Patents


Info

Publication number
CN112559381B
CN112559381B (application CN202011014697.XA)
Authority
CN
China
Prior art keywords: window, time, dtwin, request, pblk
Prior art date
Legal status (an assumption, not a legal conclusion): Active
Application number
CN202011014697.XA
Other languages: Chinese (zh)
Other versions: CN112559381A (en)
Inventor
肖利民
刘禹廷
秦广军
朱金彬
张锐
Current Assignee (the listed assignees may be inaccurate)
Beihang University
Original Assignee
Beihang University
Priority date (an assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Beihang University
Priority claimed from CN202011014697.XA
Publication of CN112559381A
Application granted
Publication of CN112559381B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/0223: User address space allocation, e.g. contiguous or non-contiguous base addressing
    • G06F 12/023: Free address space management
    • G06F 12/0253: Garbage collection, i.e. reclamation of unreferenced memory
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F 2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/72: Details relating to flash memory management
    • G06F 2212/7205: Cleaning, compaction, garbage collection, erase control
    • G06F 2212/7211: Wear leveling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An NVMe-oriented IO determinism optimization strategy method reduces the performance jitter of I/O requests by optimizing on the basis of set and window partitioning, thereby improving garbage-collection efficiency and prolonging the service life of the solid state disk. It is characterized by comprising the following steps: step A, partitioning the NVMe storage units into sets that are mutually independent, so that read, write, and garbage-collection operations can proceed in parallel across different sets; step B, for the set-partitioned NVMe device, proposing a new I/O request scheduling algorithm that avoids conflicts between garbage collection and I/O request access; and step C, designing a new cache management algorithm that is aware of garbage-collection operations in each NVMe set, avoiding as far as possible any conflict between garbage collection and I/O request access and reducing the performance jitter of I/O requests.

Description

NVMe-oriented IO deterministic optimization strategy method
Technical Field
The invention relates to computer science technology, for example Open Channel SSD-based scheduler and cache optimization for NVMe I/O determinism, and in particular to an NVMe-oriented IO determinism optimization strategy method. NVMe (Non-Volatile Memory Express) refers to a non-volatile memory system and its standard or protocol. I/O (or IO) refers to Input/Output. An Open Channel SSD (Solid State Drive) refers to an open-channel solid state drive or solid state disk.
Background
With the development of storage systems, storage media took a great leap forward with the birth of NAND flash memory. Its excellent random read/write speed and steadily falling price per unit of capacity have made it increasingly popular with enterprise users. As solid state disks (SSDs) developed, flash cells grew in bits per cell and flash controllers grew in performance, so the bandwidth limit of the SATA protocol became the main bottleneck constraining sequential SSD read/write performance. The transition from SATA to the NVMe protocol, which runs directly over PCIe, then became necessary. Although NVM devices based on NAND are in constant use and development, their inherent defects are increasingly hard to ignore: compared with traditional block devices such as mechanical hard disks, NAND flash generally has a lower lifetime, and different cell types have different read/write costs. Flash chips have evolved from SLC (Single Level Cell) to today's QLC (Quad-Level Cell), which brings a cliff-like drop in lifetime and speed, and data reliability is ever harder to guarantee; the SSD industry urgently needs excellent management algorithms to minimize the impact of this performance degradation.
Garbage collection is an indispensable operation for NAND solid state disks because of their physical characteristics, but it incurs huge time and space overhead. A garbage collection operation must occupy I/O resources inside the solid state disk and therefore blocks upper-layer I/O requests, which can cause a sudden increase in the read/write delay of some data at certain moments; this is called I/O jitter. Another factor that can introduce I/O jitter is crosstalk between users. As solid state disk capacity keeps growing, that capacity is divided among different users, but because writes are placed randomly, read and write operations of different users may occupy the same channel. When one user is accessing the channel, the accesses of other users must be blocked, and the blocked users likewise see a sharply increased delay.
Current industry approaches to the latency problem include improving garbage-collection efficiency, optimizing the traditional scheduler, and optimizing traditional cache management algorithms. On the garbage-collection side, existing research improves efficiency by dynamically determining the free-page-ratio threshold, shrinking the unit blocked by garbage collection, and building interruptible garbage-collection algorithms, but it cannot fully achieve garbage-collection-free performance. Because the host and the solid state disk are relatively independent, the host cannot know when the disk is in the garbage-collection state, so the tail latency caused by garbage collection cannot be completely eliminated. On the scheduling side, existing research tends to design and improve scheduling algorithms around the parallelism of the solid state disk, including creating separate request queues and request filters for parallel units. On the cache-management side, most existing research focuses on improving the traditional LRU (Least Recently Used) replacement algorithm, reducing the number of cache replacements and improving the cache hit rate.
Disclosure of Invention
Aiming at the NVMe performance jitter caused by flash garbage collection, the invention provides an NVMe-oriented IO determinism optimization strategy method, which reduces the performance jitter of I/O requests by optimizing on the basis of set and window partitioning, thereby improving garbage-collection efficiency and prolonging the service life of the solid state disk.
The technical solution of the invention is as follows:
An NVMe-oriented IO determinism optimization strategy method is characterized by comprising the following steps: step A, partitioning the NVMe storage units into sets that are mutually independent, so that read, write, and garbage-collection operations can proceed in parallel across different sets; step B, for the set-partitioned NVMe device, proposing a new I/O request scheduling algorithm that avoids conflicts between garbage collection and I/O request access; and step C, designing a new cache management algorithm that is aware of garbage-collection operations in each NVMe set, avoiding as far as possible any conflict between garbage collection and I/O request access and reducing the performance jitter of I/O requests.
Step A further comprises using the multiple parallel units present in NVMe to partition the parallel units into mutually independent Set groups, guaranteeing parallelism between the Sets. Each partitioned Set has two different time windows: one is the deterministic delay window, DTWIN, in which no garbage collection or wear-leveling operations are performed; the other is the non-deterministic window, NDWIN, used to maintain the Set's performance. The deterministic delay window guarantees that no I/O blocking caused by any controller operation occurs within the window; the non-deterministic window cannot guarantee low-latency handling of requests because of the series of operations that block I/O. Each partitioned Set group is given an attribute storing the time spent in the current window; a threshold is set on this attribute, and exceeding the threshold acts as a trigger for window switching.
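The per-set attribute and thresholds described above can be sketched as a small C structure. This is an illustrative userspace sketch: the names (pblk_set, win_time, and so on), the field layout, and the time unit are assumptions, not taken from the actual pblk source.

```c
/* The two time windows a Set can be in (illustrative names). */
enum pblk_window { DTWIN, NDWIN };

/* Per-set attributes: the current window, the time spent in it, and
 * the three switching thresholds assigned at initialization. */
struct pblk_set {
    enum pblk_window win;      /* current time window */
    unsigned long win_time;    /* time spent in the current window */
    unsigned long dtwin_min;   /* DTWIN_MIN threshold */
    unsigned long dtwin_max;   /* DTWIN_MAX threshold */
    unsigned long ndwin_max;   /* NDWIN_MAX threshold */
};

/* A newly created set defaults to the deterministic window with a
 * zeroed timer, matching step a2 of link a. */
struct pblk_set pblk_set_init(unsigned long dmin, unsigned long dmax,
                              unsigned long nmax)
{
    struct pblk_set s = { DTWIN, 0, dmin, dmax, nmax };
    return s;
}
```

Exceeding a threshold then acts as the trigger for window switching, as judged by the garbage-collection thread.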
Step B further comprises: the host side sends requests to pblk; after the set characteristics are judged, the scheduling algorithm schedules the request queue, directly dispatching requests whose operations fall in a deterministic delay window and delaying the scheduling of requests whose operations fall in a non-deterministic delay window.
The new cache management algorithm in step C has the following two functions: first, the cache space is divided into cache blocks based on Sets, and each cache block is associated with a Set; second, pblk, which manages the Sets, senses each Set's time window, and when it senses that the corresponding Set is in the non-deterministic time window, that is, the Set may be performing garbage collection, it temporarily avoids evicting data from the cache.
The optimization strategy method comprises the following links: link a, based on a LightNVM device, partitioning pblk objects, i.e. sets, that hold different parallel units (LUNs, logical unit numbers), and assigning different time windows to different sets; link b, the pblk system loads a scheduler and schedules the I/O requests of the file system; and link c, loading the cache management of the pblk system.
Link a comprises establishing the window configuration for pblk and controlling window switching based on time, using the following steps: a1) a pblk object, i.e. a set, is initialized and assigned three window time thresholds, comprising the minimum time in the deterministic delay window, DTWIN_MIN, the maximum time in the deterministic delay window, DTWIN_MAX, and the maximum time in the non-deterministic delay window, NDWIN_MAX; a2) after the set is created, it defaults to the deterministic delay window and starts recording time; a3) while the set is in DTWIN, the garbage-collection thread makes the window-switching judgment; a4) after judging information such as the time, the error rate and the available blocks, the garbage-collection thread switches from DTWIN to NDWIN; a5) while the set is in NDWIN, the garbage-collection thread likewise makes the window-switching judgment; a6) after judging the time, the available blocks and other information, pblk switches from NDWIN back to DTWIN; a7) steps a3 to a6 are repeated.
Link a comprises judging the time using the following rules: while the set is in DTWIN, the garbage-collection process compares the time the set has spent in its current window with the time threshold for switching to NDWIN; when the set's time in DTWIN is less than or equal to DTWIN_MIN, no window switch is performed at all, i.e. garbage collection is not started; when the set's time in DTWIN is greater than DTWIN_MIN and less than or equal to DTWIN_MAX, the window switch is judged using the read error rate and the available-block limit; when the set's time in DTWIN exceeds DTWIN_MAX, the switch from DTWIN to NDWIN is forced; when the set's time in NDWIN is less than or equal to NDWIN_MAX, the switching judgment uses the original criterion for ending garbage collection; when the set's time in NDWIN exceeds NDWIN_MAX, the set's time window is forced to switch to DTWIN.
Link b comprises a scheduling algorithm with the following steps: b1) replace the original entry function of pblk; b2) the generic block layer calls the entry function and passes in a bio; b3) the scheduler obtains the window attribute of the set it belongs to; b4) when the set is in the deterministic time window, dispatch using the original dispatch logic; when it is in the non-deterministic delay window, call generic_make_rq() of the generic block layer to tell the generic block layer to schedule the next request; b5) loop over b2 to b4 whenever a new bio reaches the generic block layer. Step b1 is performed during initialization of the pblk module, while b2 to b4 form a continuous loop while the pblk module runs.
Link c comprises a cache management algorithm with the following steps: c1) ring-cache initialization: determine the cache size according to the number of LUNs, and initialize the cache entry addresses and the data write and flush pointers; c2) after a write request is written into the ring cache, judge the window the set is in; c3) when the set is in the deterministic time window, the cache performs the normal wake-up of the write thread to flush the data in the cache; when the set is in the non-deterministic time window, the wake-up of the write thread is suspended, avoiding the garbage-collection thread's call into the write thread.
The technical effects of the invention: the invention proposes set partitioning, a time window implementation, an I/O scheduler and a cache management algorithm based on LightNVM. Compared with the prior art, the main advantages of the invention are: 1. A time window implementation for pblk (Physical Block Device): on top of the existing multi-pblk object configuration, the time window concept is added, realizing parallel window configuration and switching between pblk objects and providing a system interface for querying the window a pblk is in. 2. An improved scheduling algorithm: the traditional scheduling algorithm is generally designed around the overall parallelism of the solid state disk, but as solid state disk capacity and parallelism keep growing, it can no longer meet the requirement of scheduling I/O requests after set partitioning and window configuration. The improved scheduling algorithm proposed by the invention meets this requirement and can schedule according to the solid state disk's set partitioning and windows, reducing the tail latency of I/O requests. 3. An improved cache management algorithm: the cache management algorithm proposed by the invention keeps the hit-rate advantage of existing cache management algorithms while optimizing on the basis of set and window partitioning, improving garbage-collection efficiency and thereby prolonging the service life of the solid state disk.
The invention differs from traditional solid state disk optimization algorithms in that: (1) The idea for reducing tail latency is different: conventional schemes for reducing the average and tail latency of requests mainly focus on optimizing the garbage-collection algorithm and reducing interference between requests, including shrinking the garbage-collection blocking unit, making the garbage-collection process interruptible, filtering requests to reduce crosstalk, and so on. We instead adopt set partitioning to separate the spaces operated on by different users and minimize crosstalk between them, and adopt time window partitioning to realize deterministic I/O latency, reducing request latency in DTWIN as far as possible by performing no performance-affecting garbage-collection operations there. (2) The scheduler optimization idea is different: conventional schedulers do not consider set partitioning, and we propose an improved scheduler that takes these factors into account. (3) The cache-management optimization idea is different: traditional cache management algorithms focus on improving the hit rate and reducing the number of reads and writes to solid state disk blocks, without considering set and window configuration; we propose an improved method combining the cache with the sets after considering these factors.
Drawings
Fig. 1 is an overall structural view of the present invention.
Fig. 2 is the decision algorithm for switching the deterministic delay window to the non-deterministic delay window.
Fig. 3 is the decision algorithm for switching the non-deterministic delay window to the deterministic delay window.
Fig. 4 is a schematic diagram of the scheduler.
FIG. 5 is a diagram of pblk cache management and the window-judged write-thread wake-up.
The concepts behind the Chinese labels in the figures are as follows: pblk: the host-side FTL implementation for open-channel SSDs; req: a request issued by the generic block layer of the operating system; set: the set concept used in this text; DTWIN_MIN: the minimum time in the deterministic time window; DTWIN_MAX: the maximum time in the deterministic time window; app: an application-layer program; pblk_gc_ts: the garbage-collection thread of pblk; pblk_write_ts: the write thread of pblk; pblk_cache: the cache of pblk; NDWIN: the non-deterministic time window; generic_make_rq: the bio submission interface of the generic block layer.
Detailed Description
The invention is explained below with reference to the figures (figs. 1 to 5) and embodiments.
Referring to figs. 1 to 5, an NVMe-oriented IO determinism optimization strategy method comprises the following steps: step A, partitioning the NVMe storage units into mutually independent sets so that read, write, and garbage-collection operations can proceed in parallel across different sets; step B, for the set-partitioned NVMe device, proposing a new I/O request scheduling algorithm that avoids conflicts between garbage collection and I/O request access; and step C, designing a new cache management algorithm that is aware of garbage-collection operations in each NVMe set, avoiding as far as possible any conflict between garbage collection and I/O request access and reducing the performance jitter of I/O requests. Step A further comprises using the multiple parallel units present in NVMe to partition the parallel units into mutually independent Set groups, guaranteeing parallelism between the Sets. Each partitioned Set has two different time windows: one is the deterministic delay window, DTWIN, in which no garbage collection or wear-leveling operations are performed; the other is the non-deterministic window, NDWIN, used to maintain the Set's performance. The deterministic delay window guarantees that no I/O blocking caused by any controller operation occurs within the window; the non-deterministic window cannot guarantee low-latency handling of requests because of the series of operations that block I/O. Each partitioned Set group is given an attribute storing the time spent in the current window; a threshold is set on this attribute, and exceeding the threshold acts as a trigger for window switching.
Step B further comprises: the host sends requests to pblk; after the set characteristics are judged, the scheduling algorithm schedules the request queue, directly dispatching requests whose operations fall in a deterministic delay window and delaying the scheduling of requests whose operations fall in a non-deterministic delay window. The new cache management algorithm in step C has the following two functions: first, the cache space is divided into cache blocks based on Sets, and each cache block is associated with a Set; second, pblk, which manages the Sets, senses each Set's time window, and when it senses that the corresponding Set is in the non-deterministic time window, that is, the Set may be performing garbage collection, it temporarily avoids evicting data from the cache.
The optimization strategy method comprises the following links: link a, based on a LightNVM device, partitioning pblk objects, i.e. sets, that hold different parallel units (LUNs), and assigning different time windows to different sets; link b, the pblk system loads a scheduler and schedules the I/O requests of the file system; and link c, loading the cache management of the pblk system. Link a comprises establishing the window configuration for pblk and controlling window switching based on time, using the following steps: a1) a pblk object, i.e. a set, is initialized and assigned three window time thresholds, comprising the minimum time in the deterministic delay window, DTWIN_MIN, the maximum time in the deterministic delay window, DTWIN_MAX, and the maximum time in the non-deterministic delay window, NDWIN_MAX; a2) after the set is created, it defaults to the deterministic delay window and starts recording time; a3) while the set is in DTWIN, the garbage-collection thread makes the window-switching judgment; a4) after judging information such as the time, the error rate and the available blocks, the garbage-collection thread switches from DTWIN to NDWIN; a5) while the set is in NDWIN, the garbage-collection thread likewise makes the window-switching judgment; a6) after judging the time, the available blocks and other information, pblk switches from NDWIN back to DTWIN; a7) steps a3 to a6 are repeated.
Link a comprises judging the time using the following rules: while the set is in DTWIN, the garbage-collection process compares the time the set has spent in its current window with the time threshold for switching to NDWIN; when the set's time in DTWIN is less than or equal to DTWIN_MIN, no window switch is performed at all, i.e. garbage collection is not started; when the set's time in DTWIN is greater than DTWIN_MIN and less than or equal to DTWIN_MAX, the window switch is judged using the read error rate and the available-block limit; when the set's time in DTWIN exceeds DTWIN_MAX, the switch from DTWIN to NDWIN is forced; when the set's time in NDWIN is less than or equal to NDWIN_MAX, the switching judgment uses the original criterion for ending garbage collection; when the set's time in NDWIN exceeds NDWIN_MAX, the set's time window is forced to switch to DTWIN. Link b comprises a scheduling algorithm with the following steps: b1) replace the original entry function of pblk; b2) the generic block layer calls the entry function and passes in a bio; b3) the scheduler obtains the window attribute of the set it belongs to; b4) when the set is in the deterministic time window, dispatch using the original dispatch logic; when it is in the non-deterministic delay window, call generic_make_rq() of the generic block layer to tell the generic block layer to schedule the next request; b5) loop over b2 to b4 whenever a new bio reaches the generic block layer. Step b1 is performed during initialization of the pblk module, while b2 to b4 form a continuous loop while the pblk module runs.
Link c comprises a cache management algorithm with the following steps: c1) ring-cache initialization: determine the cache size according to the number of LUNs, and initialize the cache entry addresses and the data write and flush pointers; c2) after a write request is written into the ring cache, judge the window the set is in; c3) when the set is in the deterministic time window, the cache performs the normal wake-up of the write thread to flush the data in the cache; when the set is in the non-deterministic time window, the wake-up of the write thread is suspended, avoiding the garbage-collection thread's call into the write thread.
The method is built with a LightNVM simulator based on an Open Channel SSD. The defining feature of an Open Channel SSD is that the controller part of NVMe is handed to the host, i.e. algorithms such as data layout, garbage collection and wear leveling are handled on the host side. The biggest advantage of this design is that the host can know the state of every location in the solid state disk precisely and in real time, which facilitates isolation between the host and NVMe and can achieve predictable latency. The overall design framework of the scheme is shown in fig. 1.
1. NVMe set partitioning optimization strategy facing I/O determinacy
NVMe set partitioning is a fundamental step in this work. NVMe has multiple parallel units, and these parallel units are partitioned into mutually independent Set groups. Each partitioned Set has two different time windows: one is the deterministic delay window, DTWIN, in which no garbage collection or wear-leveling operations are performed; the other is the non-deterministic window, NDWIN, in which those operations run to maintain the Set's performance. Since each Set holds its own NVMe parallel units, parallelism between Sets is guaranteed. The deterministic delay window guarantees that no I/O blocking caused by any controller operation occurs within the window; the non-deterministic window cannot guarantee low-latency handling of requests because of the sequence of operations that block I/O. In addition, each partitioned Set group is given an attribute storing the time spent in the current window; a threshold is set on this attribute, and a trigger fires window switching when the threshold is exceeded.
The qemu-nvme branch of the QEMU virtual machine is installed on an Ubuntu host to support NVMe devices, a virtual machine system is installed on the NVMe device, and an OCSSD (Open Channel SSD) with multiple parallel units is simulated. To make the guest system support LightNVM and pblk, the guest kernel is replaced: the kernel officially provided by the Open Channel SSD project is compiled and installed.
The specific scheme design for research point one is as follows: the entire NVMe device is divided into multiple logical-unit sets so that different sets are mutually independent, and operations such as reading, writing and garbage collection can proceed in parallel in different sets. On this basis, the sets are labeled, i.e. given a deterministic time window and a non-deterministic time window. A set in the deterministic time window is guaranteed not to undergo garbage collection, i.e. no conflict between garbage collection and I/O request access can occur in that set at that moment; a set in the non-deterministic time window may be undergoing garbage collection and can therefore introduce I/O request response delays. The Sets are partitioned over the NVMe capacity by pblk, the host-side controller of LightNVM, which manages each Set's attributes and window switching.
The concrete implementation of research point two modifies the internal pblk code and introduces the time window concept. A judgment on elapsed time is added to pblk's original judgment of whether to enter the garbage-collection process. Figs. 2 and 3 show the pseudo code related to window switching. The method comprises the following steps:
step 2.1: the pblk object, i.e. the set, is initialized and assigned three window time thresholds, comprising the minimum time in the deterministic delay window (DTWIN_MIN), the maximum time in the deterministic delay window (DTWIN_MAX), and the maximum time in the non-deterministic delay window (NDWIN_MAX);
step 2.2: after the set is created, it defaults to the deterministic delay window and starts recording time;
step 2.3: while the set is in DTWIN, the garbage-collection thread makes the window-switching judgment;
step 2.4: after judging information such as the time, the error rate and the available blocks, the garbage-collection thread switches from DTWIN to NDWIN;
step 2.5: while the set is in NDWIN, the garbage-collection thread likewise makes the window-switching judgment;
step 2.6: after judging the time, the available blocks and other information, pblk switches from NDWIN back to DTWIN;
step 2.7: steps 2.3 to 2.6 are repeated.
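The time judgments behind steps 2.3 to 2.6 can be condensed into one decision function. The following is a minimal userspace sketch, not the actual pblk code: gc_pressure stands in for the read-error-rate and available-block checks, and gc_done stands in for the original end-of-garbage-collection criterion.

```c
enum pblk_window { DTWIN, NDWIN };

struct pblk_set {
    enum pblk_window win;      /* current time window */
    unsigned long win_time;    /* time spent in the current window */
    unsigned long dtwin_min, dtwin_max, ndwin_max;
};

/* Returns 1 when the window should switch, per the time-judgment
 * rules: never before DTWIN_MIN, forced after DTWIN_MAX or
 * NDWIN_MAX, and otherwise decided by the stand-in flags. */
int should_switch_window(const struct pblk_set *s, int gc_pressure, int gc_done)
{
    if (s->win == DTWIN) {
        if (s->win_time <= s->dtwin_min)
            return 0;           /* no switch: GC is not even started */
        if (s->win_time > s->dtwin_max)
            return 1;           /* forced DTWIN -> NDWIN */
        return gc_pressure;     /* error rate / available blocks decide */
    }
    /* set is in NDWIN */
    if (s->win_time > s->ndwin_max)
        return 1;               /* forced NDWIN -> DTWIN */
    return gc_done;             /* original GC-completion judgment */
}
```

The garbage-collection thread would call this on each pass, flipping the window and resetting the timer whenever it returns 1.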
2. NVMe performance jitter-oriented I/O scheduling algorithm
The scheduling algorithm implemented in this method replaces the original entry function. Because the scheduling algorithm is written into pblk itself, the newly built scheduler can easily obtain all kinds of information about the current pblk instance, including the time window it is in, the flash translation information stored in pblk, and the time window thresholds.
On top of the original pblk structure, the optimization strategy adds a scheduling-algorithm structure identifying what the current pblk instance holds; it is initialized when the instance is initialized, in steps similar to pblk's other working structures. To remain compatible with the original code, the pblk entry function is replaced: the original entry function pblk_make_rq() is replaced by the scheduling algorithm's pblk_sche_make_rq(). To schedule requests better, a concrete scheduling algorithm is written into this structure: the requests inserted into the request queue through the entry function are traversed and scheduled in a targeted manner according to the window state of the pblk instance the scheduler belongs to. When the pblk instance is in DTWIN, the entry function of the scheduling algorithm first judges the window, then dispatches read and write requests and sends them to the NVMe device for response; when the pblk instance is in NDWIN, the algorithm returns the request to the generic block layer, telling the upper layer to schedule the next bio in the queue belonging to the device, so as to achieve IOPS and response-latency performance superior to the original design.
The method comprises the following specific steps:
step 3.1: replace the original pblk entry function;
step 3.2: the generic block layer calls the entry function to pass in a bio;
step 3.3: the scheduler obtains the window attribute of the set it belongs to;
step 3.4: when the set is in the deterministic time window, dispatch using the original dispatch logic; when it is in the non-deterministic delay window, call generic_make_rq() of the generic block layer to notify it to schedule the next request;
step 3.5: when a new bio reaches the generic block layer, repeat steps 3.2-3.4.
Step 3.1 is performed when the pblk module is initialized; steps 3.2-3.4 form a continuous loop while pblk is running.
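Steps 3.1-3.4 can be sketched in user-space C as follows. The real entry function operates on `struct bio` inside the kernel; here the request type, the scheduler structure, and all helper names other than pblk_sche_make_rq() and the generic_make_rq() role it delegates to are simplified stand-ins.

```c
#include <assert.h>

/* Illustrative window states; mirrors the DTWIN/NDWIN attribute of a set. */
enum win_state { DTWIN, NDWIN };

struct bio_req { int id; };         /* stand-in for struct bio */

struct pblk_sched {
    enum win_state window;          /* window of the set this instance owns */
    int dispatched;                 /* requests sent down to the NVMe device */
    int deferred;                   /* requests handed back to the block layer */
};

static void dispatch_to_device(struct pblk_sched *s, struct bio_req *b)
{
    (void)b;
    s->dispatched++;                /* stands in for the original dispatch path */
}

static void return_to_block_layer(struct pblk_sched *s, struct bio_req *b)
{
    (void)b;
    s->deferred++;                  /* stands in for calling generic_make_rq() */
}

/* Replacement entry function: check the window first, then dispatch inside
 * DTWIN or hand the request back to the generic block layer inside NDWIN. */
static void pblk_sche_make_rq(struct pblk_sched *s, struct bio_req *b)
{
    if (s->window == DTWIN)
        dispatch_to_device(s, b);
    else
        return_to_block_layer(s, b);
}
```

The design choice the sketch captures is that no request is blocked inside pblk: a request arriving during NDWIN is immediately returned upward so the block layer can schedule the next bio instead of waiting behind garbage collection.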
3. I/O performance jitter-oriented NVMe cache management algorithm
The management algorithm has two functions. First, the cache space is partitioned into cache blocks on a per-Set basis, and each cache block is associated with a Set. Second, the algorithm lets pblk sense each Set's time window; when it detects that the corresponding Set is in the non-deterministic time window, i.e., the Set may be performing garbage collection, it temporarily avoids evicting that Set's data from the cache.
The pblk code implements reading and writing with two separate threads. The read thread can call down directly into the NVMe driver for access, while the write thread cannot directly obtain the bio data entering through the pblk entry function; instead, the data is staged in a ring cache following a producer-consumer model. When a write is needed, the program calls pblk_write_kick() to forcibly wake the pblk write thread, and the awakened thread flushes the data in the ring cache as required, i.e., it calls down into the NVMe driver to write the ring cache contents into the solid-state disk's storage space. What this design changes is the logic that wakes the write thread to drain data after an insertion into the ring cache: when the cache detects that the current pblk instance is in NDWIN, it skips waking the write thread to write user data into flash. As shown in fig. 5, when a write request arrives at pblk, the pblk cache queries the window of the pblk instance after the data insertion completes; while the garbage collection thread occupies the cache region and the write thread, the cache refrains from waking the write thread, reducing the interference of the write path with the garbage collection process.
Because, in the pblk design, the write thread is the single consumer of the ring cache, and the pblk garbage collection thread also uses the ring cache for temporary data staging, performing the window check when user data is inserted into the ring cache minimizes the interference between user wake-ups of the write thread and garbage collection's own wake-ups of it. This improves garbage collection efficiency and shortens garbage collection time, i.e., it reduces the time the solid-state disk spends in NDWIN, improving overall latency and IOPS.
The method comprises the following specific steps:
step 4.1: initialize the ring cache: determine the cache size from the number of LUNs, and initialize the cache entry addresses and the data write and flush pointers;
step 4.2: after a write request is written into the ring cache, check the window the set is in;
step 4.3: when the set is in the deterministic time window, the cache wakes the write thread normally to flush the cached data; when the set is in the non-deterministic time window, waking the write thread is suspended to avoid interfering with the garbage collection thread's use of the write thread.
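Steps 4.1-4.3 can be sketched as a ring cache whose insert path gates the wake-up on the window state. This is a user-space model under stated assumptions: the field and helper names apart from pblk_write_kick() are illustrative, and the real ring cache stages bio payloads, not integers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative window states; mirrors the DTWIN/NDWIN attribute of a set. */
enum win_state { DTWIN, NDWIN };

#define RB_SIZE 8                   /* real size would derive from LUN count */

struct ring_cache {
    int entries[RB_SIZE];
    int head, tail;                 /* data write and flush positions */
    enum win_state window;          /* window of the owning set */
    int kicks;                      /* times the write thread was woken */
};

static void pblk_write_kick(struct ring_cache *rb)
{
    rb->kicks++;                    /* real code wakes the pblk write thread */
}

/* Insert user data, then wake the write thread only in DTWIN; in NDWIN the
 * wake-up is suppressed so the GC thread keeps sole use of the write thread. */
static bool rb_insert(struct ring_cache *rb, int data)
{
    int next = (rb->head + 1) % RB_SIZE;
    if (next == rb->tail)
        return false;               /* cache full */
    rb->entries[rb->head] = data;
    rb->head = next;
    if (rb->window == DTWIN)
        pblk_write_kick(rb);
    return true;
}
```

Note that the insertion itself still succeeds during NDWIN; only the flush trigger is deferred, so user writes are absorbed by the cache while garbage collection runs.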
To address the NVMe performance jitter caused by flash garbage collection, the invention resolves conflicts between flash garbage collection operations and I/O request accesses at the following three levels. Through a reasonable partitioning of NVMe storage units into sets, different sets become mutually independent, so read, write, and garbage collection operations proceed in parallel across sets. For NVMe after the set partitioning, a new I/O request scheduling algorithm is proposed to avoid conflicts between garbage collection and I/O request access. A new cache management algorithm is designed to sense the garbage collection operations of an NVMe set, avoiding conflicts between garbage collection and I/O request access as far as possible and reducing I/O request performance jitter.
The technical scheme of the invention is as follows. First, through a reasonable partitioning of NVMe storage units into sets, different sets are made mutually independent, so read, write, and garbage collection operations can run in parallel across sets. NVMe contains multiple parallel units; partitioning these units yields mutually independent Set groups. Each partitioned Set has two different time windows: one is the deterministic delay window, DTWIN, during which no garbage collection or wear-leveling operations are performed; the other is the non-deterministic window, NDWIN, during which those operations run to maintain the Set's performance. Because each Set holds its own NVMe parallel units, parallelism between Sets is guaranteed. The deterministic delay window ensures that no controller operation blocks I/O within the window; the non-deterministic window cannot guarantee low-latency request handling because of the series of operations that block I/O. In addition, each partitioned Set group is given an attribute that records the time spent in the current window; a threshold is set for this attribute, and exceeding the threshold triggers a window switch. Second, a new I/O request scheduling algorithm is proposed for NVMe after the set partitioning to avoid conflicts between garbage collection and I/O request access. The host sends requests to pblk; after judging the set's characteristics, the scheduling algorithm schedules the request queue, dispatching directly the requests targeting a set in the deterministic delay window and deferring the requests targeting a set in the non-deterministic delay window.
Finally, a new cache management algorithm is designed to sense the garbage collection operations of an NVMe set, avoiding conflicts between garbage collection and I/O request access as far as possible and reducing I/O request performance jitter. The management algorithm has two functions. First, the cache space is partitioned into cache blocks on a per-Set basis, and each cache block is associated with a Set. Second, the algorithm lets pblk sense each Set's time window; when it detects that the corresponding Set is in the non-deterministic time window, i.e., the Set may be performing garbage collection, it temporarily avoids evicting that Set's data from the cache.
The method mainly comprises the following research points. Research point 1: based on a LightNVM device, partition pblk objects (sets) holding different parallel-unit LUNs, and assign different time windows to different sets. Research point 2: the pblk system loads a scheduler and schedules the file system's I/O requests. Research point 3: the pblk system loads cache management. Research point 1 covers the configuration that creates windows for pblk and the time-based control of window switching, with the following steps: step 1: when a pblk object (set) is initialized, assign three window time thresholds: the minimum time in the deterministic delay window (DTWIN_MIN), the maximum time in the deterministic delay window (DTWIN_MAX), and the maximum time in the non-deterministic delay window (NDWIN_MAX); step 2: after the set is created, it defaults to the deterministic delay window and starts recording time; step 3: while the set is in DTWIN, the garbage collection thread judges whether to switch windows; step 4: after judging the time, error rate, available blocks, and related information, the garbage collection thread may switch from DTWIN to NDWIN; step 5: while the set is in NDWIN, the garbage collection thread likewise judges whether to switch windows; step 6: after judging the time, available blocks, and related information, pblk switches from NDWIN back to DTWIN; step 7: repeat steps 3-6.
The specific time judgment in research point 1 is as follows. While the set is in DTWIN, the garbage collection process compares the time the set has spent in the current window against the thresholds when deciding whether to switch the set to NDWIN. When the time in DTWIN is less than or equal to DTWIN_MIN, no window switch occurs at all, i.e., garbage collection is not started; when the time in DTWIN is greater than DTWIN_MIN and less than or equal to DTWIN_MAX, the switch is decided using the read error rate and the available-block limit; when the time in DTWIN exceeds DTWIN_MAX, the switch from DTWIN to NDWIN is forced. When the set's time in NDWIN is less than or equal to NDWIN_MAX, the switch is decided using the original end-of-garbage-collection check; when the set's time in NDWIN exceeds NDWIN_MAX, the set's time window is forcibly switched to DTWIN.
The scheduling algorithm in research point 2 is: step 2.1: replace the original pblk entry function; step 2.2: the generic block layer calls the entry function to pass in a bio; step 2.3: the scheduler obtains the window attribute of the set it belongs to; step 2.4: when the set is in the deterministic time window, dispatch using the original dispatch logic; when it is in the non-deterministic delay window, call generic_make_rq() of the generic block layer to notify it to schedule the next request; step 2.5: when a new bio reaches the generic block layer, repeat steps 2.2-2.4. Step 2.1 is performed when the pblk module is initialized; steps 2.2-2.4 form a continuous loop while pblk is running.
The cache management algorithm in research point 3 is: step 3.1: initialize the ring cache: determine the cache size from the number of LUNs, and initialize the cache entry addresses and the data write and flush pointers; step 3.2: after a write request is written into the ring cache, check the window the set is in; step 3.3: when the set is in the deterministic time window, the cache wakes the write thread normally to flush the cached data; when the set is in the non-deterministic time window, waking the write thread is suspended to avoid interfering with the garbage collection thread's use of the write thread.
Those skilled in the art will appreciate that the invention may be practiced without these specific details. The above description is intended to help those skilled in the art understand the invention, but does not limit its scope of protection. Any equivalents, modifications, and/or omissions made without departing from the spirit and scope of the invention remain within that scope.

Claims (5)

1. An NVMe-oriented IO deterministic optimization strategy method, characterized by comprising the following steps: step A, partitioning the NVMe storage units into sets so that different sets are mutually independent and read, write, and garbage collection operations proceed in parallel across different sets; step B, for NVMe after the set partitioning, proposing a new I/O request scheduling algorithm that avoids conflicts between garbage collection and I/O request access; step C, designing a new cache management algorithm that senses the garbage collection operations of an NVMe set, avoiding conflicts between garbage collection and I/O request access as far as possible and reducing I/O request performance jitter;
step A further comprises using the multiple parallel units present in NVMe and partitioning them to obtain mutually independent Set groups, ensuring parallelism between the Set groups; each partitioned Set has two different time windows: one is the deterministic delay window, DTWIN, during which no garbage collection or wear-leveling operations are performed; the other is the non-deterministic window, NDWIN, used to maintain the Set's performance; the deterministic delay window ensures that no controller operation blocks I/O within the window; the non-deterministic window cannot guarantee low-latency request handling because of the series of operations that block I/O; each partitioned Set group is given an attribute that records the time spent in the current window, a threshold is set for this attribute, and exceeding the threshold triggers a window switch;
step B further comprises the host sending requests to pblk; after judging the set's characteristics, the scheduling algorithm schedules the request queue, dispatching directly the requests targeting a set in the deterministic delay window and deferring the requests targeting a set in the non-deterministic delay window;
the new cache management algorithm in step C has the following two functions: first, the cache space is partitioned into cache blocks on a per-Set basis, and each cache block is associated with a Set; second, the algorithm lets pblk sense each Set's time window, and when it detects that the corresponding Set is in the non-deterministic time window, i.e., the Set may be performing garbage collection, it temporarily avoids evicting that Set's data from the cache;
the optimization strategy method comprises the following links: link a, based on a LightNVM device, partitioning pblk objects (sets) holding different parallel-unit LUNs (logical unit numbers) and assigning different time windows to different sets; link b, the pblk system loads a scheduler and schedules the file system's I/O requests; link c, the pblk system loads cache management.
2. The NVMe-oriented IO deterministic optimization strategy method of claim 1, wherein link a comprises the configuration that creates windows for pblk and the time-based control of window switching, with the following steps: a1) when a pblk object, i.e., a set, is initialized, assign three window time thresholds: the minimum time in the deterministic delay window, DTWIN_MIN; the maximum time in the deterministic delay window, DTWIN_MAX; and the maximum time in the non-deterministic delay window, NDWIN_MAX; a2) after the set is created, it defaults to the deterministic delay window and starts recording time; a3) while the set is in DTWIN, the garbage collection thread judges whether to switch windows; a4) after judging the time, the error rate, and the available-block information, the garbage collection thread may switch from DTWIN to NDWIN; a5) while the set is in NDWIN, the garbage collection thread likewise judges whether to switch windows; a6) after judging the time and the available-block information, pblk switches from NDWIN back to DTWIN; a7) repeat a3)-a6) above.
3. The NVMe-oriented IO deterministic optimization strategy method of claim 1, wherein link a comprises the following steps: while the set is in DTWIN, the garbage collection process compares the time the set has spent in the current window against the thresholds when deciding whether to switch the set to NDWIN; when the set's time in DTWIN is less than or equal to DTWIN_MIN, no window switch occurs at all, i.e., garbage collection is not started; when the set's time in DTWIN is greater than DTWIN_MIN and less than or equal to DTWIN_MAX, the switch is decided using the read error rate and the available-block limit; when the set's time in DTWIN exceeds DTWIN_MAX, the window switch from DTWIN to NDWIN is forced; when the set's time in NDWIN is less than or equal to NDWIN_MAX, the switch is decided using the original end-of-garbage-collection check; when the set's time in NDWIN exceeds NDWIN_MAX, the set's time window is forcibly switched to DTWIN.
4. The NVMe-oriented IO deterministic optimization strategy method of claim 1, wherein link b comprises a scheduling algorithm employing the following steps: b1) replace the original pblk entry function; b2) the generic block layer calls the entry function to pass in a bio; b3) the scheduler obtains the window attribute of the set it belongs to; b4) when the set is in the deterministic time window, dispatch using the original dispatch logic; when it is in the non-deterministic delay window, call generic_make_rq() of the generic block layer to notify it to schedule the next request; b5) when a new bio reaches the generic block layer, repeat b2)-b4); b1) is performed when the pblk module is initialized, and b2)-b4) form a continuous loop while the pblk module is running.
5. The NVMe-oriented IO deterministic optimization strategy method of claim 1, wherein link c comprises a cache management algorithm employing the following steps: c1) initialize the ring cache: determine the cache size from the number of LUNs, and initialize the cache entry addresses and the data write and flush pointers; c2) after a write request is written into the ring cache, check the window the set is in; c3) when the set is in the deterministic time window, the cache wakes the write thread normally to flush the cached data; when the set is in the non-deterministic time window, waking the write thread is suspended to avoid interfering with the garbage collection thread's use of the write thread.
CN202011014697.XA 2020-09-24 2020-09-24 NVMe-oriented IO deterministic optimization strategy method Active CN112559381B (en)

Publications (2)

Publication Number Publication Date
CN112559381A CN112559381A (en) 2021-03-26
CN112559381B true CN112559381B (en) 2022-10-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant