CN116364148A - Wear balancing method and system for distributed full flash memory system


Info

Publication number
CN116364148A
CN116364148A (Application CN202210471090.7A)
Authority
CN
China
Prior art keywords: data, list, writing, block, blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210471090.7A
Other languages
Chinese (zh)
Inventor
魏征
马一力
邢晶
谭光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Institute of Computing Technology of CAS
Original Assignee
Lenovo Beijing Ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd, Institute of Computing Technology of CAS filed Critical Lenovo Beijing Ltd
Priority to CN202210471090.7A
Publication of CN116364148A

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C 11/00 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/21 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C 11/34 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C 11/40 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C 11/401 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells
    • G11C 11/4063 - Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing or timing
    • G11C 11/407 - Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing or timing for memory cells of the field-effect type
    • G11C 11/409 - Read-write [R-W] circuits
    • G11C 11/4091 - Sense or sense/refresh amplifiers, or associated sense circuitry, e.g. for coupled bit-line precharging, equalising or isolating
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a wear leveling method for a distributed full flash memory system, which realizes multi-level wear leveling of the storage system through four levels: wear leveling among nodes, wear leveling among devices within a node, wear leveling among blocks within a device, and wear leveling between in-use blocks and idle blocks within a device. This four-level wear leveling algorithm balances the wear of every device in the storage system, prolongs device lifetime to the greatest extent, saves maintenance cost of the storage system, and ensures data reliability.

Description

Wear balancing method and system for distributed full flash memory system
Technical Field
The invention relates to the field of data storage and erasure coding for distributed full flash memory systems, and in particular to a wear leveling algorithm among storage devices in a distributed full flash memory system.
Background
In the big data era, a storage system designed for high-performance computing must provide highly concurrent, high-bandwidth, low-latency IO. Traditional mechanical disks cannot provide efficient access bandwidth and latency; in particular, highly concurrent OpenMPI parallel computing applications demand higher sustained write throughput and lower response latency. On the other hand, with advances in process and integrated circuit technology, solid-state storage devices such as NVMe SSDs are becoming cheaper, NVMe SSDs are gradually replacing traditional mechanical disks, and more and more all-flash products are reaching the market. Using SSD devices instead of rotating media in a storage system brings the benefits of high-bandwidth, low-latency IO, but also introduces lifetime issues.
SSDs use semiconductors as the storage medium, including DRAM, NOR Flash, NAND Flash and so on. SSDs using NAND Flash as the storage medium are now the most common. NAND Flash stores data in floating-gate cells, each of which can store 1, 2 or 3 bits depending on the medium type (SLC, MLC, TLC). Floating-gate cells form logical pages, and logical pages form logical blocks. A page consists of 4 kB of data storage space plus 128 B of ECC check space, and 64 pages form one block, so a block holds 64 x 4 kB = 256 kB of data plus 8 kB of ECC. The page is the smallest read-write addressing unit in NAND Flash, and the block is the smallest erasable unit. Data in a blank page can be written freely, but a page that has already been written cannot be rewritten in place; rewriting requires erasing the whole block before new data can be written.
This read-write mechanism of NAND Flash means that its interface cannot expose only the usual read and write operations; an erase operation must be added as well. A flash translation layer (FTL) is therefore inserted between the Flash and the traditional interface, sitting between the Host Interface and the NAND Interface of the SSD and responsible for translating between the two. The main functions of the FTL are address mapping, wear leveling, garbage collection and bad block management. Address mapping: the host locates data with logical addresses, and the FTL maps logical addresses to physical addresses. When data is modified, the data is written to a new page and the address mapping recorded in the FTL is updated, avoiding redundant reads and writes. Wear leveling: each storage cell in Flash can be written only a limited number of times, so to prolong the life of the whole device the write counts of the cells are kept as even as possible. The FTL records the remaining life of each block and preferentially writes new data to blocks with longer remaining life; at the same time, data stored in blocks with relatively long remaining life but rarely changed is moved to blocks with relatively short remaining life. Garbage collection: in NAND Flash, erasure happens in block units, so invalid page space is not released immediately. Garbage collection releases the invalid pages to maintain space utilization: it copies the valid pages of one block into another block and then erases the entire block. Bad block management records blocks that can no longer be used, so that they are avoided when writing data.
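To make these FTL responsibilities concrete, the following minimal Python sketch (illustrative only, not from the patent; all names are hypothetical) shows logical-to-physical page mapping together with per-block erase counters, the bookkeeping that wear leveling and garbage collection build on.

```python
# Minimal FTL sketch (illustrative only): logical -> physical page mapping
# with per-block erase counters. Real FTLs add garbage collection,
# bad-block tables and power-loss protection.
PAGES_PER_BLOCK = 64

class TinyFTL:
    def __init__(self, num_blocks):
        self.mapping = {}                        # logical page -> (block, page)
        self.erase_count = [0] * num_blocks      # wear counter per block
        self.free_pages = [(b, p) for b in range(num_blocks)
                           for p in range(PAGES_PER_BLOCK)]

    def write(self, logical_page, data):
        # NAND pages cannot be rewritten in place: always take a fresh page
        block, page = self.free_pages.pop(0)
        self.mapping[logical_page] = (block, page)
        # ... program `data` into (block, page) on the device ...

    def erase_block(self, block):
        # Erasure is per block and consumes one program/erase cycle
        self.erase_count[block] += 1
        self.free_pages.extend((block, p) for p in range(PAGES_PER_BLOCK))
```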
Besides the storage medium and the controller, another key component of an SSD is the host interface. Manufacturers led by Intel jointly developed NVMe, a protocol designed specifically for PCIe SSDs. The NVMe protocol supports up to 64K queues, each holding up to 64K commands; commands are issued through multiple queues with a doorbell mechanism, the queues need no locking, and the performance of high-performance SSD devices can be fully exploited. The PCIe interface removes the hardware-path bottleneck of the SSD host interface, and the NVMe protocol removes the software-protocol bottleneck. SSD devices are offered with SATA or PCIe interfaces: the SATA interface is limited by the SATA bus to about 600 MB/s, whereas a PCIe NVMe SSD accessed through the SPDK user-space interface can provide about 3 GB/s of read-write bandwidth and access latency on the order of 10 us.
A distributed full flash memory system built on SSDs therefore needs a multi-dimensional wear leveling algorithm that guarantees wear leveling across nodes, across devices and inside each device. The SPDK user-mode driver avoids CPU and protocol-stack overheads, letting developers fully exploit the performance of the bare device, but the wear leveling algorithm inside an SSD can only level wear within that device and cannot provide system-level wear leveling control.
Disclosure of Invention
In a real distributed full flash storage environment there are data read, write, update and delete operations. A data write lands on a block of a storage device; writing to SSD storage cells involves erasure: cells holding no data can be written directly, while cells holding data must be erased first. The number of erase cycles of an SSD storage cell is limited; when the limit is reached the cell is worn out and can no longer store data, causing data loss or device failure. A data update is a read-modify-write of part of the data; it is also a write, and likewise requires copying before the data is written to a new block. A delete operation marks a block as invalid, the data stored in it becomes invalid, and data can be written to it again afterwards. Blocks involved in writes, updates and deletes are all erased, and each erase reduces the life of a NAND block by one cycle. A wear leveling technique is therefore needed to keep the lifetimes of the storage devices in the whole system as balanced as possible and to improve data reliability.
Specifically, the invention provides a wear leveling method for a distributed full flash memory system, which realizes multi-level wear leveling of the storage system through four levels: wear leveling among nodes, wear leveling among devices within a node, wear leveling among blocks within a device, and wear leveling between in-use blocks and idle blocks within a device. The method comprises the following steps:
Step 1: the distributed full flash memory system comprises a client, storage servers and a metadata server; the client initiates a write request, and the layout storage server holding the layout information of the data block is determined according to the hash value of the write request;
Step 2: according to the load information and device wear condition of all storage servers, the layout storage server assigns the storage server with the lowest device wear among all storage servers as the data owner storage server of the data block;
Step 3: the client sends the write request to the data owner storage server so as to write the data block into it; the data owner storage server updates its local wear record, synchronizes the local wear record to the metadata server when sending a heartbeat request to the metadata server, and updates the locally stored wear information of all node devices according to the return value of the metadata server;
wherein step 2 comprises: if several storage servers share the lowest device wear, the storage server that has not been assigned for the longest time is selected from among them as the data owner storage server (a sketch of this write path follows).
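A minimal sketch of this three-step write path, assuming simplified per-server records of device wear and last assignment time (all names and data structures are illustrative, not taken from the patent):

```python
import hashlib
import time

def pick_layout_server(ino, block_addr, layout_servers):
    # Step 1: the hash of the write request selects the layout storage server
    key = hashlib.md5(f"{ino}:{block_addr}".encode()).hexdigest()
    return layout_servers[int(key, 16) % len(layout_servers)]

def pick_data_owner(servers):
    # Step 2: lowest device wear wins; ties go to the server
    # that has not been assigned for the longest time
    lowest = min(s["wear"] for s in servers.values())
    candidates = [n for n, s in servers.items() if s["wear"] == lowest]
    return min(candidates, key=lambda n: servers[n]["last_assigned"])

servers = {
    "osd1": {"wear": 12, "last_assigned": 100.0},
    "osd2": {"wear": 9,  "last_assigned": 250.0},
    "osd3": {"wear": 9,  "last_assigned": 80.0},
}
layout_srv = pick_layout_server(ino=42, block_addr=4096,
                                layout_servers=list(servers))
owner = pick_data_owner(servers)   # -> "osd3" (tie broken by oldest assignment)
# Step 3 (sketch): the client writes to `owner`, which updates its local
# wear record and later syncs it to the metadata server via heartbeat
servers[owner]["wear"] += 1
servers[owner]["last_assigned"] = time.time()
```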
In the wear leveling method for the distributed full flash memory system,
the storage server records the write wear condition of each local SSD device and of each data block, and records the mapping relation between data blocks and SSD devices; a data structure is built for each SSD device to record its in-use blocks and idle blocks, and the write counts of all blocks are recorded;
the step 3 comprises the following steps:
Step 31: the data owner storage server determines whether there is only one SSD device with the fewest written blocks; if so, the data block is written into that SSD device, otherwise the data block is written into the SSD device with the fewest total writes among all SSD devices with the fewest written blocks.
In the wear leveling method for the distributed full flash memory system, the SSD device classifies and manages data blocks according to whether they store data, organizing them into a work_list and a free_list: the work_list organizes the blocks storing valid data in ascending order of write count, and the free_list organizes the idle blocks holding no data in ascending order of write count;
this step 31 comprises:
Step 311: when the data block is written into the SSD device, the block with the fewest writes in the free_list of that device is selected for writing, and the block is then inserted into the work_list of the device according to its write count.
The wear leveling method for the distributed full flash memory system, wherein the step 311 includes:
Step 3111: record in the work_list the time at which the current data block was inserted into the work_list, its access count and its last access time; periodically poll the data blocks in the work_list and the free_list and count the total writes of the blocks in each list; when the average write count of the blocks in the work_list is greater than the average write count of the blocks in the free_list, or when a block in the work_list has not been accessed for a long time and its write count is smaller than the average write count of the blocks in the free_list, exchange the least-accessed block with the fewest writes in the work_list with the block with the most writes in the free_list.
The invention also provides a wear leveling system for a distributed full flash memory system, wherein the distributed full flash memory system comprises a client, storage servers and a metadata server, and the wear leveling system comprises:
a request initiating module for causing the client to initiate a write request and determining the layout storage server holding the layout information of the data block according to the hash value of the write request;
a first wear leveling module for assigning, according to the load information and device wear condition of all storage servers, the storage server with the lowest device wear among all storage servers as the data owner storage server of the data block;
a request execution module for causing the client to send the write request to the data owner storage server so as to write the data block into it, wherein the data owner storage server updates its local wear record, synchronizes the local wear record to the metadata server when sending a heartbeat request to the metadata server, and updates the locally stored wear information of all node devices according to the return value of the metadata server;
wherein the first wear leveling module is further configured to: if several storage servers share the lowest device wear, select the storage server that has not been assigned for the longest time from among them as the data owner storage server.
In the wear leveling system for the distributed full flash memory system,
the storage server records the write wear condition of each local SSD device and of each data block, and records the mapping relation between data blocks and SSD devices; a data structure is built for each SSD device to record its in-use blocks and idle blocks, and the write counts of all blocks are recorded;
the request execution module includes:
a second wear leveling module for causing the data owner storage server to determine whether there is only one SSD device with the fewest written blocks; if so, the data block is written into that SSD device, otherwise the data block is written into the SSD device with the fewest total writes among all SSD devices with the fewest written blocks.
In the wear leveling system for the distributed full flash memory system, the SSD device classifies and manages data blocks according to whether they store data, organizing them into a work_list and a free_list: the work_list organizes the blocks storing valid data in ascending order of write count, and the free_list organizes the idle blocks holding no data in ascending order of write count;
the second wear leveling module includes:
a third wear leveling module for selecting, when the data block is written into the SSD device, the block with the fewest writes in the free_list of that device for writing, and inserting the block into the work_list of the device according to its write count.
The wear leveling system for the distributed full flash memory system, wherein the third wear leveling module comprises:
a fourth wear leveling module for recording in the work_list the time at which the current data block was inserted into the work_list, its access count and its last access time; periodically polling the data blocks in the work_list and the free_list and counting the total writes of the blocks in each list; and, when the average write count of the blocks in the work_list is greater than the average write count of the blocks in the free_list, or when a block in the work_list has not been accessed for a long time and its write count is smaller than the average write count of the blocks in the free_list, exchanging the least-accessed block with the fewest writes in the work_list with the block with the most writes in the free_list.
The invention also provides a storage medium storing a program for executing any one of the above wear leveling methods for a distributed full flash memory system.
The invention also provides a client for use with any one of the above wear leveling systems for a distributed full flash memory system.
The advantages of the invention are as follows:
the invention provides a wear leveling algorithm based on a four-stage selection algorithm. The method comprises the steps of ensuring equipment wear balance among nodes through a node level selection algorithm, ensuring the wear balance among equipment in the nodes through an equipment level selection algorithm, ensuring that blocks with minimum wear degree are always selected when data in the equipment are written through an equipment block level selection algorithm, and ensuring the overall wear balance through an exchange working block and an idle block through an equipment block exchange algorithm. And the wear balance of all blocks in the storage system is guaranteed through a four-level scheduling algorithm, and the service life is maximized.
Drawings
FIG. 1 is a diagram of a distributed full flash memory system architecture;
FIG. 2 is a schematic diagram of a node level data placement selection algorithm implementation based on consistent hashing;
FIG. 3 is a schematic diagram of an intra-device block level data selection method implementation.
Detailed Description
During data read, write and delete in the distributed full flash memory system, the following points are required:
1. During data writing, node-level device wear leveling among nodes is required. By balancing the load of each node, the space usage of each node is kept balanced and node-level wear leveling is guaranteed.
2. During data writing, wear leveling among the storage devices within a node is required. Inside a storage node, each node is equipped with multiple storage devices, either SATA SSDs or NVMe SSDs. Wear leveling among the storage devices is ensured by balancing the load and use of each SSD; each SSD keeps the amount of stored data balanced to ensure load balance among the devices.
3. During data writing, wear leveling inside the device is required. Within a storage device, the write counts of the blocks are balanced by managing the write count of each block, ensuring block-level wear leveling inside the device.
4. During operation of the storage system, wear leveling of the blocks within the storage device must be maintained. Block-level exchange inside the device ensures wear leveling of every block.
In summary, the invention provides a wear leveling algorithm for a distributed full flash array, which realizes multi-level wear leveling of the storage system through four levels: wear leveling among nodes, wear leveling among devices within a node, wear leveling among blocks within a device, and wear leveling between in-use blocks and idle blocks. This four-level wear leveling algorithm balances the wear of every device in the storage system, prolongs device lifetime to the greatest extent, saves maintenance cost of the storage system, and ensures data reliability.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
The distributed full flash memory system built on NVMe SSDs is shown in fig. 1. The distributed full flash file system cluster built on NVM and NVMe SSDs includes a metadata server (MDS), object data servers (OSD) and clients, and each node contains multiple NVMe SSD storage devices. When designing the distributed full flash memory system, the overall architecture of the all-flash cluster must be designed at the node level, the device level, the block level and the block exchange level so as to achieve overall wear leveling.
The metadata server is responsible for managing the file directory of the file system, the state of each node, location information and so on; the object data servers are responsible for storing data and providing data read-write services; in this invention the object data servers are responsible for wear leveling; and the client provides the access interface for users. Wear leveling in this invention refers specifically to wear leveling on object data servers equipped with SSD storage devices.
The wear leveling algorithm designed for the distributed full flash memory system comprises the following parts:
1. node level data placement selection algorithm:
Each object storage server (OSD) of the distributed full flash memory system plays two roles, layout owner and data owner. The layout owner stores the placement information of part of the data blocks (the layout owners together store the placement information of all data blocks based on a consistent hashing algorithm, which avoids data migration when nodes are added or removed); the layout owner location is determined by computing a hash key from the ino number of the file containing the data block (the inode number of the file in the storage system) and the data address. The data owner stores the content of the file data block, and is chosen by the layout owner according to the load condition and device wear condition of each node in the cluster. The layouts of the OSD nodes logically form a consistent hash ring, and each layout owner is responsible for storing the layouts of one segment of the hash key range. When an OSD node changes, only the layout information of the neighboring nodes needs to be adjusted, and no data migration is required.
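The consistent-hash lookup that maps a data block to its layout owner could be sketched as follows (MD5 and the ring tokens are assumptions for illustration; the patent does not specify the hash function):

```python
import bisect
import hashlib

def hash_key(ino: int, block_addr: int) -> int:
    # Hash key computed from the file's ino number and the data address
    return int(hashlib.md5(f"{ino}:{block_addr}".encode()).hexdigest(), 16)

class HashRing:
    """Each OSD owns one segment of the hash key range (consistent hashing)."""
    def __init__(self, osd_tokens):
        # osd_tokens: {osd_name: position on the ring}
        self.ring = sorted((pos, name) for name, pos in osd_tokens.items())

    def layout_owner(self, ino, block_addr):
        key = hash_key(ino, block_addr) % (2 ** 32)
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect_left(positions, key) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing({"OSD1": 10**9, "OSD2": 2 * 10**9, "OSD3": 3 * 10**9})
print(ring.layout_owner(ino=42, block_addr=4096))   # layout owner for this block
```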
The process of selecting OSD nodes in one IO process is shown in fig. 2:
A) The client computes a hash key from the ino and the block address, determines from the hash key that the layout information of the current data block is stored on OSD2, and sends the address information to OSD2 in a Getloc request.
B) OSD2 assigns the optimal data owner for the data block being accessed according to the load information and device wear condition of all OSD nodes in the current cluster, records the assignment locally and returns it to the client. The data owner is assigned according to the following principles:
i. the node with the lowest device wear in the cluster is assigned first;
ii. among several nodes with the lowest wear, a node different from the one assigned last time is preferred;
iii. for a read access or an update write, a layout has already been assigned, and the layout information is looked up directly on the local layout owner.
C) The client obtains the data owner returned by Getloc, namely OSD3, sends the read-write request to OSD3, and completes the IO operation on the file.
D) If the request is a write request, the device wear record of OSD3 is updated locally; every time an OSD sends a heartbeat request to the metadata server MDS, it synchronizes its local wear record to the MDS and then updates the locally stored wear information of all node devices according to the return value of the MDS, providing the basis for subsequent node selection (a sketch of this synchronization follows).
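The wear-record synchronization in step D) might look like the following sketch, where the heartbeat message format and the MDS merge behaviour are assumptions for illustration:

```python
# Illustrative wear-record sync piggybacked on the OSD -> MDS heartbeat.
local_wear = {"osd3": {"nvme0": 1021, "nvme1": 987}}   # write/erase counts per device
cluster_wear = {}                                       # wear info of all node devices

def on_heartbeat(send_to_mds):
    """Called periodically; returns the refreshed cluster-wide wear view."""
    global cluster_wear
    reply = send_to_mds({"type": "heartbeat", "wear": local_wear})
    cluster_wear = reply["cluster_wear"]   # MDS merges reports from all OSDs
    return cluster_wear

def on_local_write(device):
    # Each write to a local device bumps its wear record before the next heartbeat
    local_wear["osd3"][device] += 1
```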
2. Intra-node device level data placement selection algorithm:
While the file system runs, the OSD records the write wear condition of each local SSD device and each data block, and records the mapping between blocks and SSD devices. When data is written, a suitable device is selected by the intra-node device-level data placement selection algorithm and the block-to-device mapping is stored; when data is read, the corresponding storage device is found by querying that mapping.
The intra-node device-level data placement selection algorithm comprises the following steps (a sketch of this selection follows the list):
a) each device builds a data structure recording its written blocks and idle blocks, and records the write counts of all blocks;
b) when writing data, a suitable device must be selected for the write:
i. first, the storage device with the fewest written blocks is chosen as the device for writing the data block;
ii. among several devices with the same number of written blocks, their write counts are compared and the device with the fewest writes is chosen;
iii. among several devices with the same write count, the average write count is compared and the device with the smallest average is chosen;
iv. if the averages are also the same, a storage device is selected by round-robin as the device for writing the data block;
c) when data is deleted, the corresponding device is looked up, the deleted data block is reclaimed on that device, and the available capacity of the device is updated.
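A minimal sketch of the device-selection cascade above, assuming each device exposes its number of written blocks, total write count and average write count (the data layout is illustrative, not from the patent):

```python
import itertools

_rr = itertools.count()   # round-robin tie-breaker of last resort

def select_device(devices):
    """devices: {name: {"written_blocks": int, "total_writes": int,
                        "avg_writes": float}}"""
    def key(name):
        d = devices[name]
        # Fewest written blocks, then fewest total writes, then smallest average
        return (d["written_blocks"], d["total_writes"], d["avg_writes"])
    best = min(key(n) for n in devices)
    candidates = sorted(n for n in devices if key(n) == best)
    # If still tied, fall back to round-robin among the tied devices
    return candidates[next(_rr) % len(candidates)]

devices = {
    "nvme0": {"written_blocks": 120, "total_writes": 5000, "avg_writes": 41.6},
    "nvme1": {"written_blocks": 118, "total_writes": 5200, "avg_writes": 44.0},
}
print(select_device(devices))   # -> "nvme1" (fewest written blocks)
```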
3. Intra-device block-level data placement selection algorithm:
As shown in fig. 3, the intra-device block-level data selection algorithm classifies and manages data blocks according to whether they store data, organizing them into a work_list and a free_list: the work_list holds the blocks storing valid data in ascending order of write count, and the free_list holds the idle blocks without data in ascending order of write count.
When data is written, the write goes to the selected device; inside the device, the block with the fewest writes in the free_list is chosen for the write, and the block is then inserted into the work_list according to its write count.
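A minimal sketch of this block-level allocation, keeping both lists ordered by write count (the Block structure is an assumption for illustration):

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class Block:
    writes: int
    block_id: int = field(compare=False)

free_list = [Block(3, 7), Block(5, 2), Block(9, 4)]   # idle blocks, ascending writes
work_list = []                                         # blocks holding valid data

def allocate_block(data):
    # Pick the least-written idle block, write to it, and insert it
    # into work_list at the position given by its new write count
    blk = free_list.pop(0)
    blk.writes += 1
    # ... program `data` into blk on the device ...
    bisect.insort(work_list, blk)
    return blk.block_id
```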
4. Intra-device block exchange selection algorithm:
some data is deleted after a period of time when the data is written into the storage device, and the written blocks are also put into the free_list again, but some blocks are valid for a long time. This results in that some blocks of the data which are permanently valid are not replaced, while other blocks of the data are worn to an increased extent due to frequent data movement, so that it is necessary to exchange blocks of the work_list which are not accessed for a long time with blocks of the free_list which are severely worn, according to the actual situation.
The specific steps are as follows (a sketch follows the list):
a) record in the work_list the time at which the current block was inserted into the work_list, its access count and its last access time;
b) periodically poll the blocks in the work_list and the free_list and count the total writes of the blocks in each list; when the average write count of the blocks in the work_list is greater than the average write count of the blocks in the free_list, or when a block in the work_list has not been accessed for a long time and its write count is smaller than the average write count of the blocks in the free_list, exchange the least-accessed block with the fewest writes in the work_list with the block with the most writes in the free_list.
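A minimal sketch of this periodic exchange, under the assumed reading that the cold, least-written block of the work_list is swapped with the most-written block of the free_list; the idle threshold and field names are illustrative, not from the patent:

```python
import time

IDLE_SECONDS = 7 * 24 * 3600   # assumed threshold for "not accessed for a long time"

def maybe_swap(work_list, free_list):
    """Each entry: {"id": int, "writes": int, "accesses": int, "last_access": float}."""
    if not work_list or not free_list:
        return
    avg_work = sum(b["writes"] for b in work_list) / len(work_list)
    avg_free = sum(b["writes"] for b in free_list) / len(free_list)
    # Coldest candidate in work_list: least accessed, then fewest writes
    cold = min(work_list, key=lambda b: (b["accesses"], b["writes"]))
    idle_too_long = time.time() - cold["last_access"] > IDLE_SECONDS
    if avg_work > avg_free or (idle_too_long and cold["writes"] < avg_free):
        hot_free = max(free_list, key=lambda b: b["writes"])
        # Move the cold data onto the worn free block and recycle the cold block
        work_list.remove(cold); free_list.remove(hot_free)
        work_list.append(hot_free)   # heavily worn block now holds the cold data
        free_list.append(cold)       # lightly worn block becomes available for new writes
```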
The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the technical details mentioned in this embodiment can also be applied to the above embodiment.
The invention also provides a wear leveling system for a distributed full flash memory system, wherein the distributed full flash memory system comprises a client, storage servers and a metadata server, and the wear leveling system comprises:
a request initiating module for causing the client to initiate a write request and determining the layout storage server holding the layout information of the data block according to the hash value of the write request;
a first wear leveling module for assigning, according to the load information and device wear condition of all storage servers, the storage server with the lowest device wear among all storage servers as the data owner storage server of the data block;
a request execution module for causing the client to send the write request to the data owner storage server so as to write the data block into it, wherein the data owner storage server updates its local wear record, synchronizes the local wear record to the metadata server when sending a heartbeat request to the metadata server, and updates the locally stored wear information of all node devices according to the return value of the metadata server;
wherein the first wear leveling module is further configured to: if several storage servers share the lowest device wear, select the storage server that has not been assigned for the longest time from among them as the data owner storage server.
In the wear leveling system for the distributed full flash memory system,
the storage server records the write wear condition of each local SSD device and of each data block, and records the mapping relation between data blocks and SSD devices; a data structure is built for each SSD device to record its in-use blocks and idle blocks, and the write counts of all blocks are recorded;
the request execution module includes:
a second wear leveling module for causing the data owner storage server to determine whether there is only one SSD device with the fewest written blocks; if so, the data block is written into that SSD device, otherwise the data block is written into the SSD device with the fewest total writes among all SSD devices with the fewest written blocks.
In the wear leveling system for the distributed full flash memory system, the SSD device classifies and manages data blocks according to whether they store data, organizing them into a work_list and a free_list: the work_list organizes the blocks storing valid data in ascending order of write count, and the free_list organizes the idle blocks holding no data in ascending order of write count;
the second wear leveling module includes:
a third wear leveling module for selecting, when the data block is written into the SSD device, the block with the fewest writes in the free_list of that device for writing, and inserting the block into the work_list of the device according to its write count.
The wear leveling system for the distributed full flash memory system, wherein the third wear leveling module comprises:
a fourth wear leveling module for recording in the work_list the time at which the current data block was inserted into the work_list, its access count and its last access time; periodically polling the data blocks in the work_list and the free_list and counting the total writes of the blocks in each list; and, when the average write count of the blocks in the work_list is greater than the average write count of the blocks in the free_list, or when a block in the work_list has not been accessed for a long time and its write count is smaller than the average write count of the blocks in the free_list, exchanging the least-accessed block with the fewest writes in the work_list with the block with the most writes in the free_list.
The invention also provides a storage medium storing a program for executing any one of the above wear leveling methods for a distributed full flash memory system.
The invention also provides a client for use with any one of the above wear leveling systems for a distributed full flash memory system.

Claims (10)

1. A wear leveling method for a distributed full flash memory system, characterized by comprising the following steps:
Step 1: the distributed full flash memory system comprises a client, storage servers and a metadata server; the client initiates a write request, and the layout storage server holding the layout information of the data block is determined according to the hash value of the write request;
Step 2: according to the load information and device wear condition of all storage servers, the layout storage server assigns the storage server with the lowest device wear among all storage servers as the data owner storage server of the data block;
Step 3: the client sends the write request to the data owner storage server so as to write the data block into it; the data owner storage server updates its local wear record, synchronizes the local wear record to the metadata server when sending a heartbeat request to the metadata server, and updates the locally stored wear information of all node devices according to the return value of the metadata server;
wherein step 2 comprises: if several storage servers share the lowest device wear, the storage server that has not been assigned for the longest time is selected from among them as the data owner storage server.
2. The wear leveling method for a distributed full flash memory system as claimed in claim 1, wherein
the storage server records the write wear condition of each local SSD device and of each data block, and records the mapping relation between data blocks and SSD devices; a data structure is built for each SSD device to record its in-use blocks and idle blocks, and the write counts of all blocks are recorded;
the step 3 comprises the following steps:
Step 31: the data owner storage server determines whether there is only one SSD device with the fewest written blocks; if so, the data block is written into that SSD device, otherwise the data block is written into the SSD device with the fewest total writes among all SSD devices with the fewest written blocks.
3. The wear leveling method for a distributed full flash memory system as claimed in claim 2, wherein the SSD device classifies and manages data blocks according to whether they store data, organizing them into a work_list and a free_list: the work_list organizes the blocks storing valid data in ascending order of write count, and the free_list organizes the idle blocks holding no data in ascending order of write count;
this step 31 comprises:
Step 311: when the data block is written into the SSD device, the block with the fewest writes in the free_list of that device is selected for writing, and the block is then inserted into the work_list of the device according to its write count.
4. The wear leveling method for a distributed full flash memory system as claimed in claim 3, wherein said step 311 comprises:
Step 3111: record in the work_list the time at which the current data block was inserted into the work_list, its access count and its last access time; periodically poll the data blocks in the work_list and the free_list and count the total writes of the blocks in each list; when the average write count of the blocks in the work_list is greater than the average write count of the blocks in the free_list, or when a block in the work_list has not been accessed for a long time and its write count is smaller than the average write count of the blocks in the free_list, exchange the least-accessed block with the fewest writes in the work_list with the block with the most writes in the free_list.
5. A wear leveling system for a distributed full flash memory system, the distributed full flash memory system comprising a client, storage servers and a metadata server, the wear leveling system comprising:
a request initiating module for causing the client to initiate a write request and determining the layout storage server holding the layout information of the data block according to the hash value of the write request;
a first wear leveling module for assigning, according to the load information and device wear condition of all storage servers, the storage server with the lowest device wear among all storage servers as the data owner storage server of the data block;
a request execution module for causing the client to send the write request to the data owner storage server so as to write the data block into it, wherein the data owner storage server updates its local wear record, synchronizes the local wear record to the metadata server when sending a heartbeat request to the metadata server, and updates the locally stored wear information of all node devices according to the return value of the metadata server;
wherein the first wear leveling module is further configured to: if several storage servers share the lowest device wear, select the storage server that has not been assigned for the longest time from among them as the data owner storage server.
6. The wear leveling system for a distributed full flash memory system as claimed in claim 5, wherein
the storage server records the write wear condition of each local SSD device and of each data block, and records the mapping relation between data blocks and SSD devices; a data structure is built for each SSD device to record its in-use blocks and idle blocks, and the write counts of all blocks are recorded;
the request execution module includes:
a second wear leveling module for causing the data owner storage server to determine whether there is only one SSD device with the fewest written blocks; if so, the data block is written into that SSD device, otherwise the data block is written into the SSD device with the fewest total writes among all SSD devices with the fewest written blocks.
7. The wear leveling system for a distributed full flash memory system as claimed in claim 6, wherein the SSD device classifies and manages data blocks according to whether they store data, organizing them into a work_list and a free_list: the work_list organizes the blocks storing valid data in ascending order of write count, and the free_list organizes the idle blocks holding no data in ascending order of write count;
the second wear leveling module includes:
a third wear leveling module for selecting, when the data block is written into the SSD device, the block with the fewest writes in the free_list of that device for writing, and inserting the block into the work_list of the device according to its write count.
8. The wear leveling system for a distributed full flash memory system in accordance with claim 7, wherein said third wear leveling module comprises:
a fourth wear leveling module for recording in the work_list the time at which the current data block was inserted into the work_list, its access count and its last access time; periodically polling the data blocks in the work_list and the free_list and counting the total writes of the blocks in each list; and, when the average write count of the blocks in the work_list is greater than the average write count of the blocks in the free_list, or when a block in the work_list has not been accessed for a long time and its write count is smaller than the average write count of the blocks in the free_list, exchanging the least-accessed block with the fewest writes in the work_list with the block with the most writes in the free_list.
9. A storage medium storing a program for executing the wear leveling method for a distributed full flash memory system according to any one of claims 1 to 4.
10. A client for a wear leveling system for a distributed full flash memory system according to any one of claims 5 to 8.
CN202210471090.7A 2022-04-28 2022-04-28 Wear balancing method and system for distributed full flash memory system Pending CN116364148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210471090.7A CN116364148A (en) 2022-04-28 2022-04-28 Wear balancing method and system for distributed full flash memory system


Publications (1)

Publication Number Publication Date
CN116364148A 2023-06-30

Family

ID=86926146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210471090.7A Pending CN116364148A (en) 2022-04-28 2022-04-28 Wear balancing method and system for distributed full flash memory system

Country Status (1)

Country Link
CN (1) CN116364148A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116991336A (en) * 2023-09-26 2023-11-03 北京大道云行科技有限公司 GC method for distributed full-flash object storage system, electronic equipment and storage medium
CN116991336B (en) * 2023-09-26 2024-01-23 北京大道云行科技有限公司 GC method for distributed full-flash object storage system, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination