WO2021208239A1 - A low-latency file system address space management method, system and medium - Google Patents

A low-latency file system address space management method, system and medium

Info

Publication number
WO2021208239A1
WO2021208239A1 (application PCT/CN2020/097671; CN2020097671W)
Authority
WO
WIPO (PCT)
Prior art keywords
file
data
block
data block
block group
Prior art date
Application number
PCT/CN2020/097671
Other languages
English (en)
French (fr)
Inventor
陈志广
卢宇彤
肖侬
Original Assignee
中山大学 (Sun Yat-sen University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学 (Sun Yat-sen University)
Priority to US17/638,196 (published as US11853566B2)
Publication of WO2021208239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0643 Management of files
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing

Definitions

  • The present invention belongs to the field of large-scale storage file systems, and specifically relates to a low-latency file system address space management method, system and medium that, by adopting a new data structure to manage the physical address space of the file system, significantly reduces file read and write latency.
  • The local file system for storage devices is the foundation of all data storage and management systems: mainstream databases are built on file systems; distributed file systems must rely on local file systems to organize data on devices; and big data storage systems represented by HBase and Dynamo must be built on distributed file systems or directly call the local file system to read and write data on the storage device. In short, the performance of the local file system has a vital impact on all data storage and management systems.
  • the local file system can be roughly divided into two components in structure: name space management and address space management.
  • name space management is to maintain the directory structure and provide users with an operating interface for the file system.
  • Address space management is mainly responsible for the organization of data on storage devices. Generally speaking, no matter what technical means are used to implement the storage device, the device manufacturer abstracts the physical address space of the device as a linear logical address space, and the address space management module is responsible for organizing user data and file system metadata in this linear logical address space.
  • each file is abstracted into a linear byte stream, which is divided into fixed-length blocks.
  • the Ext4 file system divides files into 1KB by default. And optionally allow users to divide the file into 2KB or 4KB blocks.
  • the logical address space of the underlying storage device is also divided into fixed-size blocks.
  • The minimum read and write unit of a disk is 512 bytes, but the file system can format the disk into blocks of 1KB, 2KB, 4KB, etc.
  • An important function of file system address space management is to maintain the correspondence between the blocks of the file byte stream and the blocks of the logical address space of the storage device.
  • For example, a 1MB file can be divided into 1024 1KB blocks. Assuming that the underlying storage device is also divided into 1KB blocks, the file occupies 1024 blocks on the storage device, and the correspondence between the file's blocks and the 1024 device blocks it occupies is maintained by the address space management module.
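  • As a minimal illustration (the table and names below are invented for this sketch, not taken from the patent), the address space management module's core job can be pictured as a per-file table mapping file block indices to device block numbers:

```python
# One file's blocks and the device blocks they occupy (values are invented):
device_blocks_for_file = [7, 8, 9, 42]

def device_block(file_offset, block_size=1024):
    # The byte at file_offset lives in file block file_offset // block_size,
    # which the table maps to a device block number.
    return device_blocks_for_file[file_offset // block_size]

print(device_block(0))     # 7
print(device_block(3500))  # 42 (byte 3500 falls in file block 3)
```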
  • the early FAT file system organized data blocks on the disk in the form of link pointers, that is, at the end of each data block there is a pointer to the next data block, and all data blocks are linked by the pointer.
  • This data block organization is logically simple but inefficient. For a file, no matter which data block is read, it needs to be searched sequentially from the first data block. The read and write performance gradually decreases as the file becomes larger.
  • Ext series file systems (such as Ext2, Ext3, Ext4) maintain special pointers to record data blocks occupied by files.
  • 15 pointers are reserved in the inode of each file, 12 of which point directly to data blocks and are called direct pointers. If the data blocks pointed to by these 12 pointers are not enough to store the file's data (i.e., the file size exceeds 12 data blocks), the 13th pointer is enabled; it is called the primary indirect pointer.
  • A primary indirect pointer points to a data block on the storage device that stores not user data but pointers to other data blocks.
  • 1024 pointers can be stored in the data block pointed to by a primary indirect pointer, and these pointers point to 1024 data blocks, thereby significantly increasing the size of a single file the file system can support. If the primary indirect pointer still cannot meet the needs of the file, that is, the file exceeds 1036 (1024+12) data blocks, the 14th pointer is enabled; it is called the secondary indirect pointer.
  • the secondary indirect pointer points to a data block on the storage device. The data block does not store user data, but a large number of primary indirect pointers.
  • each data block can contain 1024 pointers.
  • each secondary indirect pointer can index 1024 primary indirect pointers, and each primary indirect pointer can index 1024 data blocks.
  • The largest file supported by the file system is thereby increased by another 1024 × 1024 data blocks.
  • the 15th pointer in the inode can be enabled, that is, the tertiary indirect pointer.
  • The data block pointed to by the tertiary indirect pointer contains 1024 secondary indirect pointers.
  • A file in the Ext series file systems thus contains up to 12 + 1024 + 1024 × 1024 + 1024 × 1024 × 1024 data blocks; if each data block is 4KB, a single file of roughly 4TB can be supported, which meets application requirements in most scenarios.
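  • The pointer arithmetic above can be checked with a short calculation (a sketch assuming 4KB blocks and 1024 pointers per indirect block, as in the text):

```python
PTRS_PER_BLOCK = 1024   # 4-byte pointers stored in one 4KB indirect block
BLOCK = 4 * 1024        # data block size in bytes

# direct + primary indirect + secondary indirect + tertiary indirect
total_blocks = 12 + PTRS_PER_BLOCK + PTRS_PER_BLOCK**2 + PTRS_PER_BLOCK**3
print(total_blocks)                   # 1074791436 data blocks at most
print(total_blocks * BLOCK // 2**40)  # 4, i.e. roughly a 4TB single file
```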
  • BtrFS uses a B+ tree to index all data blocks occupied by each file: a B+ tree is created for each file, and the data blocks occupied by the file are inserted into the corresponding B+ tree. When reading or writing, the file system searches the B+ tree until the corresponding data block is found.
  • The above technique also struggles with large files: as a file grows, the corresponding B+ tree becomes deeper and deeper, the latency of searching the B+ tree for the target data block grows, and file read and write performance decreases.
  • the Ext4 file system proposes the concept of Extent, which reduces the index overhead of data blocks and avoids multiple indirect pointers (such as secondary indirect pointers and tertiary indirect pointers).
  • the Extent is variable-length, and specifically can be an integer multiple of the fixed-length data block.
  • A small number of long Extents can be used to record the data blocks occupied on the storage device. Because each Extent is relatively long, a small number of Extents can represent a large file, and the corresponding data block can be found with only a small retrieval overhead when reading and writing the file's data.
  • Current file systems struggle to meet the low-latency requirements of data access when facing large files, and data access latency is a key factor affecting application performance in many scenarios.
  • many applications merge a large number of small files into one large file and save it in the file system.
  • the random access of these small files by the upper application becomes the random access to the different offset addresses of the large file.
  • The random access delay of the large file thus becomes a key factor affecting application performance.
  • Applications often issue discontinuous, fixed-interval read and write requests to large files; in this scenario, applications place high requirements on data access latency (rather than bandwidth). Facing such application scenarios, it is of great significance to design a low-latency local file system.
  • In view of the above-mentioned problems in the prior art, the technical problem to be solved by the present invention is to provide a low-latency file system address space management method, system and medium.
  • The present invention ensures that a single IO operation suffices to determine which data blocks a file occupies on the storage device, instead of requiring up to 4 IO operations to read indirect pointers as in the Ext file systems. This optimization significantly reduces file read and write latency and the addressing overhead of file reads and writes, and significantly improves the sequentiality of file reads and writes, thereby improving read and write performance.
  • A low-latency file system address space management method, whose implementation steps include:
  • Format the address space of the storage device to contain a super block and block group allocation tables; the super block stores file system information and the allocation of block groups in the linear address space of the storage device, and each block group allocation table marks the allocation of data blocks in the corresponding block group;
  • The size of the specified data block is an integer power of two, in KB.
  • the detailed steps of dynamically creating or selecting a corresponding block group and allocating data blocks according to the specified data block size in step 2) include:
  • A1) Determine whether a block group of the specified data block size still has free data blocks; if such a block group exists, take it as the target block group; if not, create a new block group with the specified data block size and take the new block group as the target block group;
  • Step A3) Determine whether the file has been completely written; if not, jump to step A1); otherwise, end and exit.
  • the detailed steps of dynamically creating or selecting a corresponding block group and allocating data blocks according to the specified data block size in step 2) include:
  • Step B3) For the created file, allocate data blocks in the target block group and record the number of allocated data blocks, write file data into the allocated data blocks and update the block group allocation table and the super block information; when the number of data blocks allocated at the current size is not less than the preset threshold n, jump to step B4);
  • Step B4) Determine whether the file has been completely written; if not, select from the data block sizes a_0 to a_m the size adjacent to and larger than the current data block size a_i as the new current data block size a_i, and jump to step B2); otherwise, end and exit.
  • After step B4), the step of calculating the file size FileSize is also included; FileSize is calculated as:
  • FileSize = n × a_0 × (2^⌊N/n⌋ − 1) + (N mod n) × a_0 × 2^⌊N/n⌋
  • where a_0 is the specified minimum data block size, N is the total number of data blocks occupied by the file, and n is the number of data blocks allocated to the file at each data block size.
  • After step 2), the method also includes the step of reading a file or overwriting a file: first, through one IO operation, read the data block pointer table corresponding to the file from the storage device into memory; the data block pointer table is a table composed of the block numbers of the N data blocks occupied by the file and is stored in the index node information of the file system. Then the data block holding the data to be read or overwritten is calculated according to the data block pointer table, and data is read from or written to the calculated data block.
  • After step 2), the method also includes the step of appending to a file:
  • Step C2) Determine whether the remaining space of the last of the N data blocks occupied by file f is greater than or equal to the length l of the data to be appended; if so, jump to step C3); otherwise, jump to step C4);
  • Step C6) Increase the number N of data blocks occupied by file f by 1, and jump to step C2).
  • The expression for calculating the remaining space R of the last data block among the N data blocks occupied by file f is:
  • R = n × a_0 × (2^⌊N/n⌋ − 1) + (N mod n) × a_0 × 2^⌊N/n⌋ − L
  • where a_0 is the specified minimum data block size, N represents the total number of data blocks occupied by the file, n represents the number of data blocks allocated to the file at each data block size, and L represents the existing length of file f.
  • The present invention also provides a low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory; the microprocessor of the computer device is programmed or configured to execute the steps of the low-latency file system address space management method described above, or the memory of the computer device stores a computer program programmed or configured to execute the low-latency file system address space management method.
  • the present invention also provides a computer-readable storage medium storing a computer program programmed or configured to execute the low-latency file system address space management method.
  • the present invention has the following advantages:
  • the address space management method proposed by the present invention can significantly reduce the addressing overhead of file reading and writing.
  • When reading and writing a file, the file system needs to determine, according to the offset, the data block on the storage device where the data resides. This process is called the addressing process.
  • the addressing process of the Ext file system is the process of finding multi-level indirect pointers, and the addressing process of BtrFS is the process of searching the B+ tree. According to the above analysis, the above process requires multiple IO operations, which will significantly increase the latency of reading and writing data.
  • The present invention reduces the total number of data blocks contained in a file by using variable-length data blocks, and controls that total within a few hundred, so that all of a file's data block pointers can be read from the storage device through a single IO operation, significantly reducing the IO operations in the addressing process. Furthermore, because the data block pointers occupy little space, the file system can even save them in the inode of the file and read them from the storage device together with the inode information, thereby avoiding IO operations in the addressing process altogether. In a word, the file system designed by the present invention significantly reduces IO operations in the addressing process, thus reducing file read and write latency.
  • the present invention can significantly improve the sequence of file reading and writing, thereby improving the reading and writing performance.
  • Traditional file systems generally use fixed-length data blocks, such as 4KB data blocks, which will cause a file to occupy a large number of data blocks. For example, a 4GB file will occupy one million 4KB data blocks.
  • Although the file system adopts many optimization measures to improve the continuity, on the storage device, of data blocks belonging to the same file, multiple applications may issue concurrent write operations to the file system during the same period, which makes storing a file's data blocks contiguously on the storage device very difficult. For most storage devices, sequential read and write performance is significantly better than random read and write, so improving the continuity of file storage is critical to performance optimization.
  • The invention ensures that a file of any size can be stored in at most a few hundred data blocks, so the continuity of file storage is significantly better than that of traditional file systems, and the sequential read and write advantages of the storage device can be fully exploited.
  • Fig. 1 is a schematic diagram of the basic flow of a method according to an embodiment of the present invention.
  • Fig. 2 is a schematic diagram of the address space layout of storage devices in an embodiment of the present invention.
  • Fig. 3 is a schematic diagram of the process of appending to a file in an embodiment of the present invention.
  • the key to the implementation of the present invention is how to organize data blocks in the linear address space of the storage device and how to read and write data from the file system.
  • reading the data of a file is relatively simple, and only needs to calculate the data block that should be read according to the offset of the read data in the file and the data block pointer table of the file.
  • Overwriting is similar to reading data, so the implementation of these two cases will not be discussed in detail here. The following focuses on the organization of data blocks and the implementation of append writes.
  • the implementation steps of the low-latency file system address space management method of this embodiment include:
  • Format the address space of the storage device to contain a super block and block group allocation tables; the super block stores file system information and the allocation of block groups in the linear address space of the storage device, and each block group allocation table marks the allocation of data blocks in the corresponding block group;
  • Traditional file systems generally format storage devices into fixed-length data blocks, and then divide a large number of data blocks into several block groups.
  • the Ext4 file system formats the storage device into optional 1KB, 4KB, and 8KB data blocks, and then determines the block group size according to the data block size, and finally divides the entire storage device into several block groups.
  • XFS also divides the storage space of the device into allocation groups, each of which is equivalent to an independent file system.
  • the common feature of the above file systems is that the length of the data block is fixed. In order to avoid the waste of space caused by small files, the size of the data block is generally set to be less than 8KB.
  • This embodiment proposes a method of using variable-length data blocks, that is, dividing the address space of the storage device into block groups.
  • the data blocks inside each block group are fixed-length, but different block groups can use data blocks of different sizes to create files.
  • In step 2), the corresponding block group is dynamically created or selected and data blocks are allocated according to the specified data block size.
  • The data blocks in different block groups can be set to integer powers of 2 such as 1KB, 2KB, 4KB, 8KB, and so on; the maximum data block can even be set to 512MB or 1GB as required.
  • the file system does not need to establish a block group during initialization, but dynamically establishes a block group according to the requirements of the file during the process of receiving the user's creation of the file. Specifically, when creating a file, the user obtains data blocks from different block groups as needed. Once the data blocks in a block group are allocated, they need to create a block group with the same size data blocks again. Once a block group has been established on the storage device, its internal data block size is set and cannot be changed. However, after a block group is destroyed, the storage space it occupies can be re-allocated to other block groups, or a new block group can be created. At this time, the data block size in the block group can be reset.
  • Because the data block size in a block group can be specified, data blocks can be very large (unlike traditional file systems, which generally use data blocks below 16KB), so the file system can store a large file with fewer data blocks. This optimization ensures that large files do not need multiple levels of indirect pointers to index data blocks, thereby avoiding the multiple IO operations that traditional file systems issue to fetch indirect pointers when reading and writing large files.
  • Figure 2 shows the method of dividing the address space of the storage device in this embodiment.
  • the superblock mainly records the file system information.
  • The present invention also records, in the super block, the allocation of block groups in the linear address space of the storage device.
  • each block group allocation table is 4KB.
  • A large number of block groups follow the block group allocation tables, where each block group corresponds to one of the preceding block group allocation tables.
  • The block group allocation table in this embodiment is used to identify the allocation of data blocks in the corresponding block group: one bit in the block group allocation table marks one data block in the block group, a bit of 0 indicating that the corresponding data block is not allocated and a bit of 1 indicating that it has been allocated. Because a block group allocation table is 4KB and thus contains 32768 bits in total, each block group contains 32768 data blocks.
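  • The block group allocation table described above is a plain bitmap; a minimal sketch (function names are illustrative, not from the patent) of marking and testing allocation bits in a 4KB table:

```python
TABLE_BYTES = 4 * 1024                 # one block group allocation table is 4KB
BLOCKS_PER_GROUP = TABLE_BYTES * 8     # so each block group holds 32768 blocks

table = bytearray(TABLE_BYTES)         # every bit 0: all data blocks free

def set_allocated(blk):
    table[blk // 8] |= 1 << (blk % 8)  # set the block's bit to 1 (allocated)

def is_allocated(blk):
    return bool(table[blk // 8] & (1 << (blk % 8)))

set_allocated(5)
print(BLOCKS_PER_GROUP)                  # 32768
print(is_allocated(5), is_allocated(6))  # True False
```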
  • step 2) dynamically create or select the corresponding block group and allocate the data block according to the specified data block size.
  • a fixed specified data block size can be used.
  • The detailed steps of dynamically creating or selecting the corresponding block group and allocating data blocks according to the specified data block size in step 2) include:
  • A1) Determine whether a block group of the specified data block size still has free data blocks; if such a block group exists, take it as the target block group; if not, create a new block group with the specified data block size and take the new block group as the target block group;
  • Step A3) Determine whether the file has been completely written; if not, jump to step A1); otherwise, end and exit.
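  • Steps A1) through A3) can be sketched as a loop that draws blocks from a group of the specified size and creates a fresh group whenever the current one is exhausted (a simplified model; the class and function names are invented, and each group holds 32768 blocks as in the embodiment):

```python
GROUP_BLOCKS = 32768  # data blocks per block group, as in the embodiment

class BlockGroup:
    def __init__(self, block_size):
        self.block_size = block_size
        self.free = GROUP_BLOCKS

def allocate(blocks_needed, block_size, groups):
    # A3)-style loop: keep allocating until the file's blocks are all placed.
    for _ in range(blocks_needed):
        # A1): find a group of this block size with free blocks, else create one.
        target = next((g for g in groups
                       if g.block_size == block_size and g.free > 0), None)
        if target is None:
            target = BlockGroup(block_size)
            groups.append(target)
        target.free -= 1  # allocate one data block in the target group
    return groups

groups = allocate(40000, 4096, [])
print(len(groups))  # 2: 32768 blocks fill the first group, 7232 go to the second
```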
  • step 2) the detailed steps of dynamically creating or selecting corresponding block groups and allocating data blocks according to the specified data block size include:
  • Step B3) For the created file, allocate data blocks in the target block group and record the number of allocated data blocks, write file data into the allocated data blocks and update the block group allocation table and the super block information; when the number of data blocks allocated at the current size is not less than the preset threshold n, jump to step B4);
  • Step B4) Determine whether the file has been completely written; if not, select from the data block sizes a_0 to a_m the size adjacent to and larger than the current data block size a_i as the new current data block size a_i, and jump to step B2); otherwise, end and exit.
  • As data is written to a file, the number of data blocks it occupies keeps increasing. Once the number of 1KB data blocks occupied by the file exceeds a certain threshold (assumed to be n), 2KB data blocks are allocated for it; as data is further written and the number of 2KB data blocks occupied also exceeds the threshold n, 4KB data blocks are allocated; and so on. As the file continues to grow, the size of the data blocks allocated to it increases exponentially, but only n data blocks of each size are allocated.
  • the above-mentioned data block allocation principle ensures that when the number of data blocks occupied by a file increases linearly, the file size can increase exponentially, so that a large file can be indexed with fewer data block pointers.
  • the file size FileSize can be written in the form of a geometric sequence summation.
  • The step of calculating the file size FileSize is also included; FileSize is calculated as:
  • FileSize = n × a_0 × (2^⌊N/n⌋ − 1) + (N mod n) × a_0 × 2^⌊N/n⌋
  • where a_0 is the specified minimum data block size, N is the total number of data blocks occupied by the file, and n is the number of data blocks allocated to the file at each data block size.
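  • The FileSize computation can be sketched as a small function implementing the geometric-series summation, assuming ⌊N/n⌋ full tiers of n blocks each plus N mod n blocks at the next size (the function name is illustrative):

```python
def file_size(N, n, a0):
    # Geometric-series summation: q full tiers of n blocks whose sizes
    # double from a0, plus the remaining N mod n blocks at the next size up.
    q, r = divmod(N, n)
    return n * a0 * (2**q - 1) + r * a0 * 2**q

# With a0 = 1KB and n = 10, only 300 data blocks already index just under 10TB:
print(file_size(300, 10, 1024))  # 10 * 1024 * (2**30 - 1) bytes
```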
  • After step 2), the method also includes the step of reading a file: first, through one IO operation, read the data block pointer table corresponding to the file from the storage device into memory; the data block pointer table is a table composed of the block numbers of the N data blocks occupied by the file and is stored in the index node (inode) information of the file system. Then the data block holding the requested data is calculated according to the data block pointer table, and the data is read from the calculated data block.
  • the space occupied by the table does not exceed 512 bytes, so it can be read from the storage device to the memory through a single IO operation.
  • the file system may save the table in the inode information, and read the table together when reading the file inode, further reducing the IO operations required for reading user data.
  • the data block where the data at any offset of the file is located can be calculated, and then the user data can be directly read from the calculated data block.
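  • Under this tiered layout, translating a byte offset into an entry of the data block pointer table is pure arithmetic; a sketch (illustrative names, assuming n blocks at each size a0, 2·a0, 4·a0, and so on):

```python
def locate(offset, n, a0):
    # Skip whole tiers of n blocks; each tier doubles the block size.
    tier, size = 0, a0
    while offset >= n * size:
        offset -= n * size
        size *= 2
        tier += 1
    # Index into the data block pointer table, and offset within that block.
    return tier * n + offset // size, offset % size

print(locate(0, 10, 1024))          # (0, 0): start of the first 1KB block
print(locate(10 * 1024, 10, 1024))  # (10, 0): start of the first 2KB block
```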
  • After step 2), the method also includes the step of overwriting a file: first, through one IO operation, read the data block pointer table corresponding to the file from the storage device into memory; the data block pointer table is a table composed of the block numbers of the N data blocks occupied by the file and is stored in the index node information of the file system. Then the data block holding the data to be overwritten is calculated according to the data block pointer table, and the data is overwritten in the calculated block.
  • the first case is overwriting, i.e., for a file, the original data on already-allocated data blocks is overwritten. In this case there is no need to allocate new data blocks for the file; writing data is then similar to the reading case discussed above: simply calculate the target data block to be written according to the data block pointer table.
  • the second case is append writing, i.e., more data is written at the end of a file. In this case new data blocks must be allocated to the file. As shown in Fig. 3, after step 2) this embodiment includes a step of append-writing a file:
  • step C2) determine whether the remaining space of the last of the N data blocks occupied by file f is greater than or equal to the write data length l of the append request; if so, jump to step C3); otherwise, jump to step C4);
  • step C6) increase the number N of data blocks occupied by file f by 1, and jump to step C2).
  • the last data block occupied by file f may not be full of data. If the remaining space of the last data block is greater than l, the appended data can be written into the file without requesting a new data block. Because file f occupies N data blocks in total, the storage space it occupies can be calculated as: n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋, where:
  • N represents the total number of data blocks occupied by a file, and
  • n represents the number of data blocks allocated for a file in each class of data block.
  • assuming the existing length of file f is L, the remaining space of the last data block is this capacity minus L; step C2) compares that value with l, and if it is greater than or equal to l, there is still enough free space.
  • in step C4), because the last data block of file f does not have enough space to receive data of length l, only part of the appended data can be written into the last data block.
  • the amount written into the last data block is likewise the remaining space given above.
  • the existing data blocks are n 1KB data blocks, n 2KB data blocks, n 4KB data blocks, etc., arranged in sequence according to the data block allocation principle proposed in this embodiment.
  • the size of the data block to be allocated next is a₀ · 2^⌊N/n⌋; a request is made to the block group whose data block size is a₀ · 2^⌊N/n⌋, and once the data block is successfully allocated, the data to be appended can be written into the newly allocated data block. Therefore, when step C5) allocates a new data block for file f, the size of the allocated data block is a₀ · 2^⌊N/n⌋, where N represents the total number of data blocks occupied by a file and n represents the number of data blocks allocated for a file in each class of data block.
  • as the upper-layer application keeps writing data into the file system, the data blocks in each block group are gradually consumed.
  • when the free data blocks in a block group fall below a set threshold, a new block group with data blocks of the same size needs to be allocated.
  • at this time, a free region can be carved out of the storage device, a new block group established, and the information of the block group recorded in the super block of the file system.
  • in a file system, when a file is deleted, the data blocks it occupies are also reclaimed. If other files need to write data, these reclaimed data blocks are allocated first. However, when the size distribution of the files stored in the file system changes significantly, a large number of data blocks in a block group may be reclaimed but never reallocated to other files. For example, a file system may store a large number of small files early on, causing it to establish many block groups containing small data blocks (such as block groups of 1KB data blocks and block groups of 2KB data blocks); later, as the load of the upper-layer application changes, the small files are gradually deleted and a large number of large files are created instead.
  • at this point, the file system needs to create many block groups containing large data blocks.
  • however, there may be no free space left on the storage device to create a new block group, which requires reclaiming some of the block groups occupied by small data blocks to create block groups containing large data blocks.
  • when a block group is reclaimed, first select two block groups that contain relatively many free data blocks and have the same data block size. Of these two, assuming block group one has relatively more free blocks, the valid data in block group one is migrated to block group two, because the migration overhead is then small. Once the valid data in block group one has been migrated, that block group can be reclaimed and reformatted into another kind of block group.
  • the block group recovery mechanism ensures that the file system designed by the present invention can flexibly respond to load changes.
  • this embodiment also provides a low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory, the microprocessor of the computer device being programmed or configured to perform the steps of the aforementioned low-latency file system address space management method.
  • this embodiment also provides a low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory, the memory of the computer device storing a computer program programmed or configured to execute the aforementioned low-latency file system address space management method.
  • this embodiment also provides a computer-readable storage medium that stores a computer program programmed or configured to execute the aforementioned low-latency file system address space management method.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application.
  • each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions.
  • These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A low-latency file system address space management method, system and medium. The method comprises: generating a super block and block group allocation tables from the address space of a storage device, the super block storing file system information and the allocation of block groups in the linear address space of the storage device, and the block group allocation tables being used to mark the allocation of data blocks in the corresponding block groups; when creating a file, dynamically creating or selecting a corresponding block group according to a specified data block size and allocating data blocks, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block. Which data blocks a file occupies on the storage device can be learned through a single IO operation, which can significantly reduce file read/write latency, significantly reduce the addressing overhead of file reads and writes, and significantly improve the sequentiality of file reads and writes, thereby improving read/write performance.

Description

A low-latency file system address space management method, system and medium [Technical Field]
The present invention belongs to the field of file systems for large-scale storage, and specifically relates to a low-latency file system address space management method, system and medium, which adopt a novel data structure to manage the physical address space of the file system so as to significantly reduce file read/write latency.
[Background Art]
A local file system oriented to storage devices is the foundation of all data storage and management systems. For example, mainstream databases are built on top of file systems; distributed file systems must rely on local file systems to organize data on devices; and big-data storage systems represented by HBase and Dynamo are either built on distributed file systems or call local file systems directly to read and write data on storage devices. In short, the performance of the local file system has a crucial impact on all data storage and management systems.
Structurally, a local file system can be roughly divided into two components: namespace management and address space management. The main function of namespace management is to maintain the directory structure and provide users with the operation interface of the file system. Address space management is responsible for organizing data on the storage device. Generally speaking, no matter what technology a storage device is implemented with, the device vendor abstracts its physical address space into a linear logical address space, and the address space management module is responsible for organizing user data and file system metadata on that linear logical address space.
Specifically, a file system holds a large number of files, each of which is abstracted as a linear byte stream divided into fixed-length blocks; for example, the Ext4 file system divides files into 1KB blocks by default and optionally allows users to divide files into 2KB or 4KB blocks. Correspondingly, the logical address space of the underlying storage device is also divided into fixed-size blocks; for example, the minimum read/write unit of a disk is 512 bytes, but the file system can format the disk into blocks of 1KB, 2KB, 4KB, etc. An important role of file system address space management is to maintain the correspondence between the blocks of a file's byte stream and the blocks of the storage device's logical address space. For example, a 1MB file can be divided into 1024 blocks of 1KB; assuming the underlying storage device is also divided into 1KB blocks, the file occupies 1024 blocks on the storage device, and the correspondence between the file's 1024 blocks and the 1024 blocks it occupies on the device is maintained by the address space management module.
Different file systems adopt different address space management methods. The early FAT file system organized data blocks on disk in the form of linked pointers: the tail of each data block held a pointer to the next data block, and all data blocks were linked together by pointers. When reading or writing a file, all data blocks can be found in sequence along the linked pointers. This organization is logically simple but inefficient: for a file, reading any of its data blocks requires a sequential search starting from the first data block, so read/write performance degrades as the file grows.
The Ext series of file systems (such as Ext2, Ext3 and Ext4) maintain dedicated pointers to record the data blocks occupied by a file. Fifteen pointers are reserved in each file's inode, twelve of which point directly to data blocks and are called direct pointers. If the data blocks pointed to by these twelve pointers are not enough to hold the file's data (i.e., the file exceeds twelve data blocks), the thirteenth pointer, called the single indirect pointer, is enabled. The single indirect pointer points to a data block on the storage device that holds not user data but pointers to other data blocks. Assuming the storage device is formatted into 4KB data blocks and each pointer occupies 4 bytes, the data block pointed to by the single indirect pointer can hold 1024 pointers to 1024 data blocks, significantly increasing the maximum file size the file system can support. If the single indirect pointer is still insufficient, i.e., the file exceeds 1036 (1024+12) data blocks, the fourteenth pointer, called the double indirect pointer, is enabled. The double indirect pointer points to a data block that holds not user data but a large number of single indirect pointers. With 4KB blocks and 4-byte pointers, each such data block can contain 1024 pointers, so each double indirect pointer can index 1024 single indirect pointers and each single indirect pointer can index 1024 data blocks; introducing the double indirect pointer thus increases the maximum supported file by another 1024×1024 data blocks. If a large file exceeds the indexing capacity of the double indirect pointer, the fifteenth pointer in the inode, the triple indirect pointer, can be enabled; the data block it points to contains 1024 double indirect pointers, ultimately increasing the maximum supported file by another 1024×1024×1024 data blocks. In short, with triple indirect addressing, a file in the Ext series can contain at most 12+1024+1024×1024+1024×1024×1024 data blocks; with 4KB data blocks, a single file of 4TB can be stored, which meets the application requirements of most scenarios.
Multi-level indirect addressing can significantly extend the maximum file a file system can support, but it cannot guarantee file access performance. Specifically, when an application reads a data block of a large file, it may need to fetch in turn the file's inode, the triple indirect pointer, the double indirect pointer and the single indirect pointer before finally obtaining the block number of the target data and reading the data from it. In short, a single read request issued by the application ultimately triggers five read operations on the storage device (four indirect pointer reads and one data read); since these five reads are dependent on one another and cannot proceed concurrently, in the worst case the IO performance the application obtains is only 1/5 of the actual performance of the storage device.
BtrFS uses a B+ tree to index all the data blocks occupied by each file: a B+ tree is created for each file, the data blocks occupied by the file are inserted into the corresponding B+ tree, and when reading or writing the file, the B+ tree is searched according to the offset of the read/write position until the corresponding data block is found. This technique likewise struggles with large files: as a file grows, the corresponding B+ tree becomes deeper and the latency of searching it for the target data block increases, degrading file read/write performance.
To address the increased data block read/write latency for large files, the Ext4 file system introduced the concept of the extent, which reduces data block indexing overhead and avoids multi-level indirect pointers (such as double and triple indirect pointers). Specifically, unlike the fixed-length data blocks commonly used by traditional file systems, an extent is variable-length, specifically an integer multiple of the fixed-length data block. For a large file, a small number of long extents can record the data blocks it occupies on the storage device. Because extents are relatively long, a small number of extents can represent a large file, and only a small retrieval overhead is needed to find the corresponding data block when reading or writing. Generally, the data of a series of consecutive writes to a file is merged into one extent; the more data written consecutively, the longer the extent and the more beneficial for read/write performance. However, current operating systems generally support multi-core multi-threading, and a file system often faces concurrent writes to a large number of files at the same time, so consecutive writes to a particular file are often interleaved, making it difficult to create long extents; interleaved out-of-order writes to multiple files significantly reduce the optimization effect of the extent technique.
In short, current file systems struggle to meet the low-latency requirements of data access when facing large files, while data access latency is the key factor affecting application performance in many scenarios. For example, many applications merge a large number of small files into one large file stored in the file system, so random accesses by the upper-layer application to these small files become random accesses to different offsets of the large file; the random access latency of the large file then becomes the key factor affecting application performance. Similarly, in high-performance computing scenarios, applications often issue discontinuous read/write requests with fixed strides to large files; in such scenarios, applications place high demands on data access latency (rather than bandwidth). For these application scenarios, designing a low-latency local file system is of great significance.
[Summary of the Invention]
The technical problem to be solved by the present invention: in view of the above problems of the prior art, a low-latency file system address space management method, system and medium are provided. The present invention guarantees that which data blocks a file occupies on the storage device can be learned through a single IO operation, rather than requiring up to four IO operations to read indirect pointers as in the Ext file systems; this optimization significantly reduces file read/write latency and the addressing overhead of file reads and writes, and significantly improves the sequentiality of file reads and writes, thereby improving read/write performance.
To solve the above technical problem, the technical solution adopted by the present invention is:
A low-latency file system address space management method, whose implementation steps include:
1) generating a super block and block group allocation tables from the address space of the storage device, the super block storing file system information and the allocation of block groups in the linear address space of the storage device, and the block group allocation tables being used to mark the allocation of data blocks in the corresponding block groups;
2) when creating a file, dynamically creating or selecting a corresponding block group according to a specified data block size and allocating data blocks, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block.
Optionally, the specified data block size is an integer power of 2, in KB.
Optionally, the detailed steps of dynamically creating or selecting a corresponding block group according to the specified data block size and allocating data blocks in step 2) include:
A1) judging whether a block group of the specified data block size with free data blocks still exists; if so, taking that block group as the target block group; if not, creating a new block group of the specified data block size and taking the new block group as the target block group;
A2) allocating data blocks in the target block group for the created file, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block;
A3) judging whether the file has been completely written; if not, jumping to step A1); otherwise, if the file has been completely written, ending and exiting.
Optionally, the detailed steps of dynamically creating or selecting a corresponding block group according to the specified data block size and allocating data blocks in step 2) include:
B1) specifying a monotonically increasing set of data block sizes a₀~a_m, and taking the data block size a₀ as the current data block size a_i;
B2) judging whether a block group of the current data block size a_i with free data blocks still exists; if so, taking that block group as the target block group; if not, creating a new block group of the current data block size and taking the new block group as the target block group;
B3) allocating data blocks in the target block group for the created file and recording the number of allocated data blocks, writing file data into the allocated data blocks and updating the block group allocation table and the information of the super block, and when the number of allocated data blocks is less than the preset threshold n, jumping to step B4);
B4) judging whether the file has been completely written; if not, selecting from the data block sizes a₀~a_m the adjacent data block size larger than the current data block size a_i as the new current data block size a_i and jumping to step B2); otherwise, if the file has been completely written, ending and exiting.
Optionally, after step B4) a step of calculating the size FileSize of the created file is further included, and the calculation function expression of the file size FileSize is:
FileSize = n · a₀ · (2^(N/n) − 1)
where a₀ is the specified minimum data block size, N represents the total number of data blocks occupied by a file, and n represents the number of data blocks allocated for a file in each class of data block.
Optionally, after step 2) a step of reading a file or overwriting a file is further included: first, the data block pointer table corresponding to the file is read from the storage device into memory through a single IO operation, the data block pointer table being a table composed of the block numbers of the N data blocks occupied by the file and stored in the index node information of the file system; then the data block holding the data to be read or overwritten is calculated according to the data block pointer table, and the read or overwrite is performed on the calculated data block.
Optionally, after step 2) a step of append-writing a file is further included:
C1) receiving an append request for file f, the write data length of the append request being l bytes;
C2) judging whether the remaining space of the last of the N data blocks occupied by file f is greater than or equal to the write data length l of the append request; if so, jumping to step C3); otherwise, jumping to step C4);
C3) writing the l bytes of data into the already-allocated data blocks, ending and exiting;
C4) writing part of the data into the remaining space of the last of the N data blocks occupied by file f, and then updating the write data length l of the append request so that its new value is the original value minus the remaining space of the last of the N data blocks occupied by file f;
C5) allocating a new data block for file f;
C6) increasing the number N of data blocks occupied by file f by 1, and jumping to step C2).
Optionally, the calculation function expression of the remaining space of the last of the N data blocks occupied by file f is:
n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋ − L
where a₀ is the specified minimum data block size, N represents the total number of data blocks occupied by a file, n represents the number of data blocks allocated for a file in each class of data block, and L represents the existing length of file f.
In addition, the present invention further provides a low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory, the microprocessor of the computer device being programmed or configured to perform the steps of the low-latency file system address space management method, or the memory of the computer device storing a computer program programmed or configured to execute the low-latency file system address space management method.
In addition, the present invention further provides a computer-readable storage medium storing a computer program programmed or configured to execute the low-latency file system address space management method.
Compared with the prior art, the present invention has the following advantages:
(1) The address space management method proposed by the present invention significantly reduces the addressing overhead of file reads and writes.
When reading or writing a file, the upper-layer application gives the offset of the data within the file, and the file system must determine from this offset the data block on the storage device that holds the data; this process is called addressing. The addressing process of the Ext file systems is the lookup of multi-level indirect pointers, and that of BtrFS is the search of a B+ tree. According to the above analysis, both require multiple IO operations, which significantly increases the latency of reading and writing data. The present invention uses variable-length data blocks to reduce the total number of data blocks a file contains, keeping it within a few hundred, so that all the data block pointers can be read from the storage device in a single IO operation, significantly reducing the IO operations in the addressing process. Further, because the data block pointers occupy little space, the file system can even store them in the file's inode and read them out together with the inode information, thereby avoiding IO operations in the addressing process altogether. In short, a file system designed according to the present invention significantly reduces the IO operations of addressing and therefore reduces file read/write latency.
(2) The present invention significantly improves the sequentiality of file reads and writes, thereby improving read/write performance. Traditional file systems generally use fixed-length data blocks, such as 4KB blocks, which causes a file to occupy a large number of data blocks; for example, a 4GB file occupies a million 4KB data blocks. Although file systems take many optimization measures to improve the contiguity on the storage device of data blocks belonging to the same file, multiple applications may issue concurrent writes to the file system during the same period, so keeping so many data blocks contiguous on the device is very difficult. For most storage devices, sequential read/write performance is significantly better than random read/write performance, so improving the contiguity of file storage is crucial for performance optimization. The present invention guarantees that a file of any size can be stored in a few hundred data blocks, so the contiguity of file storage is significantly better than in traditional file systems, fully exploiting the sequential read/write advantage of storage devices.
[Brief Description of the Drawings]
In order to explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative work.
Fig. 1 is a basic flow diagram of the method of an embodiment of the present invention.
Fig. 2 shows the address space layout of the storage device in an embodiment of the present invention.
Fig. 3 is a flow diagram of append-writing a file in an embodiment of the present invention.
[Detailed Description of the Embodiments]
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the key of the embodiments of the present invention lies in how data blocks are organized in the linear address space of the storage device and how data is read from and written to the file system. Reading the data of a file is relatively simple: the data block to be read is calculated from the offset of the data within the file and the file's data block pointer table. Overwrite writing is similar to reading, so the implementation of these two cases is not discussed in detail here. The data block organization and the implementation of append writing are described in detail below.
As shown in Fig. 1, the implementation steps of the low-latency file system address space management method of this embodiment include:
1) generating a super block and block group allocation tables from the address space of the storage device, the super block storing file system information and the allocation of block groups in the linear address space of the storage device, and the block group allocation tables being used to mark the allocation of data blocks in the corresponding block groups;
2) when creating a file, dynamically creating or selecting a corresponding block group according to a specified data block size and allocating data blocks, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block.
Traditional file systems generally format the storage device into fixed-length data blocks and then divide the large number of data blocks into several block groups. For example, the Ext4 file system formats the storage device into optional 1KB, 4KB or 8KB data blocks, determines the block group size according to the data block size, and finally divides the whole storage device into several block groups. XFS likewise divides the device storage space into allocation groups, each of which is internally equivalent to an independent file system. The common feature of these file systems is that the data block length is fixed; to avoid the space waste caused by small files, the data block size is generally set to less than 8KB. Thus a large file is bound to contain a large number of data blocks, and building an index over these data blocks for convenient lookup brings enormous overhead. This embodiment proposes a variable-length data block method: the storage device address space is divided into block groups; the data blocks inside each block group are of fixed length, but different block groups may adopt data blocks of different sizes; when a file is created, the corresponding block group is dynamically created or selected according to the specified data block size and data blocks are allocated. For example, the data blocks in different block groups can be chosen as integer powers of 2 such as 1KB, 2KB, 4KB or 8KB, and the largest data block can even be set to 512MB or 1GB as required. The file system does not need to establish block groups at initialization; instead, it establishes block groups dynamically according to the needs of the files users create. Specifically, when creating a file, the user obtains data blocks from different block groups as needed; once the data blocks in a block group are exhausted, another block group with data blocks of the same size must be established. Once a block group has been established on the storage device, the size of its internal data blocks is fixed and cannot be changed; but after a block group is destroyed, the storage space it occupied can be reallocated to other block groups or used to establish a new block group, at which point the data block size within the block group can be set anew. Because the data block size within a block group can be specified, data blocks can be very large (unlike in traditional file systems, which generally use data blocks below 16KB), so the file system can store a large file with relatively few data blocks. These optimizations ensure that large files do not need multi-level indirect pointers to index their data blocks, thereby avoiding the multiple IO operations that traditional file systems issue to fetch indirect pointers when reading and writing large files.
Fig. 2 shows the address space layout of the storage device in this embodiment. At the beginning of the linear address space there are two data blocks used to store two copies of the file system super block (Superblock), which mainly records some basic information of the file system; the present invention also records the allocation of block groups in the linear address space of the storage device in the super block. After the super block come a large number of block group allocation tables, each of which is 4KB. After the block group allocation tables come a large number of block groups, each corresponding to one of the preceding block group allocation tables. As a specific optional implementation, in this embodiment the block group allocation table is used to mark the allocation of data blocks in the corresponding block group: one bit in the block group allocation table marks one data block in the block group; if the bit is 0 the corresponding data block is unallocated, and if the bit is 1 the corresponding data block has been allocated. Because a block group allocation table is 4KB and contains 32768 bits in total, each block group contains 32768 data blocks.
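The 4KB allocation table described above is simply a bitmap of 32768 bits, one bit per data block. A minimal sketch of this bookkeeping (the class and method names are illustrative, not taken from the patent):

```python
BLOCKS_PER_GROUP = 32768  # one 4 KB table = 4096 bytes * 8 bits per byte

class BlockGroupTable:
    """Bitmap over the data blocks of one block group: bit 0 = free, 1 = allocated."""

    def __init__(self):
        self.bits = bytearray(4096)  # all zero: every block starts out free

    def is_allocated(self, i):
        return (self.bits[i // 8] >> (i % 8)) & 1 == 1

    def allocate(self):
        """Return the index of the first free block and mark it allocated."""
        for i in range(BLOCKS_PER_GROUP):
            if not self.is_allocated(i):
                self.bits[i // 8] |= 1 << (i % 8)
                return i
        return None  # group exhausted: the caller must establish a new group

    def free(self, i):
        """Clear the bit for block i, returning it to the free pool."""
        self.bits[i // 8] &= ~(1 << (i % 8))
```

A freed bit is handed out again before the scan moves on, matching the description that reclaimed data blocks are allocated first.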
As an optional implementation, dynamically creating or selecting a corresponding block group according to the specified data block size and allocating data blocks may use a fixed specified data block size; in this mode, the detailed steps in step 2) include:
A1) judging whether a block group of the specified data block size with free data blocks still exists; if so, taking that block group as the target block group; if not, creating a new block group of the specified data block size and taking the new block group as the target block group;
A2) allocating data blocks in the target block group for the created file, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block;
A3) judging whether the file has been completely written; if not, jumping to step A1); otherwise, if the file has been completely written, ending and exiting.
After a file is created, some data blocks must be allocated to it in preparation for the data written by the upper-layer application. When a file is first created, its final size cannot be determined, so as an optional implementation a conservative strategy is adopted and small data blocks are allocated first: dynamically creating or selecting a corresponding block group according to the specified data block size and allocating data blocks may use a dynamically changing specified data block size. In this mode, the detailed steps in step 2) include:
B1) specifying a monotonically increasing set of data block sizes a₀~a_m, and taking the data block size a₀ as the current data block size a_i;
B2) judging whether a block group of the current data block size a_i with free data blocks still exists; if so, taking that block group as the target block group; if not, creating a new block group of the current data block size and taking it as the target block group;
B3) allocating data blocks in the target block group for the created file and recording the number of allocated data blocks, writing file data into the allocated data blocks and updating the block group allocation table and the information of the super block, and when the number of allocated data blocks is less than the preset threshold n, jumping to step B4);
B4) judging whether the file has been completely written; if not, selecting from the data block sizes a₀~a_m the adjacent data block size larger than the current data block size a_i as the new current data block size a_i and jumping to step B2); otherwise, if the file has been completely written, ending and exiting.
For example, if the data block sizes a₀~a_m start at a₀ = 1KB, a conservative strategy first allocates 1KB data blocks. As the upper-layer application keeps writing data into the file, the number of small 1KB blocks it occupies grows; once a threshold (assumed to be n) is exceeded, 2KB data blocks can be allocated to it; as more data is written and the 2KB blocks it occupies also exceed the threshold n, 4KB data blocks can be allocated; and so on. As the file keeps growing, the size of the data blocks allocated to it grows exponentially, but only n blocks of each class are allocated. This allocation principle guarantees that while the number of data blocks occupied by a file grows linearly, the file size can grow exponentially, so that a very large file can be indexed with relatively few data block pointers.
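Under this policy the size class of the next allocation depends only on how many blocks the file already holds. A sketch of the rule (the names `next_block_size`, `n` and `a0` are illustrative, with n = 4 and a₀ = 1KB as in the example above):

```python
def next_block_size(N, n=4, a0=1024):
    """Size in bytes of the (N+1)-th data block for a file that already
    occupies N blocks: n blocks each of a0, 2*a0, 4*a0, ... are used in turn."""
    return a0 * (2 ** (N // n))

# First 12 allocations with n=4, a0=1KB: four 1KB, four 2KB, then four 4KB blocks.
sizes = [next_block_size(i) for i in range(12)]
```

The block size doubles after every n allocations, which is exactly the exponential growth described above.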
Based on the above principle, the file size FileSize can be written as the sum of a geometric series. In this embodiment, after step B4) a step of calculating the size FileSize of the created file is further included, and the calculation function expression of the file size FileSize is:
FileSize = n · a₀ · (2^(N/n) − 1)
where a₀ is the specified minimum data block size, N represents the total number of data blocks occupied by a file, and n represents the number of data blocks allocated for a file in each class of data block. As can be seen from the formula, as N grows linearly, FileSize grows exponentially. Specifically, assuming n is 4 and a₀ is 1KB, N and FileSize correspond as shown in the following table:
Table 1: Correspondence between N and FileSize.
N FileSize
4 4KB
8 12KB
12 28KB
16 60KB
20 124KB
24 252KB
28 508KB
32 1020KB
64 ≈256MB
128 ≈16TB
It can be seen from the table that when a file uses 128 pointers, the file can reach 16TB, which meets the needs of the vast majority of applications. Assuming each pointer occupies 4 bytes, 128 pointers occupy only 512 bytes, so all the pointers can be fetched from the storage device into memory through a single IO operation, avoiding the situation in the Ext file systems where multiple IO operations are needed to fetch indirect pointers level by level. Hereinafter, this embodiment refers to the table composed of the block numbers of the N data blocks occupied by a file as the data block pointer table; the table records in which data blocks the file's data is stored. Note that because file sizes differ, the value of N differs for each file.
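The geometric-series formula and Table 1 can be checked numerically; a minimal sketch, assuming N is a multiple of n (the function name is illustrative):

```python
def file_size(N, n=4, a0=1024):
    """FileSize = n * a0 * (2**(N/n) - 1): total bytes addressable with N
    block pointers when n blocks of each size class a0, 2*a0, 4*a0, ... are used."""
    return n * a0 * (2 ** (N // n) - 1)
```

With n = 4 and a₀ = 1KB this reproduces every row of Table 1, including file_size(64) ≈ 256MB and file_size(128) ≈ 16TB.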
In this embodiment, after step 2) a step of reading a file is further included: first, the data block pointer table corresponding to the file is read from the storage device into memory through a single IO operation; the data block pointer table is the table composed of the block numbers of the N data blocks occupied by the file and is stored in the index node (inode) information of the file system; then the data block holding the requested data is calculated according to the data block pointer table, and the data is read from the calculated data block. When reading data from a file, the file's data block pointer table is obtained first. According to the analysis above, the table occupies no more than 512 bytes, so it can be read from the storage device into memory through a single IO operation. Specifically, the file system can store the table in the inode information and read it together with the file's inode, further reducing the IO operations required for reading user data. From the table, the data block holding the data at any offset of the file can be calculated, after which the user data can be read directly from the calculated data block.
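The offset-to-block calculation mentioned above is pure arithmetic over the size classes; a sketch of one way to do it (the function `locate` and its parameters are hypothetical, with n blocks per class starting at size a₀):

```python
def locate(offset, n=4, a0=1024):
    """Map a byte offset within a file to (index into the data block pointer
    table, offset within that block) under the exponential block-size layout."""
    c = 0
    # Skip whole size classes: class c holds n blocks of a0 * 2**c bytes,
    # so the first c classes together hold n * a0 * (2**c - 1) bytes.
    while n * a0 * (2 ** (c + 1) - 1) <= offset:
        c += 1
    within = offset - n * a0 * (2 ** c - 1)  # bytes into size class c
    size = a0 * (2 ** c)                     # block size in class c
    return c * n + within // size, within % size
```

For example, with n = 4 and a₀ = 1KB, byte 5000 falls 904 bytes into block 4 (the first 2KB block); no IO is needed beyond fetching the pointer table itself.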
In this embodiment, after step 2) a step of overwriting a file is further included: first, the data block pointer table corresponding to the file is read from the storage device into memory through a single IO operation; the data block pointer table is the table composed of the block numbers of the N data blocks occupied by the file and is stored in the index node information of the file system; then the data block holding the data to be overwritten is calculated according to the data block pointer table, and the overwrite is performed in the calculated data block.
Two cases can arise when writing data to a file. The first case is overwriting, i.e., overwriting existing data on the already-allocated data blocks of a file; in this case no new data blocks need to be allocated, and writing is similar to the reading case discussed above: simply calculate the target data block to be written according to the data block pointer table. The second case is append writing, i.e., writing more data at the end of a file, which requires allocating new data blocks for the file. As shown in Fig. 3, after step 2) this embodiment further includes a step of append-writing a file:
C1) receiving an append request for file f, the write data length of the append request being l bytes;
C2) judging whether the remaining space of the last of the N data blocks occupied by file f is greater than or equal to the write data length l of the append request; if so, jumping to step C3); otherwise, jumping to step C4);
C3) writing the l bytes of data into the already-allocated data blocks, ending and exiting;
C4) writing part of the data into the remaining space of the last of the N data blocks occupied by file f, and then updating the write data length l of the append request so that its new value is the original value minus the remaining space of the last of the N data blocks occupied by file f;
C5) allocating a new data block for file f;
C6) increasing the number N of data blocks occupied by file f by 1, and jumping to step C2).
The last data block occupied by file f is very likely not full of data; if the remaining space of the last data block is greater than l, the appended data can be written into the file without requesting a new data block. Because file f occupies N data blocks in total, the storage space it occupies can be calculated as:
n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋
where N represents the total number of data blocks occupied by a file and n represents the number of data blocks allocated for a file in each class of data block. Assuming the existing length of file f is L, the calculation function expression of the remaining space of the last of the N data blocks occupied by file f is:
n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋ − L
where N represents the total number of data blocks occupied by a file, n represents the number of data blocks allocated for a file in each class of data block, and L represents the existing length of file f. Step C2) therefore compares this value with l; if it is greater than or equal to l, there is still enough free space.
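The remaining-space expression can be evaluated directly; a sketch under the same assumptions as before (hypothetical names, n = 4, a₀ = 1KB):

```python
def last_block_free(N, L, n=4, a0=1024):
    """Free bytes left in the last of the N data blocks occupied by a file
    of current length L: the total capacity of those N blocks minus L."""
    cap = n * a0 * (2 ** (N // n) - 1) + (N % n) * a0 * (2 ** (N // n))
    return cap - L
```

For a 100-byte file in its first 1KB block this yields 924 free bytes; once all four 1KB blocks are exactly full (N = 4, L = 4096) it yields 0, triggering step C4).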
In step C4), because the last data block of file f does not have enough space to receive data of length l, only part of the appended data can be written into the last data block. Since the remaining space of the last data block is as given above, the amount written into the last data block is likewise:
n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋ − L
and the function expression for updating the write data length l of the append request is:
l ← l − (n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋ − L)
Assuming the file already occupies N data blocks, its existing data blocks are n 1KB data blocks, n 2KB data blocks, n 4KB data blocks and so on in sequence; according to the data block allocation principle proposed in this embodiment, the size of the data block to be allocated next is:
a₀ · 2^⌊N/n⌋
At this time, a request is made to the block group whose data block size is a₀ · 2^⌊N/n⌋, and once the data block is successfully allocated, the data to be appended can be written into the newly allocated data block. Therefore, when step C5) allocates a new data block for file f, the size of the allocated data block is a₀ · 2^⌊N/n⌋, where N represents the total number of data blocks occupied by a file and n represents the number of data blocks allocated for a file in each class of data block.
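Steps C1)–C6) can be sketched as a loop that fills the tail of the last block and then allocates exponentially larger blocks until the append fits (hypothetical code; the actual writes and the block group request are abstracted away, with n = 4 and a₀ = 1KB):

```python
def capacity(N, n=4, a0=1024):
    """Total bytes held by the first N data blocks of the exponential layout."""
    return n * a0 * (2 ** (N // n) - 1) + (N % n) * a0 * (2 ** (N // n))

def append_blocks(N, L, l, n=4, a0=1024):
    """Append-write sketch (steps C2-C6): for a file of length L occupying N
    blocks, return (new N, sizes of the newly allocated blocks) for an
    append of l bytes."""
    sizes = []
    while capacity(N, n, a0) - L < l:     # step C2 fails: tail too small
        l -= capacity(N, n, a0) - L       # C4: fill the tail of the last block
        L = capacity(N, n, a0)
        sizes.append(a0 * 2 ** (N // n))  # C5: allocate the next-size block
        N += 1                            # C6, then retry the C2 test
    return N, sizes
```

For example, appending 5000 bytes to a file that exactly fills its four 1KB blocks (N = 4, L = 4096) allocates three 2KB blocks before the remainder fits.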
As the upper-layer application keeps writing data into the file system, the data blocks in each block group are gradually consumed; when the free data blocks in a block group fall below a set threshold, a new block group with data blocks of the same size must be allocated. At this time, a free region can be carved out of the storage device, a new block group established, and the information of the block group recorded in the super block of the file system.
When a file is deleted, the data blocks it occupies are reclaimed. If other files need to write data, these reclaimed data blocks are allocated first. However, when the size distribution of the files stored in the file system changes significantly, a large number of data blocks in a block group may be reclaimed but never reallocated to other files. For example, a file system may store a large number of small files early on, causing it to establish many block groups containing small data blocks (such as block groups of 1KB data blocks and block groups of 2KB data blocks); later, as the load of the upper-layer application changes, the small files are gradually deleted and a large number of large files are created instead. The file system then needs to create many block groups containing large data blocks; however, there may be no free space left on the storage device to establish new block groups, which requires reclaiming some of the block groups occupied by small data blocks to establish block groups containing large data blocks. Conversely, when the file system stores many large files early on and later shifts to storing many small files, some block groups occupied by large data blocks must be reclaimed to establish block groups containing small data blocks.
When reclaiming block groups, first select two block groups that contain relatively many free data blocks and have the same data block size. Of these two, assuming block group one has relatively more free blocks, the valid data in block group one is migrated to block group two, because the migration overhead is then small. Once the valid data in block group one has been migrated, that block group can be reclaimed and reformatted into another kind of block group. The block group reclamation mechanism ensures that a file system designed according to the present invention can respond flexibly to load changes.
In addition, this embodiment also provides a low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory, the microprocessor of the computer device being programmed or configured to perform the steps of the aforementioned low-latency file system address space management method.
In addition, this embodiment also provides a low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory, the memory of the computer device storing a computer program programmed or configured to execute the aforementioned low-latency file system address space management method.
In addition, this embodiment also provides a computer-readable storage medium storing a computer program programmed or configured to execute the aforementioned low-latency file system address space management method.
Those skilled in the art should understand that the embodiments of this application may be provided as methods, systems, or computer program products. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram. These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, which implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram. These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
The above are merely preferred embodiments of the present invention; the scope of protection of the present invention is not limited to the above embodiments, and all technical solutions under the idea of the present invention belong to its scope of protection. It should be pointed out that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principle of the present invention should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

  1. A low-latency file system address space management method, characterized in that its implementation steps include:
    1) generating a super block and block group allocation tables from the address space of a storage device, the super block storing file system information and the allocation of block groups in the linear address space of the storage device, and the block group allocation tables being used to mark the allocation of data blocks in the corresponding block groups;
    2) when creating a file, dynamically creating or selecting a corresponding block group according to a specified data block size and allocating data blocks, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block.
  2. The low-latency file system address space management method according to claim 1, characterized in that the specified data block size is an integer power of 2, in KB.
  3. The low-latency file system address space management method according to claim 1, characterized in that the detailed steps of dynamically creating or selecting a corresponding block group according to the specified data block size and allocating data blocks in step 2) include:
    A1) judging whether a block group of the specified data block size with free data blocks still exists; if so, taking that block group as the target block group; if not, creating a new block group of the specified data block size and taking the new block group as the target block group;
    A2) allocating data blocks in the target block group for the created file, writing file data into the allocated data blocks, and updating the block group allocation table and the information of the super block;
    A3) judging whether the file has been completely written; if not, jumping to step A1); otherwise, if the file has been completely written, ending and exiting.
  4. The low-latency file system address space management method according to claim 1, characterized in that the detailed steps of dynamically creating or selecting a corresponding block group according to the specified data block size and allocating data blocks in step 2) include:
    B1) specifying a monotonically increasing set of data block sizes a₀~a_m, and taking the data block size a₀ as the current data block size a_i;
    B2) judging whether a block group of the current data block size a_i with free data blocks still exists; if so, taking that block group as the target block group; if not, creating a new block group of the current data block size and taking the new block group as the target block group;
    B3) allocating data blocks in the target block group for the created file and recording the number of allocated data blocks, writing file data into the allocated data blocks and updating the block group allocation table and the information of the super block, and when the number of allocated data blocks is less than the preset threshold n, jumping to step B4);
    B4) judging whether the file has been completely written; if not, selecting from the data block sizes a₀~a_m the adjacent data block size larger than the current data block size a_i as the new current data block size a_i and jumping to step B2); otherwise, if the file has been completely written, ending and exiting.
  5. The low-latency file system address space management method according to claim 4, characterized in that after step B4) a step of calculating the size FileSize of the created file is further included, and the calculation function expression of the file size FileSize is:
    FileSize = n · a₀ · (2^(N/n) − 1)
    where a₀ is the specified minimum data block size, N represents the total number of data blocks occupied by a file, and n represents the number of data blocks allocated for a file in each class of data block.
  6. The low-latency file system address space management method according to claim 1, characterized in that after step 2) a step of reading a file or overwriting a file is further included: first, the data block pointer table corresponding to the file is read from the storage device into memory through a single IO operation, the data block pointer table being a table composed of the block numbers of the N data blocks occupied by the file and stored in the index node information of the file system; then the data block holding the data to be read or overwritten is calculated according to the data block pointer table, and the read or overwrite is performed on the calculated data block.
  7. The low-latency file system address space management method according to claim 4, characterized in that after step 2) a step of append-writing a file is further included:
    C1) receiving an append request for file f, the write data length of the append request being l bytes;
    C2) judging whether the remaining space of the last of the N data blocks occupied by file f is greater than or equal to the write data length l of the append request; if so, jumping to step C3); otherwise, jumping to step C4);
    C3) writing the l bytes of data into the already-allocated data blocks, ending and exiting;
    C4) writing part of the data into the remaining space of the last of the N data blocks occupied by file f, and then updating the write data length l of the append request so that its new value is the original value minus the remaining space of the last of the N data blocks occupied by file f;
    C5) allocating a new data block for file f;
    C6) increasing the number N of data blocks occupied by file f by 1, and jumping to step C2).
  8. The low-latency file system address space management method according to claim 7, characterized in that the calculation function expression of the remaining space of the last of the N data blocks occupied by file f is:
    n · a₀ · (2^⌊N/n⌋ − 1) + (N mod n) · a₀ · 2^⌊N/n⌋ − L
    where a₀ is the specified minimum data block size, N represents the total number of data blocks occupied by a file, n represents the number of data blocks allocated for a file in each class of data block, and L represents the existing length of file f.
  9. A low-latency file system address space management system, including a computer device that includes at least a microprocessor and a memory, characterized in that the microprocessor of the computer device is programmed or configured to perform the steps of the low-latency file system address space management method according to any one of claims 1 to 8, or the memory of the computer device stores a computer program programmed or configured to execute the low-latency file system address space management method according to any one of claims 1 to 8.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program programmed or configured to execute the low-latency file system address space management method according to any one of claims 1 to 8.
PCT/CN2020/097671 2020-04-14 2020-06-23 一种低延迟的文件系统地址空间管理方法、系统及介质 WO2021208239A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/638,196 US11853566B2 (en) 2020-04-14 2020-06-23 Management method and system for address space of low delay file system and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010290402.5A CN111522507B (zh) 2020-04-14 2020-04-14 一种低延迟的文件系统地址空间管理方法、系统及介质
CN202010290402.5 2020-04-14

Publications (1)

Publication Number Publication Date
WO2021208239A1 true WO2021208239A1 (zh) 2021-10-21

Family

ID=71911178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097671 WO2021208239A1 (zh) 2020-04-14 2020-06-23 一种低延迟的文件系统地址空间管理方法、系统及介质

Country Status (3)

Country Link
US (1) US11853566B2 (zh)
CN (1) CN111522507B (zh)
WO (1) WO2021208239A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826878A (zh) * 2023-02-14 2023-03-21 浪潮电子信息产业股份有限公司 一种写时拷贝方法、装置、设备及计算机可读存储介质

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392040B (zh) * 2021-06-23 2023-03-21 锐捷网络股份有限公司 一种地址映射方法、装置、设备
CN113568868B (zh) * 2021-07-28 2024-02-06 重庆紫光华山智安科技有限公司 文件系统管理方法、系统、电子设备及介质
CN113721862B (zh) * 2021-11-02 2022-02-08 腾讯科技(深圳)有限公司 数据处理方法及装置
CN117991995B (zh) * 2024-03-26 2024-06-07 中国人民解放军海军潜艇学院 一种sd卡文件连续读或写控制方法、系统及存储设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516929A (zh) * 2013-09-27 2015-04-15 伊姆西公司 用于文件系统的方法和装置
CN109254733A (zh) * 2018-09-04 2019-01-22 北京百度网讯科技有限公司 用于存储数据的方法、装置和系统
CN109726145A (zh) * 2018-12-29 2019-05-07 杭州宏杉科技股份有限公司 一种数据存储空间的分配方法、装置及电子设备
US20190220443A1 (en) * 2018-01-18 2019-07-18 EMC IP Holding Company LLC Method, apparatus, and computer program product for indexing a file
CN110221782A (zh) * 2019-06-06 2019-09-10 重庆紫光华山智安科技有限公司 视频文件处理方法及装置
CN110427340A (zh) * 2018-04-28 2019-11-08 伊姆西Ip控股有限责任公司 用于文件存储的方法、装置和计算机存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6256642B1 (en) * 1992-01-29 2001-07-03 Microsoft Corporation Method and system for file system management using a flash-erasable, programmable, read-only memory
KR102530369B1 (ko) * 2018-04-23 2023-05-10 에스케이하이닉스 주식회사 저장 장치 및 그 동작 방법
KR102611566B1 (ko) * 2018-07-06 2023-12-07 삼성전자주식회사 솔리드 스테이트 드라이브 및 그의 메모리 할당 방법
CN110427304A (zh) 2019-07-30 2019-11-08 中国工商银行股份有限公司 用于银行系统的运维方法、装置、电子设备以及介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516929A (zh) * 2013-09-27 2015-04-15 伊姆西公司 用于文件系统的方法和装置
US20190220443A1 (en) * 2018-01-18 2019-07-18 EMC IP Holding Company LLC Method, apparatus, and computer program product for indexing a file
CN110427340A (zh) * 2018-04-28 2019-11-08 伊姆西Ip控股有限责任公司 用于文件存储的方法、装置和计算机存储介质
CN109254733A (zh) * 2018-09-04 2019-01-22 北京百度网讯科技有限公司 用于存储数据的方法、装置和系统
CN109726145A (zh) * 2018-12-29 2019-05-07 杭州宏杉科技股份有限公司 一种数据存储空间的分配方法、装置及电子设备
CN110221782A (zh) * 2019-06-06 2019-09-10 重庆紫光华山智安科技有限公司 视频文件处理方法及装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826878A (zh) * 2023-02-14 2023-03-21 浪潮电子信息产业股份有限公司 一种写时拷贝方法、装置、设备及计算机可读存储介质

Also Published As

Publication number Publication date
US11853566B2 (en) 2023-12-26
CN111522507A (zh) 2020-08-11
US20220404990A1 (en) 2022-12-22
CN111522507B (zh) 2021-10-01

Similar Documents

Publication Publication Date Title
WO2021208239A1 (zh) 一种低延迟的文件系统地址空间管理方法、系统及介质
Kwon et al. Strata: A cross media file system
US10831734B2 (en) Update-insert for key-value storage interface
CN110825748B (zh) 利用差异化索引机制的高性能和易扩展的键值存储方法
US9471500B2 (en) Bucketized multi-index low-memory data structures
JP6205650B2 (ja) 不均等アクセス・メモリにレコードを配置するために不均等ハッシュ機能を利用する方法および装置
EP2433227B1 (en) Scalable indexing in a non-uniform access memory
US20060218347A1 (en) Memory card
CN111221776A (zh) 面向非易失性内存的文件系统的实现方法、系统及介质
CN109445702B (zh) 一种块级数据去重存储系统
EP1265152B1 (en) Virtual file system for dynamically-generated web pages
CN101488153A (zh) 嵌入式Linux下大容量闪存文件系统的实现方法
CN101571869B (zh) 一种智能卡的文件存储、读取方法及装置
US20150324281A1 (en) System and method of implementing an object storage device on a computer main memory system
CN100424699C (zh) 一种属性可扩展的对象文件系统
CN115427941A (zh) 数据管理系统和控制的方法
US8209513B2 (en) Data processing system with application-controlled allocation of file storage space
CN111143285A (zh) 一种小文件存储文件系统以及小文件处理方法
Neal et al. Rethinking file mapping for persistent memory
CA2758235A1 (en) Device and method for storage, retrieval, relocation, insertion or removal of data in storage units
WO2022205544A1 (zh) 基于Cuckoo哈希的文件系统目录管理方法及系统
WO2022262381A1 (zh) 一种数据压缩方法及装置
KR100907477B1 (ko) 플래시 메모리에 저장된 데이터의 인덱스 정보 관리 장치및 방법
US7424574B1 (en) Method and apparatus for dynamic striping
CN114115711B (zh) 基于非易失内存文件系统的快速缓存系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20931081

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20931081

Country of ref document: EP

Kind code of ref document: A1