WO2020125362A1 - File system and data layout method - Google Patents

File system and data layout method

Info

Publication number
WO2020125362A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
file system
cost
file
area
Prior art date
Application number
PCT/CN2019/121301
Other languages
French (fr)
Chinese (zh)
Inventor
王洋
夏明辉
须成忠
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020125362A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/18 File system types

Definitions

  • The invention belongs to the technical field of data layout, and particularly relates to a file system and a data layout method.
  • GPFS is the abbreviation of General Parallel File System.
  • GPFS, from IBM, is a scalable, high-performance, general-purpose parallel file system based on shared disks. It provides parallel, high-speed, secure, and reliable data access for all nodes in the storage system.
  • PanFS is a parallel file system developed by Panasas.
  • PanFS is a general-purpose parallel file system whose main application field is currently similar to that of Lustre.
  • PanFS is scalable and provides strong consistency through distributed locks.
  • The performance gap between solid-state-drive-based servers and hard-disk-drive-based servers significantly reduces the performance of the parallel file system, because the solid-state-drive-based servers always perform better and therefore need less I/O time to complete the same amount of data access.
  • If an existing layout scheme is applied, it allocates the same stripe size to solid-state-drive-based and hard-disk-drive-based servers, which may cause severe load imbalance between the heterogeneous servers.
  • Complex I/O workloads may also jeopardize the efficiency of the I/O system.
  • The present invention provides a file system that includes an I/O tracer, a cost calculation module, and an area division module that are electrically connected to one another. The I/O tracer provides the area division module with the I/O information it collects while the file system is running.
  • The I/O tracer also provides the cost calculation module with the file system configuration file it collects. The cost calculation module calculates or estimates the access cost of file requests in the file system and outputs a cost model to the area division module. The area division module generates, according to the cost model, the distribution of regions that minimizes the total cost, divides the file into those regions, and obtains the stripe size corresponding to each region.
  • The file system further includes a daemon module, which executes the daemon process in the background; the FUSE module acts as an agent of the daemon process.
  • The file system further includes an update data layout module, which is connected to the daemon module, the I/O tracer, the area division module, and the hybrid storage system, respectively, and which dynamically detects and updates region changes.
  • The replication time is calculated as $T_c(r, h, s) \approx 3(mh + ns)\,t_c$, where $t_c$ is the unit data replication time from kernel space to user space, $h$ is the stripe size on an HServer, $s$ is the stripe size on an SServer, $m$ is the number of HServers, and $n$ is the number of SServers.
  • The area division module is used to obtain the minimum cost of dividing l events into k regions starting from event i.
  • FIG. 1 is a schematic diagram of a data layout scheme using fixed-size strips in the prior art
  • FIG. 2 is a schematic diagram of the data layout scheme based on region division
  • FIG. 3 is a schematic diagram of a file system based on regional data layout
  • FIG. 5 is an application example diagram of a file system in an embodiment of the present invention.
  • The region scheme in RLFS is a finer-grained and more adaptive data layout than the traditional one, with different stripe sizes on different storage servers. It can therefore be seen as a variant of the 1-DH layout scheme. RLFS can aggregate the bandwidth of all storage servers to maximize I/O performance, and it matches the hybrid storage system 100 well.
  • RLFS aims to support region-based data layout by using file stripes of different sizes.
  • RLFS uses a partitioned processing method to achieve the optimal data layout.
  • A cost model is generated in RLFS. According to the cost model, RLFS divides a large file into a set of regions, each with its own stripe size. The optimal regions and their stripe sizes are those that minimize the total cost of all I/O requests of the application.
  • The storage system involved in this embodiment is a hybrid storage system 100.
  • The hybrid storage system 100 includes a solid-state-drive-based server 102, referred to as SServer, and a hard-disk-drive-based server 101, referred to as HServer.
  • An embodiment of the present invention provides a file system called the Region Level File System, or RLFS for short.
  • The file system supports regional data layout and solves the data distribution problem in existing parallel file systems.
  • RLFS relies on a defined cost model and a per-region, heterogeneity-aware scheme to determine the optimal file stripe size for each server, and it further adjusts the region scheme at runtime as the access pattern changes.
  • RLFS is aware of both the storage system and the application. In essence, RLFS represents a shift from the traditional one-dimensional layout with a fixed stripe size to a two-dimensional layout with varying stripe sizes, which lets it adapt well to changes in server performance and application behavior. In addition, RLFS updates the generated data layout scheme when it detects a change in the access pattern, which addresses the static data layout problem and makes the layout better suited to file access at runtime.
  • RLFS comprises a kernel part and a user-level daemon module 20; the kernel part includes the FUSE module 10.
  • the file system RLFS provided by the embodiments of the present invention is preferably designed based on the FUSE framework.
  • FUSE refers to the user space file system, which is an abbreviation of Filesystem in Userspace.
  • the kernel part is preferably a Linux kernel module, and the kernel part further includes a VFS module 11, which is a virtual file system, which is an abbreviation of Virtual File System.
  • the VFS module 11 is used to register RLFS.
  • a block device is created in the kernel part. The block device acts as an interface between the daemon process module 20 and the kernel part.
  • The FUSE module 10 acts as an agent of the daemon module 20 for the various file system requests issued by the application.
  • An application on client 1 accesses RLFS by mounting it into its namespace; thereafter, all file system calls directed at the mount point are forwarded to the FUSE module 10 through the VFS module 11. The FUSE module 10 then relays the calls in its request queue to the daemon module 20 through the block device, where an appropriate handler is invoked to serve each file system call by contacting the metadata server 200 and/or other storage servers.
  • the response propagates through the kernel part along the reverse path and eventually propagates back to the application.
  • the application is usually in a waiting state after making a request, waiting for a response.
  • The RLFS daemon and the storage servers together should implement all PFS semantics.
  • The read handler should first identify which storage servers hold the requested data segments, and then issue sub-requests to these servers for parallel access.
  • the kernel part also includes a file log module 12 for recording operation logs for the metadata server 200.
  • In addition to the general PFS semantics, RLFS also needs to implement region-based data layout. To this end, RLFS is equipped with three user-level components: an I/O tracer 3, a cost calculation module 4, and an area division module 5. Through these components RLFS completes a three-phase data layout cycle. The cycle starts with the tracing phase, during which the I/O tracer 3 collects into a trace file the runtime statistics of data access and the file system profile used for cost modeling (for example, FUSE queue information) while the application executes.
  • RLFS can greatly improve the I/O performance of the application in subsequent operations.
  • RLFS also includes an update data layout module 8, which is connected to the daemon module 20, the I/O tracer 3, the area division module 5, and the hybrid storage system 100, respectively.
  • To update the data layout dynamically, the update data layout module 8 detects and updates region changes at runtime. The specific functions of the I/O tracer 3, the cost calculation module 4, and the area division module 5 are explained separately below:
  • I/O tracer 3 is used in RLFS to collect both runtime I/O information and file system configuration files.
  • Because the file system provided by the embodiment of the present invention is designed on the FUSE framework, existing I/O data collection tools such as IOSIG [42] cannot be applied to RLFS directly; this follows from the inherent structure of the FUSE framework.
  • The I/O tracer 3 in this embodiment therefore follows the N-1 logging mode, in which all RLFS daemons write to a single shared file. The I/O tracer 3 can thus collect all information about I/O operations, including file access type, operation time, and other process-related data.
  • the cost calculation module 4 can generate a cost model, and the cost model aims to find the minimum total cost.
  • the file system proposed in the embodiment of the present invention needs to rely on the cost calculation module 4.
  • the cost is defined as the I/O completion time of each file request.
  • the cost calculation module 4 is used to calculate the cost of file request access in the file system.
  • the file system is compatible with the hybrid storage system.
  • The cost computed by the cost calculation module 4 should include the system cost of the file system as well as the network and storage costs.
  • The cost calculation module 4 includes a system cost calculation module 41 and a network and storage cost calculation module 42.
  • The system cost of the file system mainly refers to the time overhead in the FUSE data path. Since the main goal of RLFS is to optimize read requests through the optimal placement of data files on the hybrid storage system 100, only the system cost related to read requests is defined in this embodiment; the cost of write requests can be derived in the same way with the same parameters.
  • The service time is divided into three parts: first, the waiting time in the FUSE module 10; second, the time of the two context switches between the FUSE module 10 and the daemon module 20; third, the time of the three copy operations.
  • The time a read request waits in the FUSE module 10 queue is closely related to the application running between the client 1 and RLFS.
  • This waiting time depends not only on the I/O request pattern of the application but also on other factors caused by the file system, such as page caching or interrupts. It is therefore difficult to estimate accurately, and the waiting time T q is taken as 0.
  • The time of the three copy operations is referred to as the replication time $T_c$. It is proportional to the requested data size $r$.
  • $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively. Expressing them in terms of the stripe sizes, $h$ on an HServer and $s$ on an SServer, the replication time becomes $T_c(r, h, s) \approx 3(mh + ns)\,t_c$.
  • The network and storage cost calculated by the network and storage cost calculation module 42 comprises the network connection time $T_e$, the storage access time $T_a$, and the network transmission time $T_x$.
  • A PFS request is divided into a set of sub-requests, and each sub-request is forwarded to a separate storage server for parallel execution. The cost of a request in the network and the storage servers is therefore determined by the maximum cost over all of its sub-requests.
  • The network transmission time $T_x$ is determined by the sub-request sizes ($s_m$ and $s_n$) and the unit network transmission time $t$.
  • The storage access time $T_a$ is determined by the sub-request sizes and by $t_h$ and $t_s$, the unit data transmission times on an HServer and on an SServer, respectively.
  • The network and storage cost $T_2$ calculated by the network and storage cost calculation module 42 can be expressed as $T_2 \approx T_e + \max\{h(t_h + t),\ s(t_s + t)\}$, where $h$ is the stripe size on an HServer and $s$ is the stripe size on an SServer.
  • A write request involves more operations than a read request.
  • For a write request, two context switches are again performed between the FUSE module 10 and the daemon module 20, and the time T s also needs to include the time for write amplification, garbage collection, and wear leveling.
  • The area division module 5 divides the file into different regions so as to minimize the total cost of a given set of accesses that characterizes a parallel application.
  • An existing region division approach is HARL, which handles region division and stripe size determination in two separate stages.
  • The layout strategy of RLFS is integrated: it is a unified method that considers region division and stripe size determination together, so RLFS can determine both at once.
  • Unlike HARL, RLFS does not scan trace files heuristically to find logical regions. Instead it considers logical regions and physical blocks together, with the goal of minimizing the total cost. This is natural because the smallest unit of file access is a block, such as 64MB or 128MB, and a logical region can span a sequence of adjacent physical blocks.
  • the first algorithm can be executed in the area division module 5.
  • The first algorithm is the offline form of a relatively fast optimal algorithm.
  • The first algorithm can be repeated periodically to adapt to the dynamics of the access pattern. "Relatively fast" means that the algorithm runs in pseudo-polynomial time.
  • In essence, the algorithm first represents the shared file as a sequence of blocks, then partitions the file at block granularity according to the given access requests, and finally uses dynamic programming to find the optimal region division among these partitions.
  • Within each region the data is striped across HServers and SServers, and a logical I/O request can be served by one or more physical requests for the requested data.
  • Compared with traditional strategies, the total access cost is minimized with respect to the defined cost model.
  • An area is defined with a size of It can be expressed as:
  • ⁇ 1 is the expansion factor of SServer relative to HServer
  • B represents the block size in the configuration.
  • FIG. 5 is an application example diagram of a file system in an embodiment of the present invention.
  • The file client 1 issues requests on behalf of the application running on the computing server 301.
  • The hybrid storage system 100 is responsible for storing and managing the striped regions.
  • The metadata server 200 holds the descriptions of the files stored in RLFS.
  • The client 1 first contacts the metadata server (MDS) to obtain the file metadata, and then uses it to access data on the hybrid storage system 100 through the RLFS daemon.
  • One file server in the parallel file system is used to measure the context switch time, the unit data copy time, and the unit data transfer times of HServer and SServer in read/write mode. These parameters can vary with the I/O mode.
  • a pair of nodes, a client node and a file server are used to estimate the network transmission time. The network transmission time test can be repeated thousands of times, and then the average of them is calculated as the parameter value of the generated cost model.
  • To lay out a specific file optimally, RLFS first uses its area division module 5 to compute the optimal region division of the file, and then uses the cost model and the I/O trace data to determine the stripe size of each region.
  • The computed optimal region information is used to write the file to each server; at the same time, the area division module 5 creates an RST in the MDS for subsequent reads of the file.
  • The MDS holds the RLFS namespace, the RST, and other information about each file.
  • The size of the MDS is tightly controlled and kept small.
  • the hybrid storage system 100 maintains a flat namespace, where each file can be identified by "filename_region#_stripe#" in the local disk.
  • filename can contain path information specified by the application.
  • The background I/O daemon receives incoming requests from client 1, each identified by "filename", "region#", and "stripe#", and serves a request by sending back the requested stripe file. The stripe file is then combined with other stripe files to meet the needs of the application.
  • The file system (RLFS) proposed by the present invention supports region-level data layout by dividing a file into a set of optimal regions and determining the stripe size of each. The file system can therefore optimize the data layout of the hybrid storage system 100.
  • Using the FUSE module 10 not only greatly simplifies development but also makes RLFS accessible through the standard file system interface, so that applications can use it transparently. The variable stripe sizes of RLFS ease load imbalance and adapt flexibly to workload changes and server heterogeneity, thereby significantly speeding up the I/O system.
  • RLFS uses the optimal data layouts of {32KB, 160KB} and {36KB, 148KB}, which improve I/O performance by 73.4% and 176.7%, respectively, compared to the default layout with 64KB fixed-size stripes. Compared with other layouts with different but fixed-size stripes, RLFS improves read performance to 138.6% and write performance to 177.6%. Compared with the randomly selected stripe strategy, RLFS improves read performance to 154.5% and write performance to 215.4%.
  • The invention also provides a data layout method, which includes:
  • Step S1: collect into a trace file the runtime I/O information of data access and the file system configuration file used for cost modeling; the configuration file is used to build the cost model, and the I/O information is used for region division.
  • Step S2: calculate or estimate the access cost of file requests to form a cost model.
  • Step S3: generate, according to the cost model, the distribution of regions with the minimum total cost, and divide the file into those regions.
  • Step S4: obtain the stripe size corresponding to each region.
  • This method supports region-level data layout by dividing the file into a set of optimal regions and determining the stripe size of each, which optimizes the data layout of the hybrid storage system.
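The dynamic-programming step named in the list above (find the cheapest way to cut a sequence of blocks into k contiguous regions) can be sketched as follows. This is a generic illustration and not the recurrence of the patent, which is given only as a figure; the function `best_partition`, the `counts` trace, and the toy cost function `spread` are all hypothetical.

```python
from typing import Callable, List, Tuple


def best_partition(
    num_blocks: int,
    k: int,
    region_cost: Callable[[int, int], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split blocks 0..num_blocks-1 into k contiguous regions of minimum total cost.

    region_cost(i, e) is the (hypothetical) cost of one region covering blocks i..e-1.
    """
    if not 1 <= k <= num_blocks:
        raise ValueError("need 1 <= k <= num_blocks")
    inf = float("inf")
    # cost[j][i]: minimum cost of covering blocks i..num_blocks-1 with exactly j regions
    cost = [[inf] * (num_blocks + 1) for _ in range(k + 1)]
    cut = [[-1] * (num_blocks + 1) for _ in range(k + 1)]
    cost[0][num_blocks] = 0.0  # nothing left to cover, no regions needed
    for j in range(1, k + 1):
        for i in range(num_blocks - 1, -1, -1):
            for e in range(i + 1, num_blocks + 1):  # first region is blocks i..e-1
                if cost[j - 1][e] == inf:
                    continue
                c = region_cost(i, e) + cost[j - 1][e]
                if c < cost[j][i]:
                    cost[j][i], cut[j][i] = c, e
    # Walk the recorded cut points to recover the region boundaries.
    regions, i = [], 0
    for j in range(k, 0, -1):
        e = cut[j][i]
        regions.append((i, e))
        i = e
    return cost[k][0], regions


# Hypothetical per-block access counts, as might be derived from a trace file.
counts = [1, 1, 1, 9, 9, 9]


def spread(i: int, e: int) -> float:
    """Toy region cost: squared deviation of access counts from the region mean."""
    seg = counts[i:e]
    mean = sum(seg) / len(seg)
    return sum((c - mean) ** 2 for c in seg)


total, regions = best_partition(len(counts), 2, spread)
```

In a real layout engine the toy `spread` function would be replaced by a cost derived from the cost model, evaluated over the requests that fall inside the candidate region.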

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a file system comprising a cost calculation module and a region dividing module. The cost calculation module is used for calculating or estimating the access cost of a file request in the file system, and outputs a cost model to the region dividing module; the region dividing module is used for dividing files into different regions so as to minimize the total cost of a given set of accesses; and the region dividing module is further used for obtaining the stripe size corresponding to each region. The invention further provides a data layout method.

Description

File system and data layout method
Technical field
The invention belongs to the technical field of data layout, and in particular relates to a file system and a data layout method.
Background
As large-scale data-intensive applications multiply across application domains, I/O (input/output) performance is becoming the bottleneck of storage systems. To address this, many parallel file systems (PFS) have been introduced into high-performance storage systems, including OrangeFS, Lustre, GPFS, PanFS, and PLFS. Each is briefly introduced below:
1. OrangeFS is a branch of the Parallel Virtual File System (PVFS). Like PVFS, it is a parallel file system designed for high-performance computing and high-performance data access. Compared with traditional PVFS, OrangeFS focuses on improving small-file performance, adding cross-server fault tolerance, and providing secure access control.
2. Lustre is a parallel file system for Linux clusters developed by HP, Intel, and Cluster File System together with the United States Department of Energy. Lustre uses a distributed lock management mechanism for concurrency control, and manages the communication links for metadata and file data separately.
3. GPFS is the abbreviation of General Parallel File System. GPFS, from IBM, is a scalable, high-performance, general-purpose parallel file system based on shared disks. It provides parallel, high-speed, secure, and reliable data access for all nodes in the storage system.
4. PanFS is a parallel file system developed by Panasas. It is a general-purpose parallel file system whose main application field is currently similar to that of Lustre. PanFS is scalable and provides strong consistency through distributed locks.
5. PLFS is an open-source parallel file system for checkpoint storage.
In summary, these parallel file systems distribute data files across multiple servers, so a parallel file system (PFS) allows the multiple tasks of a parallel application to access data files concurrently with aggregated I/O bandwidth.
However, existing parallel file systems are not without defects: they do not match hybrid storage systems based on new storage technologies. Before describing this mismatch, the hybrid storage system itself needs to be explained. With the development of new storage technologies, flash-based solid-state drives (SSD) are increasingly widely used. Compared with hard disk drives (HDD), solid-state drives offer high storage efficiency and fast response, but at a high cost. A reasonable storage system should therefore not consist entirely of hard disk drives, whose read/write and response speeds are slow, nor entirely of expensive solid-state drives. In other words, solid-state drives will not completely replace hard disk drives in a large cluster. A hybrid storage system that includes both solid-state-drive-based servers and hard-disk-drive-based servers is therefore a preferred strategy, and it is especially practical for HPC systems with limited budgets. HPC is the abbreviation of High Performance Computing.
On the other hand, the efficiency of a parallel file system (PFS) depends on an effective data file layout, that is, on how data files are distributed over the available nodes. Most existing layout schemes use fixed-size stripes to distribute a data file over multiple servers and to provide concurrent data access from those servers, so that data is placed evenly on every server. Although such schemes are simple to implement and widely used, they are clearly suited to storage systems built from homogeneous servers and not to hybrid storage systems.
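As a minimal sketch of the fixed-size striping described above, the following maps a byte offset in a file to the server that holds it, assuming stripes are dealt out round-robin. The function name `locate` and the round-robin order are illustrative assumptions, not taken from any particular parallel file system.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, num_servers: int) -> Tuple[int, int]:
    """Map a file byte offset to (server index, byte offset within that server's data)."""
    stripe_index = offset // stripe_size        # which stripe the byte falls in
    server = stripe_index % num_servers         # stripes are dealt out round-robin
    local_stripe = stripe_index // num_servers  # stripes already placed on that server
    return server, local_stripe * stripe_size + offset % stripe_size


STRIPE = 64 * 1024  # 64 KB fixed-size stripes on four servers
first = locate(0, STRIPE, 4)
second = locate(STRIPE, STRIPE, 4)
wrapped = locate(4 * STRIPE + 10, STRIPE, 4)
```

Every server receives the same stripe size regardless of how fast it is, which is the property the following paragraph identifies as the problem on heterogeneous servers.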
When an existing parallel file system is applied to a hybrid storage system, the performance gap between solid-state-drive-based servers and hard-disk-drive-based servers significantly reduces the performance of the parallel file system, because the solid-state-drive-based servers always perform better and therefore need less I/O time to complete the same amount of data access. If an existing layout scheme is applied, it allocates the same stripe size to solid-state-drive-based and hard-disk-drive-based servers, which may cause severe load imbalance between the heterogeneous servers. In addition, complex I/O workloads may also jeopardize the efficiency of the I/O system.
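The imbalance can be illustrated with a small calculation. The sketch below assumes one slow and one fast server with made-up per-kilobyte service times, and treats a striped request as finished only when its slowest sub-request finishes; all numbers and names are hypothetical.

```python
def completion_time(stripes_kb, unit_times):
    """A striped request finishes when its slowest sub-request finishes."""
    return max(size * t for size, t in zip(stripes_kb, unit_times))


# Hypothetical service times in microseconds per KB: one HServer (slow), one SServer (fast).
T_H, T_S = 10.0, 2.0

# Same 64 KB stripe on both servers: the HServer dominates the completion time.
equal = completion_time([64, 64], [T_H, T_S])

# Same 128 KB in total, split in inverse proportion to the unit times.
h = 128 * T_S / (T_H + T_S)
s = 128 - h
balanced = completion_time([h, s], [T_H, T_S])
```

With equal stripes the fast server sits idle while the slow one finishes; giving the fast server a larger stripe shortens the request without moving any less data.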
发明内容Summary of the invention
有鉴于此,为解决现有的并行文件系统(PFS)与基于新型存储技术的混合型存储系统匹配时所产生的数据分布不合理的问题,本发明提供一种文件系统,所述文件系统包括相互电性连接的I/O示踪器、成本计算模块和区域划分模块,所述I/O示踪器用于向所述区域划分模块提供自身收集到所述文件系统运行时的I/O信息;所述I/O示踪器还用于向所述成本计算模块提供自身收集到的所述文件系统的配置文件;所述成本计算模块用于计算或预估所述文件系统中文件请求的访问成本,以向所述区域划分模块输出成本模型;所述区域划分模块用于根据所述成本模型生成总成本最小化的分布区域,并将文件划分到的不同区域中,所述区域划分模块还用于获得所述区域对应的条带大小。In view of this, in order to solve the problem of unreasonable data distribution when the existing parallel file system (PFS) is matched with the hybrid storage system based on the new storage technology, the present invention provides a file system, the file system includes An I/O tracer, a cost calculation module, and an area division module that are electrically connected to each other, and the I/O tracer is used to provide the area division module with the I/O information collected by itself when the file system is running The I/O tracer is also used to provide the cost calculation module with the configuration file of the file system collected by itself; the cost calculation module is used to calculate or estimate the file request in the file system Access cost to output a cost model to the area dividing module; the area dividing module is used to generate a distribution area with a minimum total cost according to the cost model, and divide the file into different areas, the area dividing module It is also used to obtain the stripe size corresponding to the area.
较佳地,所述文件系统还包括内核部分,所述内核部分用于执行元数据服务器、混合型存储系统和客户端三方之间的信息或数据的交互;所述内核部分包括FUSE模块。Preferably, the file system further includes a kernel part, and the kernel part is used to perform information or data interaction between a metadata server, a hybrid storage system, and a client; the kernel part includes a FUSE module.
较佳地,所述文件系统还包括守护进程模块,所述守护进程模块用于在后台执行守护进程;所述FUSE模块用于作为所述守护进程的代理。Preferably, the file system further includes a daemon process module, the daemon process module is used to execute the daemon process in the background; and the FUSE module is used as an agent of the daemon process.
Preferably, the file system further includes a data layout update module, which is connected to the daemon module, the I/O tracer, the region division module, and the hybrid storage system, respectively, and is configured to dynamically detect and update region changes.
Preferably, the cost calculation module is configured to calculate the total cost of a request as $T = T_s + T_c + T_2$, where $T_s$ denotes the time for the two context switches between the FUSE module and the daemon module, $T_c$ denotes the copy time, and $T_2$ denotes the network and storage cost.
Preferably, the hybrid storage system includes solid-state-drive-based servers (SServers) and hard-disk-drive-based servers (HServers);
the copy time is calculated as $T_c(r, h, s) \approx 3(mh + ns)\,t_c$, where $t_c$ denotes the unit data copy time from kernel space to user space, $h$ denotes the stripe size on an HServer, $s$ denotes the stripe size on an SServer, $m$ denotes the number of HServers, and $n$ denotes the number of SServers;
the network and storage cost is calculated as $T_2 \approx T_e + \max\{h(t_h + t),\ s(t_s + t)\}$, where $t$ denotes the network time for data transmission, $t_h$ and $t_s$ denote the unit data transfer time on an HServer and on an SServer, respectively, and $T_e$ denotes the network connection time.
Preferably, the region division module is configured to obtain the minimum cost [Formula image: Figure PCTCN2019121301-appb-000001] of dividing $l$ events into $k$ regions starting from event $i$, where the minimum cost is calculated as:
[Formula image: Figure PCTCN2019121301-appb-000003]
In the formula, [Formula image: Figure PCTCN2019121301-appb-000004] defines a region of size [Formula image: Figure PCTCN2019121301-appb-000005], and [Formula image: Figure PCTCN2019121301-appb-000006] denotes the cost of the first region of size $f$.
Preferably, the solid-state-drive-based servers and the hard-disk-drive-based servers stripe [Formula image: Figure PCTCN2019121301-appb-000007] to obtain $h_i$ and $s_i$, respectively, where $s_i = \alpha h_i$ and $h_i$ is calculated as:
[Formula image: Figure PCTCN2019121301-appb-000008]
In the formula, $\alpha \ge 1$ is the expansion factor of the SServer relative to the HServer, and $B$ denotes the block size in the configuration.
The present invention further provides a data layout method, including:
Step S1: collecting, into a trace file, the I/O information of runtime data accesses and the file system profile used for cost modeling; directing the file system profile to the building of the cost model, and using the I/O information for region division;
Step S2: calculating or estimating the access cost of file requests to form a cost model;
Step S3: generating, according to the cost model, a division into regions that minimizes the total cost, and dividing the file into those regions;
Step S4: obtaining the stripe size corresponding to each region.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects:
The file system proposed by the present invention supports region-level data layout by dividing a file into a set of optimal regions, and it can determine both the optimal regions and their stripe sizes. The file system can therefore optimize the data layout of the hybrid storage system 100. It adapts flexibly to workload changes and server heterogeneity, thereby significantly improving I/O system performance.
Brief description of the drawings
FIG. 1 is a schematic diagram of a prior-art data layout scheme using fixed-size stripes;
FIG. 2 is a schematic diagram of a data layout scheme based on region division;
FIG. 3 is a schematic diagram of a file system based on region-level data layout;
FIG. 4 is a schematic diagram of the working principle of the cost calculation module in an embodiment of the present invention;
FIG. 5 is a diagram of an application example of the file system in an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and are not intended to limit it.
The technical solutions of the present invention are explained below by way of specific embodiments.
Embodiment
In general, most PFSs adopt one of three typical data layouts: 1-DH, 1-DV, and 2-D. In the 1-DH layout, a client process accesses data from all storage servers. In contrast, in the 1-DV layout a client process accesses data from only a single storage server. The 2-D layout lies between the two: a client process accesses data from a subset of the storage servers.
First, the data layout of most PFSs is compared with that of the region-level file system, taking two concurrent reads of a file of size 9x as an example. The two concurrent reads are defined as the first read 71 and the second read 72, and both proceed along the time axis 73.
With the data layout of a conventional file system, as shown in FIG. 1, the file is evenly partitioned and stored across the storage servers with a stripe size of 3x. Consequently, with the three servers finishing at the same time, each request takes a time of 3x, and the two read requests take 6x in total. This layout ignores the differences between the storage media in a hybrid storage system, so the read/write efficiency of the higher-performance media cannot be fully exploited; in effect, the higher-performance media are forced to operate in a degraded manner.
As shown in FIG. 2, the data layout scheme of the region-level file system has a clear advantage. The region-level file system RLFS divides the file 7 into two regions, a first region 73 and a second region 74, and each region is striped across all servers with its own stripe size (x or 2x). The second read 72 is split into two parts, and its total time corresponds to reading 3x (3x = x + 2x). The time of the second read 72 is therefore the same under both layouts, whereas the time of the first read 71 under the region-level file system is reduced to the time needed to read 2x.
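The comparison between FIG. 1 and FIG. 2 can be written out as simple arithmetic. The block below uses only the per-request times stated in the text, with time measured in units of x; the labels $T_{\text{fixed}}$ and $T_{\text{region}}$ are introduced here for illustration.

```latex
\begin{aligned}
T_{\text{fixed}}  &= \underbrace{3x}_{\text{first read 71}} + \underbrace{3x}_{\text{second read 72}} = 6x,\\
T_{\text{region}} &= \underbrace{2x}_{\text{first read 71}} + \underbrace{(x + 2x)}_{\text{second read 72}} = 5x.
\end{aligned}
```

On these figures the region-level layout saves x in total, all of it on the first read.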
This example shows that, compared with a conventional data layout, the region scheme in RLFS is a finer-grained and more adaptive data layout scheme, with different stripe sizes across the storage servers. The region scheme in RLFS can thus be regarded as a variant of the 1-DH layout: RLFS aggregates the bandwidth of all storage servers to maximize I/O performance, and it is well matched to the hybrid storage system 100.
RLFS is designed to support region-based data layout by using file stripes of different sizes. To accommodate both the hybrid storage system 100 and complex I/O workloads, RLFS adopts a partitioning approach to achieve an optimal data layout. RLFS builds a cost model and, according to it, divides a large file into a set of regions, each of which keeps its own stripe sizes. The optimal regions and their stripe sizes are those that minimize the total cost of all I/O requests of the application.
The storage system in this embodiment is a hybrid storage system 100, which includes solid-state-drive-based servers 102 and hard-disk-drive-based servers 101. A solid-state-drive-based server is abbreviated as SServer, and a hard-disk-drive-based server as HServer.
An embodiment of the present invention provides a file system called the Region Level File System, or RLFS for short. The file system supports region-level data layout and solves the data distribution problem found in existing parallel file systems. RLFS relies on a defined cost model and a per-region, heterogeneity-aware scheme to determine the optimal file stripe size for each server, and it further uses changes in the access pattern to adjust the region scheme at runtime.
More specifically, a cost model is first developed for RLFS to estimate the completion time of region accesses. Dynamic programming is then used to divide the file into fine-grained regions, and the optimal file stripe sizes selected for each region are assigned to the HDD and SSD servers. RLFS is aware of both the storage system and the application. In essence, it represents a shift from the conventional one-dimensional layout with a fixed stripe size to a two-dimensional layout with varying stripe sizes, which lets it adapt well to changes in server performance and application behavior. In addition, RLFS updates the generated data layout scheme according to detected changes in the access pattern, which overcomes the limitations of a static data layout and makes the layout better suited to file access at runtime.
As shown in FIG. 3, the file system provided by the embodiment of the present invention, the Region Level File System (RLFS), supports region-level data layout and solves the data distribution problem found in existing parallel file systems.
RLFS includes a kernel part and a user-level daemon module 20. Preferably, the kernel part includes a FUSE module 10; in other words, the file system RLFS provided by the embodiment of the present invention is preferably designed on the FUSE framework. FUSE stands for Filesystem in Userspace. The kernel part is preferably a Linux kernel module and further includes a VFS module 11, where VFS stands for Virtual File System. The VFS module 11 is used to register RLFS. A block device is created in the kernel part and serves as the interface between the daemon module 20 and the kernel part. The FUSE module 10 acts as a proxy of the daemon module 20 for the various file system requests issued by applications.
An application on the client 1 can access RLFS by mounting it into its namespace. From then on, all file system calls directed at the mount point are forwarded through the VFS module 11 to the FUSE module 10. The FUSE module 10 then relays the calls in the request queue to the daemon module 20 through the block device, where the appropriate service handler is invoked to serve each file system call by contacting the metadata server 200 and/or the storage servers. The response propagates back along the reverse path through the kernel part and finally reaches the application, which usually blocks after issuing a request until the response arrives. The RLFS daemon process and the storage servers together implement the full semantics of a PFS. For example, the read handler first identifies which storage servers hold the requested data segments and where in each server those segments are stored, and then issues sub-requests to these servers for parallel access. The kernel part further includes a file log module 12 for recording the log of operations on the metadata server 200.
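As an illustration of how a read handler might split one request into per-server sub-requests, the sketch below assumes that each region is striped round-robin, first over the m HServers with stripe size h and then over the n SServers with stripe size s. This striping order, the `Region` class, the server names `H0`, `S0`, and the function `split_read` are all assumptions made for the example; the source does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int   # file offset at which the region begins
    length: int  # region length
    h: int       # stripe size on each HServer (assumed > 0)
    s: int       # stripe size on each SServer (assumed > 0)


def split_read(offset, size, regions, m, n):
    """Return {server_name: amount_requested} for a read of [offset, offset + size)."""
    per_server = defaultdict(int)
    end = offset + size
    for reg in regions:
        lo = max(offset, reg.start)
        hi = min(end, reg.start + reg.length)
        if lo >= hi:
            continue  # the read does not touch this region
        # One stripe row (assumed order): m HServer stripes, then n SServer stripes.
        stripes = [(f"H{j}", reg.h) for j in range(m)]
        stripes += [(f"S{j}", reg.s) for j in range(n)]
        row = sum(width for _, width in stripes)
        pos = lo
        while pos < hi:
            rel = (pos - reg.start) % row  # position inside the current stripe row
            acc = 0
            for name, width in stripes:
                if rel < acc + width:
                    # Take the rest of this stripe, or whatever is left of the read.
                    take = min(acc + width - rel, hi - pos)
                    per_server[name] += take
                    pos += take
                    break
                acc += width
    return dict(per_server)


# Hypothetical layout: two regions, 2 HServers and 1 SServer.
regions = [Region(start=0, length=30, h=1, s=2),
           Region(start=30, length=60, h=2, s=4)]
print(split_read(0, 8, regions, m=2, n=1))
```

Each entry of the returned mapping corresponds to one sub-request that could be issued to that server in parallel.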
In addition to the general semantics of a PFS, RLFS also needs to implement region-based data layout. To this end, RLFS is equipped with three user-level components, namely the I/O tracer 3, the cost calculation module 4, and the region division module 5, through which it completes a three-phase data layout cycle. The cycle begins with the tracing phase, in which the I/O tracer 3 collects, during application execution, runtime statistics of data accesses together with the file system profile used for cost modeling (for example, FUSE queue information) into a trace file. Next, in the analysis phase, the I/O tracer 3 feeds the read/write traces to the region division module 5 and directs the file system profile to the cost calculation module 4; the region division module 5 uses the updated cost model to generate regions, each of which is assigned its own stripe sizes for the two types of servers. Finally, in the placement phase, the file is placed on the underlying hybrid storage system 100 at runtime, so that I/O requests in subsequent runs are optimized according to the layout scheme obtained in the previous phase. Through these three phases, RLFS can greatly improve the I/O performance of an application in its subsequent runs. RLFS further includes a data layout update module 8, which is connected to the daemon module 20, the I/O tracer 3, the region division module 5, and the hybrid storage system 100, respectively, and is configured to dynamically detect region changes and update the data layout accordingly. The specific functions of the I/O tracer 3, the cost calculation module 4, and the region division module 5 are described in turn below:
1. I/O tracer 3
In RLFS, the I/O tracer 3 collects both runtime I/O information and the file system profile. Although existing techniques and tools for I/O data collection are available, such as IOSIG [42], the file system provided by the embodiment of the present invention is designed on the FUSE framework, so existing I/O data collection tools such as IOSIG [42] cannot be applied to RLFS directly. This is a consequence of the inherent structure of the FUSE framework. The I/O tracer 3 of this embodiment therefore follows the N-1 logging pattern, in which all RLFS daemon processes write to a single shared file. Designed in this way, the I/O tracer 3 can collect all information about I/O operations, including the file access type, the operation time, and other process-related data.
After the application has been run with the I/O tracer 3, the process ID, file descriptor, operation type, offset, request size, and timestamp of each request are available. To facilitate the subsequent region division and guide the optimal data layout, all I/O requests on a file are sorted in ascending order of offset.
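A trace entry with the fields listed above, and the offset-ordered view used for region division, might be represented as in the following sketch. The `TraceRecord` class, its field names, and the sample values are hypothetical; the source does not describe the trace file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    pid: int          # process ID
    fd: int           # file descriptor
    op: str           # operation type, e.g. "read" or "write"
    offset: int       # starting offset of the request within the file
    size: int         # request size
    timestamp: float  # time at which the request was observed


def sort_by_offset(records):
    """Return the records in ascending order of offset (stable for equal offsets)."""
    return sorted(records, key=lambda rec: rec.offset)


# Made-up sample records, in arrival order.
records = [
    TraceRecord(pid=101, fd=3, op="read", offset=4096, size=1024, timestamp=0.10),
    TraceRecord(pid=102, fd=3, op="write", offset=0, size=512, timestamp=0.20),
    TraceRecord(pid=101, fd=3, op="read", offset=2048, size=1024, timestamp=0.30),
]
for rec in sort_by_offset(records):
    print(rec.offset, rec.op, rec.size)
```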
Runtime I/O information is collected in a particular environment, and additional parameters are needed to interpret it fully. Therefore, besides the I/O information, the I/O tracer 3 should also collect the runtime profile of the file system, in particular that of a file system built on the FUSE framework. This profile is directed to the cost calculation module 4 to help update the cost model, and the region division module 5 then determines the optimal region division according to the minimum total cost obtained by the cost calculation module 4.
2. Cost calculation module 4
The cost calculation module 4 generates the cost model, whose objective is to find the minimum total cost.
To obtain the optimal region division and the stripe size for each server in the storage system, the file system proposed in the embodiment of the present invention relies on the cost calculation module 4. In the cost calculation module 4, the cost is defined as the I/O completion time of each file request. The cost calculation module 4 calculates the cost of file request accesses in the file system, which is matched to the hybrid storage system.
Because the total access cost of a file request depends on the file system itself as well as on the underlying network and storage servers, it comprises the system cost of the file system and the network and storage cost. The cost calculation module 4 must therefore account for both. Correspondingly, the cost calculation module 4 includes a system cost calculation module 41 and a network and storage cost calculation module 42.
Because the file system proposed in the embodiment of the present invention is built on the FUSE framework, its system cost is mainly the time overhead in the FUSE data path. Since the main goal of RLFS is to optimize read requests through optimal placement of data files on the hybrid storage system 100, this embodiment defines the system cost only for read requests; the cost of write requests can be derived with the same parameters.
As shown in FIG. 4, the service time of each read request is divided into three parts: first, the time spent waiting in the queue of the FUSE module 10; second, the time for the two context switches between the FUSE module 10 and the daemon module 20; and third, the time of the three copy operations.
Data flows from the network system containing m HServers and n SServers to the daemon module 20, then from the daemon module 20 to the FUSE module 10, and is finally sent by the FUSE module 10 to the client 1.
The time a read request waits in the queue of the FUSE module 10 is closely tied to the applications running between the client 1 and RLFS. It depends not only on the I/O request pattern of the application but also on other factors introduced by the file system, such as page caching or interrupts, and is therefore difficult to estimate accurately. However, given that the multi-threading support of the RLFS daemon module 20 minimizes queueing delay, the waiting time of read requests in the queue of the FUSE module 10 can safely be assumed to be negligible, that is, $T_q = 0$.
Further, the context switch time is system-dependent and independent of the data size, so it can be treated as a constant. The time $T_s$ for the two context switches between the FUSE module 10 and the daemon module 20 is therefore $T_s = 2\mu$, where $\mu$ is the time of one context switch.
The time of the three copy operations, referred to as the copy time $T_c$, is proportional to the data size $r$ of the file request and is calculated as:
$T_c(r, h, s) = 3\,r\,t_c$
The data size $r$ of the file request is given by:
$r = m\,s_m + n\,s_n$
Here $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively, with $s_m \le h$ and $s_n \le s$, where $h$ is the stripe size on an HServer and $s$ is the stripe size on an SServer. The copy time $T_c$ can therefore further be expressed as:
$T_c(r, h, s) \approx 3(mh + ns)\,t_c$
where $t_c$ is the unit data copy time from kernel space to user space. The first part of the total cost, calculated by the system cost calculation module 41, is thus denoted $T_1$, with $T_1 = T_s + T_c$.
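The step from the exact copy time to its approximation can be spelled out as follows. The inequalities follow directly from $s_m \le h$ and $s_n \le s$; reading the "≈" in the text as adopting this upper bound is an interpretation, not something the source states explicitly.

```latex
\begin{aligned}
r &= m\,s_m + n\,s_n \;\le\; m\,h + n\,s && \text{since } s_m \le h,\ s_n \le s,\\
T_c(r,h,s) &= 3\,r\,t_c \;\le\; 3\,(m h + n s)\,t_c,\\
T_1 &= T_s + T_c \;\approx\; 2\mu + 3\,(m h + n s)\,t_c .
\end{aligned}
```

The bound is attained when every server receives a full stripe, that is, when $s_m = h$ and $s_n = s$.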
The network and storage cost calculated by the network and storage cost calculation module 42 comprises the network connection time $T_e$, the storage access time $T_a$, and the network transmission time $T_x$. In a PFS, a request is divided into a set of sub-requests, each of which is forwarded to a separate storage server for parallel execution. The cost of the network and storage part of a request is therefore determined by the maximum cost among all its sub-requests. Assuming that all servers of the same type (HServer or SServer) have the same network and storage configuration, the network transmission time $T_x$ can be determined from the data sizes ($s_m$ and $s_n$) and the network time for data transmission $t$, according to:
[Formula image: Figure PCTCN2019121301-appb-000009]
where $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively.
Similarly to the network transmission time $T_x$, the storage access time $T_a$ is determined by the sub-requests, according to:
[Formula image: Figure PCTCN2019121301-appb-000010]
where $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively, and $t_h$ and $t_s$ denote the unit data transfer time on an HServer and on an SServer, respectively.
Unlike the storage access time $T_a$ and the network transmission time $T_x$, the network connection time $T_e$ is a constant that does not depend on the data size. In summary, the network and storage cost $T_2$ calculated by the network and storage cost calculation module 42 can be expressed as:
[Formula image: Figure PCTCN2019121301-appb-000011]
The network and storage cost $T_2$ can further be expressed as:
$T_2 \approx T_e + \max\{h(t_h + t),\ s(t_s + t)\}$
where $h$ is the stripe size on an HServer and $s$ is the stripe size on an SServer.
From the cost calculation module 4 it follows that the total cost $T$ of a request can be expressed as $T = T_1 + T_2$. The total cost of a request is a function of parameters describing the application, the file system, and the data layout. It is therefore strongly affected by server heterogeneity and is determined by the server stripe sizes $h$ and $s$.
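The closed-form expressions for $T_s$, $T_c$, and $T_2$ are simple enough to evaluate directly. The following is a minimal sketch rather than the patented implementation: the class name `CostParams`, the function `request_cost`, and the example values are hypothetical, and all time parameters are assumed to be expressed per unit of data in one common time unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    m: int       # number of HServers
    n: int       # number of SServers
    mu: float    # time of one context switch
    t_c: float   # unit data copy time, kernel space to user space
    t: float     # network time for data transmission (assumed per unit of data)
    t_h: float   # unit data transfer time on an HServer
    t_s: float   # unit data transfer time on an SServer
    T_e: float   # network connection time (constant)


def request_cost(p, h, s):
    """Approximate total read cost T = T_s + T_c + T_2 for stripe sizes h and s."""
    T_s = 2 * p.mu                                        # two context switches
    T_c = 3 * (p.m * h + p.n * s) * p.t_c                 # three copy operations
    T_2 = p.T_e + max(h * (p.t_h + p.t), s * (p.t_s + p.t))  # slowest sub-request
    return T_s + T_c + T_2


# Made-up parameter values, for illustration only.
params = CostParams(m=2, n=1, mu=1.0, t_c=0.5, t=1.0, t_h=3.0, t_s=1.0, T_e=2.0)
print(request_cost(params, h=2, s=4))
```

With these illustrative values the HServer and SServer sub-requests finish at the same time, which is the situation a well-chosen pair of stripe sizes aims for, since the `max` term is governed by the slower server type.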
It should also be noted that reads and writes differ considerably on SServers: a write request involves more operations than a read. For write requests, the time $T_s$ for the two context switches between the FUSE module 10 and the daemon module 20 must additionally include the time for write amplification, garbage collection, and wear leveling.
To help explain the working principle of the cost calculation module 4, the parameters of the cost analysis model used in it are presented in tabular form in Table 1.
[Table image: Figure PCTCN2019121301-appb-000012]
Table 1. Parameters in the cost analysis model
3. Region division module 5
Guided by the cost model generated by the cost calculation module 4, the region division module 5 divides a file into different regions so as to minimize the total cost of a given set of accesses characterizing a parallel application. HARL is an existing region division scheme, but it handles region division and stripe size determination in two separate stages. The layout strategy of RLFS, by contrast, is holistic: it treats region division and stripe size determination in a unified manner, so RLFS can determine both at once. Rather than heuristically scanning the trace file to find logical regions as HARL does, RLFS considers logical regions and physical blocks together, with the minimum total cost as its objective. This is a natural choice because the smallest unit of file access is a block, for example 64 MB or 128 MB, and a logical region can naturally span a sequence of adjacent physical blocks.
The region division module 5 executes a first algorithm, which is an offline, optimal, and relatively fast algorithm that can be repeated periodically to adapt to the dynamic nature of the accesses. "Relatively fast" means that the algorithm runs in pseudo-polynomial time. In essence, the algorithm first represents the shared file as a sequence of blocks, then partitions the file in units of blocks according to the given access requests, and finally uses dynamic programming to find the optimal region division among these partitions.
Consider an example in which, given the I/O events of the access pattern (such as the start or end of an I/O operation), a file F has size L, defined as a number of segments (L = 12 segments), and sequences of adjacent segments are merged into regions, each region being separated by a red vertical dashed line. The data of each region is striped across the HServers and SServers, and a single logical I/O request may be served by multiple physical requests for the requested data. With this layout optimization, the total access cost under the defined cost model is minimized in comparison with conventional strategies.
Let
Figure PCTCN2019121301-appb-000013
denote the minimum cost when a file with $l$ request events, starting at index $i$, is divided into $k$ regions. It is computed by the following recursion for $0 \le i < l$:
Figure PCTCN2019121301-appb-000014
where
Figure PCTCN2019121301-appb-000015
defines a region whose size
Figure PCTCN2019121301-appb-000016
can be expressed as:
Figure PCTCN2019121301-appb-000017
Figure PCTCN2019121301-appb-000018
is striped across the HServers and SServers with stripe sizes $h_i$ and $s_i$, respectively.
The recursion yields the minimum cost of dividing $l$ events into $k$ regions starting from event $i$:
Figure PCTCN2019121301-appb-000019
As $m$ varies from 1 to $l - i$, the cost of the first region of size $f$,
Figure PCTCN2019121301-appb-000020
is added to the cost of the remainder,
Figure PCTCN2019121301-appb-000021
and the minimum sum is taken. When the number of segments is insufficient to support the remaining $k$-region division, set
Figure PCTCN2019121301-appb-000022
otherwise set
Figure PCTCN2019121301-appb-000023
Given the definition of
Figure PCTCN2019121301-appb-000024
the (sub)requests falling in the regions from
Figure PCTCN2019121301-appb-000025
to
Figure PCTCN2019121301-appb-000026
are determined next, and then the cost of
Figure PCTCN2019121301-appb-000027
denoted
Figure PCTCN2019121301-appb-000028
is calculated as follows:
Figure PCTCN2019121301-appb-000029
In the above formula, $T(r, h_i, s_i)$ is defined in the cost model. Assuming $s_i = \alpha h_i$, we have:
Figure PCTCN2019121301-appb-000030
Here, $\alpha \ge 1$ is the expansion factor of the SServer relative to the HServer, and $B$ denotes the block size in the configuration.
The four equations above yield the optimal region division of the file layout and minimize the cost of the given requests. The file is then placed region by region on the underlying heterogeneous servers, with a determined stripe size for each region. Thereafter, each request R can be served by reading the corresponding regions according to their stripe sizes.
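Because the recursion is reproduced here only as formula images, the following Python sketch illustrates the general technique (dynamic programming over contiguous segments) rather than the exact formulas of the embodiment. The names `best_partition` and `region_cost` are hypothetical, the file is assumed to be split into exactly `num_regions` regions, and an infinite cost is assumed when too few segments remain; the source sets a specific value in that case that is not recoverable from the text.

```python
from functools import lru_cache


def best_partition(num_segments, num_regions, region_cost):
    """Minimum-cost split of segments 0..num_segments-1 into contiguous regions.

    region_cost(start, m) is a caller-supplied (hypothetical) function giving
    the cost of a region made of m segments beginning at segment `start`.
    Returns (min_cost, region_lengths).
    """
    inf = float("inf")

    @lru_cache(maxsize=None)
    def solve(start, k):
        remaining = num_segments - start
        if k == 0:
            # All segments must be consumed once no regions are left.
            return (0.0, ()) if remaining == 0 else (inf, ())
        if remaining < k:
            # Assumption: too few segments for k non-empty regions.
            return (inf, ())
        best = (inf, ())
        # Try every size m for the first region, leaving at least one
        # segment for each of the k - 1 regions that follow.
        for m in range(1, remaining - (k - 1) + 1):
            tail_cost, tail = solve(start + m, k - 1)
            total = region_cost(start, m) + tail_cost
            if total < best[0]:
                best = (total, (m,) + tail)
        return best

    cost, lengths = solve(0, num_regions)
    return cost, list(lengths)
```

In practice `region_cost` would evaluate the cost model for the requests that fall in the candidate region, using the stripe sizes chosen for it. The table filled by `solve` has one entry per (start, k) pair, and each entry scans at most `num_segments` candidate sizes.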
FIG. 5 shows an application example of the file system in an embodiment of the present invention. As shown in FIG. 5, the file client 1 issues requests on behalf of applications on the computing server 301, the hybrid storage system 100 stores and manages the striped regions, and the metadata server 200 (MDS) holds the descriptive information of the files stored in RLFS. During a file operation, the client 1 first contacts the MDS to obtain the file metadata and then uses it to access data on the hybrid storage system 100 through the RLFS daemon.
RLFS logically maps one large file onto multiple small (region) files, each representing a file region with a similar I/O workload. Each region file is further striped across all HServers and SServers, and each stripe is stored as a separate data file on its storage server. To this end, the MDS maintains a region stripe table (RST) for each physical file in RLFS, as shown in Table 2 below, which records each region of the file by its offset and its stripe size on each server. The RST is created by the area division module 5 when the file is written to RLFS and is updated when the access pattern changes. To improve efficiency, the RST of a file to be read can be cached and un-cached, in the same directory as the application, when RLFS is mounted and unmounted.
Figure PCTCN2019121301-appb-000031
Table 2: Data structure of the region stripe table (RST)
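Table 2 is available here only as an image, so its exact fields are not known. As a rough illustration of the description in the text (each region recorded by its offset and its stripe size on each server), a region stripe table might be sketched in Python as follows; the class names, field names, and the lookup method are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionEntry:
    offset: int          # start offset of the region within the file, in bytes
    hserver_stripe: int  # stripe size used on each HServer, in bytes
    sserver_stripe: int  # stripe size used on each SServer, in bytes


class RegionStripeTable:
    """Hypothetical in-memory RST: regions sorted by starting offset."""

    def __init__(self, entries):
        self.entries = sorted(entries, key=lambda e: e.offset)
        self._offsets = [e.offset for e in self.entries]

    def lookup(self, file_offset):
        """Return (region_index, entry) for the region containing file_offset."""
        idx = bisect_right(self._offsets, file_offset) - 1
        if idx < 0:
            raise KeyError(f"offset {file_offset} precedes the first region")
        return idx, self.entries[idx]
```

A client holding a cached copy of such a table could map a request offset to its region, and hence to the stripe sizes to use, with a single binary search.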
For the cost calculation module 4, one file server in the parallel file system is used to measure the context switch time, the unit data copy time, and the unit data transfer time of the HServers and SServers in read and write modes; these parameters can vary with the I/O pattern. In addition, a pair of nodes, one client node and one file server, is used to estimate the network transfer time. The network transfer time test can be repeated thousands of times, and the average of the results is used as the parameter value for generating the cost model.
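As a minimal sketch of how the measured parameters might feed the cost model, the following Python code evaluates the formulas stated in claims 5 and 6: $T = T_s + T_c + T_2$, $T_c \approx 3(mh + ns)t_c$, and $T_2 \approx T_e + \max\{h(t_h + t), s(t_s + t)\}$. The class and function names are hypothetical, and treating $t$ as a per-unit transfer time with all quantities in consistent units is an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CostParams:
    switch_time: float   # T_s: two context switches between FUSE module and daemon
    copy_unit: float     # t_c: unit data copy time, kernel space to user space
    connect_time: float  # T_e: network connection time
    net_unit: float      # t: network data transfer time (assumed per unit of data)
    hserver_unit: float  # t_h: unit data transfer time on an HServer
    sserver_unit: float  # t_s: unit data transfer time on an SServer


def averaged(samples):
    """Average repeated measurements of one parameter."""
    return mean(samples)


def request_cost(h, s, m, n, p):
    """Total cost T = T_s + T_c + T_2 for stripe sizes h, s on m HServers, n SServers."""
    copy_cost = 3 * (m * h + n * s) * p.copy_unit
    # The slower of the two server classes bounds the network and storage cost.
    net_storage = p.connect_time + max(
        h * (p.hserver_unit + p.net_unit),
        s * (p.sserver_unit + p.net_unit),
    )
    return p.switch_time + copy_cost + net_storage
```

For example, each field of `CostParams` could be filled with `averaged(...)` applied to the corresponding series of repeated measurements before `request_cost` is evaluated for candidate stripe sizes.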
To produce the optimal data layout for a particular file, RLFS first uses its area division module 5 to compute the optimal region division of the file and then uses the cost model and the I/O trace data to determine the stripe size of each region. The computed region information is used to write the file to all servers concurrently, and the area division module 5 creates the RST for the file in the MDS for subsequent reads. The MDS holds the RLFS namespace, the RSTs, and other information about each file. Because the region-division algorithm yields only a limited number of regions, the size of the MDS is tightly bounded and remains small.
In addition, to facilitate parallel reads of each file, the hybrid storage system 100 maintains a flat namespace in which each file on the local disk is identified by "filename_region#_stripe#". Note that "filename" may include path information specified by the application. A background I/O daemon receives incoming requests from the client 1, each characterized by "filename", "region#", and "stripe#", and serves a request by sending back the requested stripe file, which is combined with the other stripe files to satisfy the application.
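The naming scheme can be sketched in Python as below. The sketch assumes that "region#" and "stripe#" stand for the literal words followed by a decimal index, which the text does not state explicitly, and the two helper names are hypothetical. Parsing splits from the right because the application-supplied "filename" may itself contain path separators and underscores.

```python
def stripe_file_name(filename, region, stripe):
    """Build the flat-namespace name for one stripe of one region."""
    return f"{filename}_region{region}_stripe{stripe}"


def parse_stripe_file_name(name):
    """Recover (filename, region, stripe) from a flat-namespace name."""
    # Split on the last two underscores only, so underscores inside
    # the original filename are preserved.
    base, region_part, stripe_part = name.rsplit("_", 2)
    if not (region_part.startswith("region") and stripe_part.startswith("stripe")):
        raise ValueError(f"not a stripe file name: {name!r}")
    return base, int(region_part[len("region"):]), int(stripe_part[len("stripe"):])
```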
The file system (RLFS) proposed by the present invention supports region-level data layout by dividing a file into a set of optimal regions; it determines the optimal regions and their stripe sizes and thereby optimizes the data layout of the hybrid storage system 100. Compared with an in-kernel approach, using the FUSE module 10 not only greatly simplifies development but also allows RLFS to be accessed through the standard file system interface, so that applications can access RLFS transparently. The variable-size layout of RLFS alleviates load imbalance among servers and adapts flexibly to workload changes and server heterogeneity, thereby significantly improving I/O performance.
The file system (RLFS) proposed in the embodiments of the present invention has been verified experimentally to be feasible and to perform well. The experimental results show that RLFS works well with the hybrid storage system 100 and greatly improves parallel I/O performance.
The experiments compared three data layout schemes: scheme one uses fixed-size stripes, scheme two uses randomly selected stripes, and scheme three is RLFS. For reads and writes, RLFS uses the optimal data layouts {32 KB, 160 KB} and {36 KB, 148 KB}, respectively, which improves I/O performance by 73.4% and 176.7% over the default layout with fixed 64 KB stripes. Compared with other layouts with different but fixed stripe sizes, RLFS improves read performance by up to 138.6% and write performance by up to 177.6%. Compared with the randomly selected stripe strategy, RLFS improves read performance by up to 154.5% and write performance by up to 215.4%.
Experimental results on representative benchmarks show that RLFS is a promising and feasible solution for hybrid parallel file systems, improving parallel I/O performance by 20.6% to 556.1% for reads and by 22.7% to 288.7% for writes.
The present invention also provides a data layout method, comprising:
Step S1: collecting the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a trace file, directing the file system configuration file to the building of the cost model and the I/O information to region division;
Step S2: calculating or estimating the access cost of file requests to form a cost model;
Step S3: generating, according to the cost model, a distribution of regions that minimizes the total cost, and dividing the file into the different regions;
Step S4: obtaining the stripe size corresponding to each region.
The beneficial effect of the above method is that it supports region-level data layout by dividing a file into a set of optimal regions, thereby determining the optimal regions and their stripe sizes and optimizing the data layout of the hybrid storage system.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (9)

  1. A file system, characterized in that the file system comprises an I/O tracer, a cost calculation module, and an area division module that are electrically connected to one another; the I/O tracer is configured to provide the area division module with the I/O information it collects while the file system is running; the I/O tracer is further configured to provide the cost calculation module with the file system configuration file it collects; the cost calculation module is configured to calculate or estimate the access cost of file requests in the file system and to output a cost model to the area division module; the area division module is configured to generate, according to the cost model, a distribution of regions that minimizes the total cost and to divide the file into the different regions; and the area division module is further configured to obtain the stripe size corresponding to each region.
  2. The file system according to claim 1, characterized in that the file system further comprises a kernel part configured to carry out the exchange of information or data among a metadata server, a hybrid storage system, and a client; the kernel part comprises a FUSE module.
  3. The file system according to claim 2, characterized in that the file system further comprises a daemon module configured to run a daemon process in the background; the FUSE module is configured to act as a proxy for the daemon process.
  4. The file system according to claim 3, characterized in that the file system further comprises an update data layout module connected to the daemon module, the I/O tracer, the area division module, and the hybrid storage system, respectively; the update data layout module is configured to dynamically detect and update region changes.
  5. The file system according to claim 4, characterized in that the cost calculation module is configured to calculate the total cost of a request as $T = T_s + T_c + T_2$, where $T_s$ is the time of the two context switches between the FUSE module and the daemon module, $T_c$ is the copy time, and $T_2$ is the network and storage cost.
  6. The file system according to claim 5, characterized in that the hybrid storage system comprises solid-state-drive-based servers (SServer) and hard-disk-drive-based servers (HServer);
    the copy time is calculated as $T_c(r, h, s) \approx 3(mh + ns)\,t_c$, where $t_c$ is the unit data copy time from kernel space to user space, $h$ is the stripe size on an HServer, $s$ is the stripe size on an SServer, $m$ is the number of HServers, and $n$ is the number of SServers;
    the network and storage cost is calculated as $T_2 \approx T_e + \max\{h(t_h + t),\, s(t_s + t)\}$, where $t$ is the network data transfer time, $t_h$ and $t_s$ are the unit data transfer times on an HServer and an SServer, respectively, and $T_e$ is the network connection time.
  7. The file system according to claim 3, 5, or 6, characterized in that the area division module is configured to obtain the minimum cost of dividing $l$ events into $k$ regions starting from event $i$,
    Figure PCTCN2019121301-appb-100001
    the minimum cost
    Figure PCTCN2019121301-appb-100002
    being calculated as:
    Figure PCTCN2019121301-appb-100003
    where
    Figure PCTCN2019121301-appb-100004
    defines a region of size
    Figure PCTCN2019121301-appb-100005
    and
    Figure PCTCN2019121301-appb-100006
    denotes the cost of the first region of size $f$.
  8. The file system according to claim 7, characterized in that the solid-state-drive-based servers and the hard-disk-drive-based servers are able to stripe
    Figure PCTCN2019121301-appb-100007
    with stripe sizes $h_i$ and $s_i$, respectively, where $s_i = \alpha h_i$ and $h_i$ is calculated as:
    Figure PCTCN2019121301-appb-100008
    where $\alpha \ge 1$ is the expansion factor of the SServer relative to the HServer, and $B$ denotes the block size in the configuration.
  9. A data layout method, characterized in that it comprises:
    Step S1: collecting the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a trace file, directing the file system configuration file to the building of the cost model and the I/O information to region division;
    Step S2: calculating or estimating the access cost of file requests to form a cost model;
    Step S3: generating, according to the cost model, a distribution of regions that minimizes the total cost, and dividing the file into the different regions;
    Step S4: obtaining the stripe size corresponding to each region.
PCT/CN2019/121301 2018-12-18 2019-11-27 File system and data layout method WO2020125362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811547400.9 2018-12-18
CN201811547400.9A CN109840247B (en) 2018-12-18 2018-12-18 File system and data layout method

Publications (1)

Publication Number Publication Date
WO2020125362A1 true WO2020125362A1 (en) 2020-06-25

Family

ID=66883264

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121301 WO2020125362A1 (en) 2018-12-18 2019-11-27 File system and data layout method

Country Status (2)

Country Link
CN (1) CN109840247B (en)
WO (1) WO2020125362A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840247B (en) * 2018-12-18 2020-12-18 深圳先进技术研究院 File system and data layout method
CN110825698B (en) * 2019-11-07 2021-02-09 重庆紫光华山智安科技有限公司 Metadata management method and related device
CN114578299A (en) * 2021-06-10 2022-06-03 中国人民解放军63698部队 Method and system for generating radio frequency signal by wireless remote control beacon device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1692356A (en) * 2002-11-14 2005-11-02 易斯龙系统公司 Systems and methods for restriping files in a distributed file system
US20090248756A1 (en) * 2008-03-27 2009-10-01 Akidau Tyler A Systems and methods for a read only mode for a portion of a storage system
CN102566942A (en) * 2011-12-28 2012-07-11 华为技术有限公司 File striping writing method, device and system
CN103778222A (en) * 2014-01-22 2014-05-07 浪潮(北京)电子信息产业有限公司 File storage method and system for distributed file system
WO2015153671A1 (en) * 2014-03-31 2015-10-08 Amazon Technologies, Inc. File storage using variable stripe sizes
CN109840247A (en) * 2018-12-18 2019-06-04 深圳先进技术研究院 File system and data layout method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005302152A (en) * 2004-04-12 2005-10-27 Sony Corp Composite type storage device, data writing method, and program
CN105872031B (en) * 2016-03-26 2019-06-14 天津书生云科技有限公司 Storage system
US9916311B1 (en) * 2013-12-30 2018-03-13 Emc Corporation Storage of bursty data using multiple storage tiers with heterogeneous device storage
CN104020961B (en) * 2014-05-15 2017-07-25 深信服科技股份有限公司 Distributed data storage method, apparatus and system
JP6346880B2 (en) * 2014-10-17 2018-06-20 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America recoding media
CN105760164B (en) * 2016-02-15 2020-01-10 苏州浪潮智能科技有限公司 Method for realizing ACL authority in user space file system
CN106326344B (en) * 2016-08-05 2018-09-18 中国水产科学研究院东海水产研究所 A kind of method of the management of distributing big data and retrieval
CN106528761B (en) * 2016-11-04 2019-06-18 郑州云海信息技术有限公司 A kind of file caching method and device
CN107479827A (en) * 2017-07-24 2017-12-15 上海德拓信息技术股份有限公司 A kind of mixing storage system implementation method based on IO and separated from meta-data
CN107734026B (en) * 2017-10-11 2020-10-16 苏州浪潮智能科技有限公司 Method, device and equipment for designing network additional storage cluster

Also Published As

Publication number Publication date
CN109840247B (en) 2020-12-18
CN109840247A (en) 2019-06-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19899730

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19899730

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10/11/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19899730

Country of ref document: EP

Kind code of ref document: A1