WO2020125362A1

WO2020125362A1 - File system and data layout method

Info

Publication number: WO2020125362A1
Application number: PCT/CN2019/121301
Authority: WO
Inventors: 王洋; 夏明辉; 须成忠
Original assignee: 深圳先进技术研究院
Priority date: 2018-12-18
Filing date: 2019-11-27
Publication date: 2020-06-25
Also published as: CN109840247B; CN109840247A

Abstract

The invention provides a file system comprising a cost calculation module and a region dividing module. The cost calculation module is used for calculating or estimating the access cost of a file request in the file system, and outputs a cost model to the region dividing module. The region dividing module is used for dividing files into different regions so as to minimize the total cost of a given set of accesses, and is further used for obtaining the stripe size corresponding to each region. The invention further provides a data layout method.

Description

File system and data layout method

Technical field

The invention belongs to the technical field of data layout, and particularly relates to a file system and a data layout method.

Background art

As large-scale data-intensive applications continue to grow in number across application areas, I/O (input/output) performance is becoming a bottleneck for storage systems. To address this problem, many parallel file systems (Parallel File System, PFS for short) have been introduced into high-performance storage systems, including OrangeFS, Lustre, GPFS, PanFS, and PLFS. Each is briefly introduced below:

1. OrangeFS is a branch of the Parallel Virtual File System (PVFS). Like PVFS, it is a parallel file system designed for high-performance computing and high-performance data access. Compared with PVFS, OrangeFS focuses on improving the performance of small-file processing, adding cross-server fault tolerance, and providing secure access control.

2. Lustre is a parallel file system for Linux clusters developed by HP, Intel, Cluster File Systems, and the United States Department of Energy. Lustre uses a distributed lock management mechanism for concurrency control, and manages the communication links for metadata and for file data separately.

3. GPFS is the abbreviation of General Parallel File System. GPFS from IBM is a scalable, high-performance, general-purpose parallel file system based on shared disks. GPFS can provide parallel, high-speed, safe, and reliable data access for all nodes in the storage system.

4. PanFS is a general-purpose parallel file system developed by Panasas. Its main application field at present is similar to that of Lustre. PanFS is scalable and provides strong consistency through distributed locks.

5. PLFS is an open source parallel checkpoint storage file system.

In summary, these parallel file systems distribute data files across multiple servers, so that the multiple tasks of a parallel application can access data files simultaneously using the aggregated I/O bandwidth of the servers.

However, existing parallel file systems are not without defects: they are poorly matched to hybrid storage systems built on new storage technologies. Before describing this mismatch, it is necessary to clarify what such hybrid storage systems are. With the development of new storage technologies, flash-based solid-state drives (Solid State Drive, SSD for short) have become increasingly popular and widely used. Compared with hard disk drives (Hard Disk Drive, HDD for short), solid-state drives offer high storage efficiency and fast response, but at a high cost. On balance, a storage system built entirely from hard disk drives is undesirable because its read, write, and response speeds are slow, and a storage system built entirely from solid-state drives is undesirable because of its cost. In other words, solid-state drives will not completely replace hard disk drives in large clusters. A preferred strategy is therefore a hybrid storage system that includes both solid-state-drive-based servers and hard-disk-drive-based servers. This strategy is particularly practical for high-performance computing (High Performance Computing, HPC for short) clusters with limited budgets.

On the other hand, the efficiency of a parallel file system depends on an effective data file layout, that is, on how data files are distributed across the available nodes. Most existing layout schemes split a data file into fixed-size stripes distributed across multiple servers, which provides concurrent data access from multiple servers and places data evenly on each server. Although such a layout scheme is simple to implement and widely used, it is suited to storage systems with homogeneous servers and not to hybrid storage systems.

When an existing parallel file system is applied to a hybrid storage system, the performance gap between solid-state-drive-based servers and hard-disk-drive-based servers significantly reduces the performance of the parallel file system. Solid-state-drive-based servers always outperform hard-disk-drive-based servers and need less I/O time to complete the same amount of data access. Under the existing layout scheme, both kinds of server are allocated stripes of the same size, which may result in severe load imbalance between heterogeneous servers. In addition, complex I/O workloads may further reduce the efficiency of the I/O system.

Summary of the invention

In view of this, in order to solve the problem of unreasonable data distribution when an existing parallel file system (PFS) is used with a hybrid storage system based on new storage technologies, the present invention provides a file system. The file system includes an I/O tracer, a cost calculation module, and an area division module that are electrically connected to one another. The I/O tracer is used to provide the area division module with the I/O information that it collects while the file system is running, and to provide the cost calculation module with the configuration file of the file system that it collects. The cost calculation module is used to calculate or estimate the access cost of file requests in the file system and to output a cost model to the area division module. The area division module is used to generate, according to the cost model, a region division with the minimum total cost and to divide the file into different regions; it is also used to obtain the stripe size corresponding to each region.

Preferably, the file system further includes a kernel part, and the kernel part is used to perform information or data interaction between a metadata server, a hybrid storage system, and a client; the kernel part includes a FUSE module.

Preferably, the file system further includes a daemon process module, the daemon process module is used to execute the daemon process in the background; and the FUSE module is used as an agent of the daemon process.

Preferably, the file system further includes an update data layout module, which is connected to the daemon module, the I/O tracer, the area division module, and the hybrid storage system, respectively, and which is used to dynamically detect and update region changes.

Preferably, the cost calculation module is used to calculate the total cost of a request, and the total cost is calculated as $T = T_s + T_c + T_2$, where $T_s$ represents the time of the two context switches between the FUSE module and the daemon module, $T_c$ represents the copy time, and $T_2$ represents the network and storage cost.

Preferably, the hybrid storage system includes a server SServer based on solid-state drives and a server HServer based on hard drives;

The copy time is calculated as $T_c(r, h, s) \approx 3(mh + ns)\,t_c$, where $t_c$ represents the time to copy a unit of data from kernel space to user space, $h$ represents the stripe size on HServer, $s$ represents the stripe size on SServer, $m$ represents the number of HServers, and $n$ represents the number of SServers;

The network and storage cost is calculated as $T_2 \approx T_e + \max\{h(t_h + t),\ s(t_s + t)\}$, where $t$ represents the network transmission time of a unit of data, $t_h$ and $t_s$ represent the unit data transmission time on HServer and SServer, respectively, and $T_e$ represents the network connection time.

Preferably, the area division module is used to obtain the minimum cost of dividing $l$ events into $k$ regions starting from event $i$. The minimum cost is calculated as:

[formula missing in source]

In the formula, [symbol missing in source] defines a region of size [symbol missing in source], and [symbol missing in source] represents the cost of the first region of size $f$.

Preferably, the data of each region is striped on the hard-disk-drive-based servers and the solid-state-drive-based servers with stripe sizes $h_i$ and $s_i$, respectively, where $s_i$ is calculated as $s_i = \alpha h_i$ and $h_i$ is calculated as:

[formula missing in source]

In the formula, $\alpha \ge 1$ is the expansion factor of SServer relative to HServer, and $B$ represents the block size in the configuration.

The invention also provides a data layout method, which includes:

Step S1, collect into a trace file the runtime I/O information of data access and the file system configuration file used for cost modeling, direct the file system configuration file to the establishment of the cost model, and use the I/O information for area division;

Step S2, calculate or estimate the access cost of the file request to form a cost model;

Step S3, generate a distribution area with a minimum total cost according to the cost model, and divide the file into different areas;

Step S4: Obtain the stripe size corresponding to the area.

Compared with the prior art, the beneficial effects of the embodiments of the present invention are:

The file system proposed by the present invention supports region-level data layout by dividing a file into a set of optimal regions, and it can determine the optimal regions and their stripe sizes. The file system can therefore optimize the data layout of the hybrid storage system 100. It adapts flexibly to workload changes and server heterogeneity, thereby significantly improving I/O system performance.

Brief description of the drawings

FIG. 1 is a schematic diagram of a data layout scheme using fixed-size strips in the prior art;

FIG. 2 is a schematic diagram of the data layout scheme based on area division;

FIG. 3 is a schematic diagram of a file system based on regional data layout;

FIG. 4 is a schematic diagram of the working principle of the cost calculation module in the embodiment of the present invention;

FIG. 5 is an application example diagram of a file system in an embodiment of the present invention.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application more clear, the following describes the present application in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

In order to explain the technical solution of the present invention, the following will be described with specific embodiments.

Embodiments

In general, most PFSs use three typical data layouts: 1-DH, 1-DV, and 2-D. In the 1-DH layout, a client process can access data from all storage servers. In contrast, in the 1-DV layout a client process can only access data from a single storage server. The 2-D layout lies between the two: a client process accesses data from a subset of the storage servers.

First, the data layout of most PFSs is compared with the data layout of the region-level file system, taking two concurrent reads of a file of size 9x as an example. The two concurrent reads are denoted the first read 71 and the second read 72, and they are performed along the time axis 73.

With the data layout of a traditional file system, as shown in FIG. 1, the file is evenly partitioned and stored on the three storage servers with a stripe size of 3x. Since a request completes only when the three servers have all completed, each request takes 3x time, and the two read requests take 6x time in total. This data layout ignores the differences between the storage media in a hybrid storage system, so the read and write efficiency of the higher-performance storage media cannot be fully exploited. In effect, the higher-performance storage media are forced to run in a degraded manner.

As shown in FIG. 2, the data layout scheme of the region-level file system has clear advantages. RLFS divides the file 7 into two regions, a first region 73 and a second region 74, and each region is striped across all servers with its own stripe size (x or 2x). The second read 72 is split into two parts, so its total time corresponds to reading 3x (3x = x + 2x), which is the same as under the traditional layout. The time of the first read 71, however, is reduced to the time required to read 2x.
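Taking the figures quoted in this example at face value, and measuring each read in units of the time needed to read x, the comparison can be summarized as follows. The symbols $T_{\text{fixed}}$ and $T_{\text{region}}$ are introduced here only for illustration and do not appear in the original text.

```latex
T_{\text{fixed}} = 3x + 3x = 6x,
\qquad
T_{\text{region}} = 2x + 3x = 5x
```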

From this example, it can be seen that the region scheme in RLFS is a finer-grained and more adaptive data layout scheme than the traditional data layout, with different stripe sizes on the storage servers for each region. The region scheme in RLFS can therefore be regarded as a variant of the 1-DH layout scheme: RLFS aggregates the bandwidth of all storage servers to maximize I/O performance, and is well matched to the hybrid storage system 100.

RLFS aims to support region-based data layout by using file stripes of different sizes. To adapt to both the hybrid storage system 100 and complex I/O workloads, RLFS takes a partition-based approach to obtaining the optimal data layout. RLFS generates a cost model and, according to it, divides a large file into a set of regions, each with its own stripe sizes. The optimal regions and their stripe sizes are those that minimize the total cost of all I/O requests of the application.

The storage system involved in this embodiment is a hybrid storage system 100. The hybrid storage system 100 includes solid-state-drive-based servers 102, referred to as SServers, and hard-disk-drive-based servers 101, referred to as HServers.

An embodiment of the present invention provides a file system called a region-level file system, that is, Region Level File System, or RLFS for short. The file system supports regional data layout and solves the data distribution problem in existing parallel file systems. RLFS relies on a defined cost model and a per-region heterogeneity-aware scheme to determine the optimal file stripe size for each server, and further adjusts the region scheme at runtime as the access pattern changes.

More specifically, a cost model is first developed for RLFS to estimate the completion time of region accesses, so that the file can be divided into fine-grained regions using dynamic programming; each region is then assigned the optimal file stripe sizes for the HDD and SSD servers. RLFS is aware of both the storage system and the application. In essence, RLFS represents a change from the traditional one-dimensional layout with a fixed stripe size to a two-dimensional layout with varying stripe sizes, and it adapts well to variations in server performance and application behavior. In addition, RLFS updates the generated data layout scheme when a change in the access pattern is detected, which addresses the static data layout problem and makes the layout better suited to file access at runtime.

FIG. 3 shows the structure of the region-level file system (RLFS) provided by the embodiment of the present invention.

RLFS comprises a kernel part and a user-level daemon module 20. Preferably, the kernel part includes the FUSE module 10; in other words, the file system RLFS provided by the embodiments of the present invention is preferably designed on the FUSE framework. FUSE is an abbreviation of Filesystem in Userspace. The kernel part is preferably a Linux kernel module and further includes a VFS module 11, where VFS is an abbreviation of Virtual File System. The VFS module 11 is used to register RLFS. A block device is created in the kernel part and acts as the interface between the daemon module 20 and the kernel part. The FUSE module 10 acts as an agent of the daemon module 20 for the various file system requests issued by applications.

An application on the client 1 can access RLFS by mounting RLFS into its name space; thereafter, all file system calls directed to the mount point are forwarded to the FUSE module 10 through the VFS module 11. The FUSE module 10 then relays the calls in its request queue to the daemon module 20 through the block device, where an appropriate service handler is invoked to serve each file system call by contacting the metadata server 200 and/or the storage servers. The response travels back through the kernel part along the reverse path and eventually reaches the application, which usually waits for the response after issuing a request. The RLFS daemon and the storage servers together implement all PFS semantics. For example, the read handler first identifies which storage servers hold the requested data segments and then issues sub-requests to these servers for parallel access. The kernel part also includes a file log module 12 for recording operation logs for the metadata server 200.
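The read handler's step of mapping a logical read onto per-server sub-requests can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that, within a region, each striping round places one stripe of size h on each of the m HServers followed by one stripe of size s on each of the n SServers. The names `SubRequest` and `split_read` and the server ordering are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubRequest:
    server: int         # assumed order: 0..m-1 are HServers, m..m+n-1 are SServers
    server_offset: int  # offset within that server's data file
    length: int


def split_read(offset, length, m, n, h, s):
    """Split a logical read into per-server sub-requests.

    Assumed layout: each round holds one stripe of size h on every
    HServer, then one stripe of size s on every SServer (h, s > 0).
    """
    sizes = [h] * m + [s] * n
    round_size = m * h + n * s
    subs = []
    pos, end = offset, offset + length
    while pos < end:
        rnd, within = divmod(pos, round_size)
        # Walk the round to find the stripe that contains `within`.
        start = 0
        for server, size in enumerate(sizes):
            if within < start + size:
                break
            start += size
        in_stripe = within - start
        take = min(size - in_stripe, end - pos)
        # Each completed round contributes `size` units to this server's file.
        subs.append(SubRequest(server, rnd * size + in_stripe, take))
        pos += take
    return subs
```

With one HServer (h = 2) and one SServer (s = 4), a read of 10 units from offset 0 is split into four sub-requests that alternate between the two servers, with the SServer receiving twice as much data per round.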

In addition to the general semantics of a PFS, RLFS also needs to implement region-based data layout. To this end, RLFS is equipped with three user-level components: an I/O tracer 3, a cost calculation module 4, and an area division module 5. With these three components RLFS completes a three-phase data layout cycle. The cycle starts with the tracing phase, in which the I/O tracer 3 collects, during application execution, the runtime statistics of data access and the file system configuration file used for cost modeling (for example, FUSE queue information) into a trace file. In the subsequent analysis phase, the I/O tracer 3 feeds the read/write trace to the area division module 5 and directs the file system configuration file to the cost calculation module 4; the area division module 5 then uses the updated cost model to generate regions, each with its own stripe sizes for the two types of server. Finally, in the placement phase, the files are placed on the underlying hybrid storage system 100 at runtime according to the layout plan obtained in the previous phase, so as to optimize subsequent I/O requests. Through these three phases, RLFS can significantly improve the I/O performance of the application in subsequent runs. RLFS also includes an update data layout module 8, which is connected to the daemon module 20, the I/O tracer 3, the area division module 5, and the hybrid storage system 100, respectively. The update data layout module 8 dynamically detects region changes and updates the data layout accordingly. The specific functions of the I/O tracer 3, the cost calculation module 4, and the area division module 5 are explained separately below:

1. I/O tracer 3

The I/O tracer 3 is used in RLFS to collect both runtime I/O information and the file system configuration file. Although existing tools such as IOSIG [42] can be used for I/O data collection, the file system provided by the embodiment of the present invention is designed on the FUSE framework, and because of the inherent characteristics of that framework such tools cannot be applied directly to RLFS. The I/O tracer 3 in this embodiment therefore follows the N-1 logging mode, in which all RLFS daemons write to a single shared file. In this way, the I/O tracer 3 collects all information about I/O operations, including file access type, operation time, and other process-related data.

After running the corresponding application program using the I/O tracer 3, the process ID, file descriptor, operation type, offset, request size, and timestamp information can be obtained. In order to facilitate further area division and guide the optimal data layout, all I/O requests of the file are sorted in ascending order of their offsets.
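A trace record with the fields listed above, and the ascending-offset ordering, can be sketched as below. The class name `TraceRecord`, the field names, and the tie-break on timestamp are assumptions made for illustration; the original text specifies only the collected fields and the sort by offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    pid: int          # process ID
    fd: int           # file descriptor
    op: str           # operation type, e.g. "read" or "write"
    offset: int       # starting offset of the request
    size: int         # request size
    timestamp: float  # time at which the request was issued


def sort_by_offset(records):
    """Return the records in ascending order of offset.

    Ties are broken by timestamp, which is an assumption of this sketch.
    """
    return sorted(records, key=lambda r: (r.offset, r.timestamp))
```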

I/O information is collected in a specific runtime environment, and additional parameters are needed to interpret it fully. For this reason, in addition to I/O information, the I/O tracer 3 also collects the runtime configuration file of the file system, in particular that of a file system built on the FUSE framework. The configuration file is directed to the cost calculation module 4 to update the cost model, and the area division module 5 then determines the optimal region division according to the minimum total cost obtained from the cost calculation module 4.

2. Cost calculation module 4

The cost calculation module 4 can generate a cost model, and the cost model aims to find the minimum total cost.

In order to obtain the optimal area division and stripe size of each server in the storage system, the file system proposed in the embodiment of the present invention needs to rely on the cost calculation module 4. In the cost calculation module 4, the cost is defined as the I/O completion time of each file request. The cost calculation module 4 is used to calculate the cost of file request access in the file system. The file system is compatible with the hybrid storage system.

Since the total access cost of a file request is related to the file system itself and the underlying network and storage server, the total access cost of a file request includes the system cost and network and storage costs of the file system. Therefore, the calculation basis of the cost calculation module 4 should include the system cost of the file system and the network and storage costs. Correspondingly, the cost calculation module 4 includes a system cost calculation module 41 and a network and storage cost calculation module 42.

Since the file system proposed in the embodiment of the present invention is built on the FUSE framework, the system cost of the file system mainly refers to the time overhead in the FUSE data path. Since the main goal of RLFS is to optimize read requests through the optimal placement of data files on the hybrid storage system 100, only the system cost of read requests is defined in this embodiment; the cost of write requests can be derived in the same way using the same parameters.

As shown in FIG. 4, the service time of each read request is divided into three parts: first, the waiting time in the queue of the FUSE module 10; second, the time of the two context switches between the FUSE module 10 and the daemon module 20; and third, the time of the three copy operations, collectively referred to as the copy time.

Data flows from the network system containing m HServers and n SServers to the daemon module 20, and then from the daemon module 20 to the FUSE module 10, and finally sent to the client 1 by the FUSE module 10.

The time a read request waits in the queue of the FUSE module 10 is closely related to the application running between the client 1 and RLFS. It depends not only on the I/O request pattern of the application but also on other factors arising from the file system, such as page caching or interrupts, so it is difficult to estimate accurately. However, since the multi-threading support of the RLFS daemon module 20 minimizes queuing delay, it can be safely assumed that the waiting time of a read request in the queue of the FUSE module 10 is negligible, that is, $T_q = 0$.

Further, the context switching time depends on the system and is independent of the data size, so it can be regarded as a constant. The time $T_s$ of the two context switches between the FUSE module 10 and the daemon module 20 is therefore $T_s = 2\mu$, where $\mu$ is the time of one context switch.

The time of the three copy operations is collectively referred to as the copy time $T_c$. The copy time $T_c$ is proportional to the data size $r$ of the file request and is calculated as:

$T_c(r, h, s) = 3\,r\,t_c$

where the data size $r$ of the file request is:

$r = m s_m + n s_n$

Here $s_m$ and $s_n$ represent the maximum sub-request size on an HServer and on an SServer, respectively, with $s_m \le h$ and $s_n \le s$, where $h$ represents the stripe size on HServer and $s$ represents the stripe size on SServer. The copy time $T_c$ can therefore be further expressed as:

$T_c(r, h, s) \approx 3(mh + ns)\,t_c$

where $t_c$ is the time to copy a unit of data from kernel space to user space. The first part of the total cost, calculated by the system cost calculation module 41, is thus $T_1 = T_s + T_c$.

The network and storage cost calculated by the network and storage cost calculation module 42 comprises the network connection time $T_e$, the storage access time $T_a$, and the network transmission time $T_x$. In a PFS, a request is divided into a set of sub-requests, and each sub-request is forwarded to a separate storage server for parallel execution. The cost of a request in the network and on the storage servers is therefore determined by the maximum cost over all of its sub-requests. Assuming that all servers of the same type (HServer or SServer) have the same network and storage configuration, the network transmission time $T_x$ can be determined from the data sizes ($s_m$ and $s_n$) and the network transmission time $t$ of a unit of data. The specific formula is:

[formula missing in source]

where $s_m$ and $s_n$ represent the maximum sub-request size on an HServer and on an SServer, respectively.

Like the network transmission time $T_x$, the storage access time $T_a$ is determined by the sub-requests. The specific formula is:

[formula missing in source]

In the above formula, $s_m$ and $s_n$ represent the maximum sub-request size on an HServer and on an SServer, respectively, and $t_h$ and $t_s$ represent the unit data transmission time on an HServer and on an SServer, respectively.

Unlike the storage access time $T_a$ and the network transmission time $T_x$, the network connection time $T_e$ is a constant that is independent of the data size. In summary, the network and storage cost $T_2$ calculated by the network and storage cost calculation module 42 can be expressed by the formula:

[formula missing in source]

The network and storage cost $T_2$ can be further expressed as:

$T_2 \approx T_e + \max\{h(t_h + t),\ s(t_s + t)\}$

In the above formula, $h$ represents the stripe size on HServer, and $s$ represents the stripe size on SServer.

It can be seen from the cost calculation module 4 that the total cost $T$ of a request can be expressed as $T = T_1 + T_2$. The total cost of a request is a function of parameters describing the application, the file system, and the data layout. It therefore reflects server heterogeneity and is determined by the server stripe sizes $h$ and $s$.

It should also be noted that reads and writes on SServers differ considerably, and a write request involves more operations than a read request. For write requests, the time $T_s$ of the two context switches between the FUSE module 10 and the daemon module 20 needs to include the time for write amplification, garbage collection, and wear leveling.

To facilitate the explanation of the working principle of the cost calculation module 4, the parameters of the cost analysis model used in the cost calculation module 4 are presented in tabular form, as shown in Table 1.

Table 1 Parameters in the cost analysis model

3. Area division module 5
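The read-request cost model described in this section can be written out directly from the stated formulas. The sketch below is only an illustration: the function and argument names are hypothetical, the queue waiting time is taken as zero as assumed in the text, and the approximate forms of $T_c$ and $T_2$ (with sub-request sizes bounded by the stripe sizes h and s) are used.

```python
def context_switch_time(mu):
    """T_s = 2 * mu: two context switches between FUSE and the daemon."""
    return 2 * mu


def copy_time(m, n, h, s, t_c):
    """T_c ~ 3 * (m*h + n*s) * t_c: three copies of the requested data."""
    return 3 * (m * h + n * s) * t_c


def network_storage_time(h, s, t, t_h, t_s, T_e):
    """T_2 ~ T_e + max(h*(t_h + t), s*(t_s + t)).

    The slower of the two server types determines when the request ends.
    """
    return T_e + max(h * (t_h + t), s * (t_s + t))


def total_cost(m, n, h, s, mu, t_c, t, t_h, t_s, T_e):
    """T = T_1 + T_2 with T_1 = T_s + T_c (queue waiting time taken as 0)."""
    T_1 = context_switch_time(mu) + copy_time(m, n, h, s, t_c)
    T_2 = network_storage_time(h, s, t, t_h, t_s, T_e)
    return T_1 + T_2
```

For instance, with the made-up values m = 2, n = 1, h = 4, s = 8, μ = 1, t_c = 0.5, t = 1, t_h = 3, t_s = 1 and T_e = 2, the three terms are 2, 24 and 18, giving a total of 44. In that example h(t_h + t) = s(t_s + t) = 16, so both server types finish at the same time.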

Guided by the cost model generated by the cost calculation module 4, the area division module 5 divides a file into different regions so as to minimize the total cost of a given set of accesses that characterizes a parallel application. An existing region division scheme is HARL, which handles region division and stripe size determination in two separate stages. Unlike HARL, the layout strategy of RLFS is integrated: it treats region division and stripe size determination as a single problem, so RLFS determines both at once. RLFS does not scan the trace file heuristically to find logical regions as HARL does; instead, it considers logical regions and physical blocks together with the goal of minimizing the total cost. This is natural because the smallest unit of file access is a block, for example 64 MB or 128 MB, and a logical region can span a sequence of adjacent physical blocks.

A first algorithm can be executed in the area division module 5. The first algorithm is an offline, relatively fast algorithm that can be repeated periodically to adapt to the dynamic characteristics of the accesses. "Relatively fast" means that the algorithm runs in pseudo-polynomial time. In essence, the algorithm first represents the shared file as a sequence of blocks, then partitions the file into blocks according to the given access requests, and finally uses dynamic programming to find the optimal region division among these partitions.

The file is divided according to the I/O events given by the access pattern, such as the start or end of an I/O operation. For example, a file F of size L, measured in segments (L = 12 segments), is divided so that sequences of adjacent segments are merged into regions, with each region separated by a red vertical dotted line. Within each region the data is striped across the HServers and SServers, and a logical I/O request can be served by one or more physical requests for the requested data. With this layout optimization, the total access cost under the defined cost model is minimized compared with traditional strategies.
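One way to picture the I/O events is as the segment indices at which some request starts or ends; these are the only places where a region boundary needs to be considered. The sketch below illustrates that idea under stated assumptions: requests are given as hypothetical `(start, end)` pairs of segment indices with an exclusive end, and the function name `event_points` does not come from the original text.

```python
def event_points(requests, num_segments):
    """Return the sorted candidate region boundaries, as segment indices.

    `requests` is an iterable of (start, end) segment index pairs with an
    exclusive end; the file itself spans segments 0 .. num_segments.
    """
    points = {0, num_segments}
    for start, end in requests:
        # Clamp to the file so that out-of-range requests add no boundary.
        points.add(max(0, start))
        points.add(min(num_segments, end))
    return sorted(points)
```

For a 12-segment file with requests covering segments 0 to 4, 2 to 8 and 8 to 12, this yields the boundaries 0, 2, 4, 8 and 12, so the 12 segments collapse into four candidate pieces that a later step can merge into regions.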

Let [symbol missing in source] denote the minimum cost when a file with $l$ request events, starting at index $i$, is divided into $k$ regions. It is calculated for $0 \le i < l$ using the following recursion:

[formula missing in source]

where [symbol missing in source] defines a region of size [symbol missing in source], which can be expressed as [formula missing in source]. The data of this region is striped on the HServers and the SServers with stripe sizes $h_i$ and $s_i$, respectively.

From the recursion, the minimum cost of dividing the $l$ events into $k$ regions starting from event $i$ is obtained as follows. As the size $f$ of the first region varies from 1 to $l - i$, the cost of the first region of size $f$ is added to the minimum cost of the remaining part, and the minimum sum is taken. When the number of segments is insufficient to support the remaining $k$-region division, the cost is set to [value missing in source]; otherwise it is set to [value missing in source].

Given the definition of a region, the requests are mapped to regional (sub)requests, and the cost of the region is then calculated as follows:

[formula missing in source]

The term $T(r, h_i, s_i)$ in this equation is defined in the cost model. Assuming $s_i = \alpha h_i$, there is:

[formula missing in source]

Here, $\alpha \ge 1$ is the expansion factor of SServer relative to HServer, and $B$ represents the block size in the configuration.

Using the above four equations, the optimal region division of the file layout can be obtained, minimizing the cost of the given requests. The file is then partitioned and placed on the underlying heterogeneous servers, where each region has its own stripe size. After that, each request R can be served by reading the corresponding regions according to their stripe sizes.
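The dynamic-programming step described above can be illustrated with a minimal Python sketch. It is not the patented algorithm itself: the names `partition_regions`, `region_cost`, `LOADS` and `toy_cost` are hypothetical, and the per-region cost (which in RLFS would come from the cost model and would already include the choice of stripe size) is passed in as an opaque callback. The sketch only shows the recursion in which the cost of a first region of m segments is added to the minimum cost of dividing the rest into k − 1 regions, with an infinite cost when too few segments remain.

```python
import math
from functools import lru_cache
from typing import Callable, List, Tuple


def partition_regions(
    num_segments: int,
    num_regions: int,
    region_cost: Callable[[int, int], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split segments [0, num_segments) into exactly num_regions contiguous
    regions so that the sum of region_cost(start, end) is minimal."""

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> Tuple[float, Tuple[Tuple[int, int], ...]]:
        remaining = num_segments - i
        if k == 0:
            # No regions left: feasible only if no segments are left either.
            return (0.0, ()) if remaining == 0 else (math.inf, ())
        if remaining < k:
            # Too few segments for the remaining k-region division.
            return (math.inf, ())
        best_cost, best_plan = math.inf, ()
        # m is the number of segments in the first region; leave at least
        # one segment for each of the other k - 1 regions.
        for m in range(1, remaining - (k - 1) + 1):
            rest_cost, rest_plan = best(i + m, k - 1)
            total = region_cost(i, i + m) + rest_cost
            if total < best_cost:
                best_cost, best_plan = total, ((i, i + m),) + rest_plan
        return best_cost, best_plan

    cost, plan = best(0, num_regions)
    return cost, list(plan)


# Toy input, made up for illustration: one request size per segment.
LOADS = [1, 1, 1, 8, 8, 8]


def toy_cost(start: int, end: int) -> float:
    """Hypothetical cost: regions whose segments look alike are cheap."""
    window = LOADS[start:end]
    return (max(window) - min(window)) * len(window) + 1.0


cost, plan = partition_regions(len(LOADS), 2, toy_cost)
```

With this toy cost the two homogeneous halves of the file end up in separate regions. A real `region_cost` would evaluate the cost model over the (sub)requests that fall into the region.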

FIG. 5 is an application example diagram of a file system in an embodiment of the present invention. As shown in FIG. 5, the file client 1 issues requests on behalf of the application program on the computing server 301, the hybrid storage system 100 is responsible for storing and managing the striped regions, and the metadata server 200 (MDS) holds the descriptions of the files stored in RLFS. During file operations, the client 1 first contacts the MDS to obtain the file metadata, and then uses it to perform data access with the hybrid storage system 100 through the RLFS daemon.

RLFS logically maps a large file to multiple small (region) files, each of which represents a file region with a similar I/O workload. The region files are further striped across all HServers and SServers, and each stripe is stored as a separate data file on its storage server. To this end, the MDS maintains a region stripe table (RST) for each physical file in RLFS, as shown in Table 2 below, in which each region of the file is recorded by its offset and its stripe size on each server. The RST is created by the area division module 5 when the file is written to RLFS, and it is updated when the access pattern changes. To improve efficiency, the RST of files to be read can be cached and parsed in the same directory as the application when RLFS is mounted and unmounted.

Table 2. Region stripe table (RST) data structure
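As a rough illustration of how such a table could be represented and queried, the following Python sketch stores one entry per region (offset plus stripe sizes) and finds the region that contains a given file offset. The class and field names (`RegionEntry`, `RegionStripeTable`, `h_stripe`, `s_stripe`) and the example values are assumptions made for this sketch; the text only states that each region is recorded by its offset and its stripe size on each server.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegionEntry:
    region_id: int
    offset: int    # starting byte offset of the region within the file
    h_stripe: int  # stripe size on each HServer, in bytes
    s_stripe: int  # stripe size on each SServer, in bytes


class RegionStripeTable:
    """Hypothetical in-memory RST for one physical file."""

    def __init__(self, entries: List[RegionEntry]):
        # Keep entries sorted by offset so lookups can use binary search.
        self.entries = sorted(entries, key=lambda e: e.offset)
        self._offsets = [e.offset for e in self.entries]

    def lookup(self, file_offset: int) -> RegionEntry:
        """Return the entry of the region that contains file_offset."""
        idx = bisect_right(self._offsets, file_offset) - 1
        if idx < 0:
            raise ValueError("offset precedes the first region")
        return self.entries[idx]


KB = 1024
MB = 1024 * KB

# Arbitrary example values, not taken from the experiments.
rst = RegionStripeTable([
    RegionEntry(region_id=0, offset=0, h_stripe=64 * KB, s_stripe=128 * KB),
    RegionEntry(region_id=1, offset=1 * MB, h_stripe=32 * KB, s_stripe=96 * KB),
])
entry = rst.lookup(1 * MB + 10)
```

Keeping the entries sorted by offset makes a lookup logarithmic in the number of regions, which stays small because the region division yields only a limited number of regions.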

For the cost calculation module 4, a file server in the parallel file system is used to measure the context switching time, the unit data copy time, and the unit data transfer time of HServer and SServer in read and write modes. These parameters can vary with the I/O mode. In addition, a pair of nodes, a client node and a file server, is used to estimate the network transmission time. The network transmission time test can be repeated thousands of times, and the average of the measurements is then used as the parameter value in the generated cost model.
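The way these measured parameters feed the cost model can be sketched in Python using the formulas stated in the claims, T = T_s + T_c + T_2 with T_c ≈ 3(mh + ns)t_c and T_2 ≈ T_e + max{h(t_h + t), s(t_s + t)}. The names `CostParams`, `request_cost` and `average_sample`, the field names, and the numbers in the example are assumptions for illustration only, not measured values.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class CostParams:
    context_switch_time: float  # T_s: two FUSE/daemon context switches
    t_copy: float               # t_c: unit data copy time
    t_net: float                # t: unit data network transmission time
    t_h: float                  # unit data transfer time on an HServer
    t_s: float                  # unit data transfer time on an SServer
    t_connect: float            # T_e: network connection time
    num_hservers: int           # m
    num_sservers: int           # n


def average_sample(samples: Sequence[float]) -> float:
    """Average repeated measurements to obtain one parameter value."""
    return fmean(samples)


def request_cost(p: CostParams, h: float, s: float) -> float:
    """Total cost T = T_s + T_c + T_2 for stripe sizes h and s."""
    copy_time = 3 * (p.num_hservers * h + p.num_sservers * s) * p.t_copy
    net_and_storage = p.t_connect + max(
        h * (p.t_h + p.t_net), s * (p.t_s + p.t_net)
    )
    return p.context_switch_time + copy_time + net_and_storage


# Made-up parameter values in arbitrary time units.
params = CostParams(
    context_switch_time=2.0,
    t_copy=1.0,
    t_net=average_sample([0.5, 1.0, 1.5]),
    t_h=3.0,
    t_s=1.0,
    t_connect=5.0,
    num_hservers=2,
    num_sservers=1,
)
total = request_cost(params, h=1.0, s=2.0)
```

In this example the HServer and SServer terms inside the max are equal, which is the balanced situation that a larger stripe on the faster SServer is meant to achieve.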

To produce the optimal data layout for a specific file, RLFS first uses its area division module 5 to calculate the optimal region division of the file, and then uses the cost model and the I/O trace data to determine the stripe size of each region. The optimal region information is then used to write the file to all servers at the same time, and the area division module 5 creates an RST in the MDS for subsequent reads of the file. The MDS holds the RLFS namespace, the RST, and other information about each file. Because the region division algorithm yields only a limited number of regions, the amount of data held by the MDS is well bounded and remains small.

In addition, to facilitate parallel reading of each file, the hybrid storage system 100 maintains a flat namespace in which each stripe file is identified on the local disk by "filename_region#_stripe#". Note that "filename" can contain path information specified by the application. A background I/O daemon receives incoming requests from the client 1, each identified by "filename", "region#" and "stripe#", and serves a request by sending back the requested stripe file. Stripe files are then combined with other stripe files to meet the needs of the application.
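A minimal Python sketch of this naming scheme is given below. The function names are hypothetical, and it is an assumption that "region#" and "stripe#" are encoded as plain decimal integers. Because "filename" may itself contain underscores or path separators, the parser splits from the right.

```python
from typing import Tuple


def stripe_file_name(filename: str, region: int, stripe: int) -> str:
    """Build the flat-namespace name "filename_region#_stripe#"."""
    return f"{filename}_{region}_{stripe}"


def parse_stripe_file_name(name: str) -> Tuple[str, int, int]:
    """Recover (filename, region, stripe) from a flat-namespace name.

    Splitting from the right keeps underscores inside "filename" intact.
    """
    filename, region, stripe = name.rsplit("_", 2)
    return filename, int(region), int(stripe)


name = stripe_file_name("data/run_01.h5", region=3, stripe=7)
parsed = parse_stripe_file_name(name)
```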

The file system (RLFS) proposed by the present invention supports region-level data layout by dividing a file into a set of optimal regions, so that the file system can determine the optimal regions and their stripe sizes. This file system can therefore optimize the data layout of the hybrid storage system 100. Compared with an in-kernel approach, using the FUSE module 10 not only greatly simplifies development but also allows RLFS to be accessed through the standard file system interface, so that applications can use RLFS transparently. In addition, the variable stripe sizes of RLFS can alleviate load imbalance and adapt flexibly to workload changes and server heterogeneity, thereby significantly improving I/O system performance.

The file system (RLFS) proposed in the embodiment of the present invention has been experimentally verified, determined to be feasible, and has excellent performance. Experimental results show that RLFS can work well with the hybrid storage system 100, and RLFS greatly improves parallel I/O performance.

In the experiment, three data layout schemes were compared: scheme one uses fixed-size stripes, scheme two uses randomly selected stripes, and scheme three uses RLFS. For reading and writing, RLFS uses the optimal data layouts {32KB, 160KB} and {36KB, 148KB}, respectively, which improve I/O performance by 73.4% and 176.7%, respectively, compared with the default layout with 64KB fixed-size stripes. Compared with other layouts with different but fixed-size stripes, RLFS improves read performance by up to 138.6% and write performance by up to 177.6%. Compared with the randomly selected stripe strategy, RLFS improves read performance by up to 154.5% and write performance by up to 215.4%.

Experimental results based on representative benchmarks show that RLFS is a promising and feasible solution in a hybrid parallel file system, improving parallel I/O performance by 20.6% to 556.1% for reads and by 22.7% to 288.7% for writes.

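The invention also provides a data layout method, which includes:

Step S1: Collect the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a trace file; direct the file system configuration file to building the cost model, and direct the I/O information to region division.

Step S2: Calculate or estimate the access cost of file requests to form a cost model.

Step S3: Generate a region distribution with a minimum total cost according to the cost model, and divide the file into different regions.

Step S4: Obtain the stripe size corresponding to each region.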

The beneficial effect of the above method is that it supports region-level data layout by dividing the file into a set of optimal regions, thereby determining the optimal regions and their stripe sizes. This method optimizes the data layout of the hybrid storage system.

The above is only the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

A file system, characterized in that the file system includes an I/O tracer, a cost calculation module and an area division module electrically connected to each other; the I/O tracer is used to provide the area division module with I/O information collected while the file system is running; the I/O tracer is also used to provide the cost calculation module with the configuration file of the file system that it has collected; the cost calculation module is used to calculate or estimate the access cost of file requests in the file system and to output a cost model to the area division module; the area division module is used to generate a region distribution with a minimum total cost according to the cost model and to divide the file into different regions; the area division module is also used to obtain the stripe size corresponding to each region.
The file system according to claim 1, wherein the file system further comprises a kernel part, the kernel part being used to exchange information or data between a metadata server, a hybrid storage system, and a client; the kernel part includes a FUSE module.
The file system according to claim 2, wherein the file system further comprises a daemon module, the daemon module being used to execute the daemon process in the background; the FUSE module is used as an agent of the daemon process.
The file system according to claim 3, wherein the file system further comprises an update data layout module; the update data layout module is connected to the daemon module, the I/O tracer, the area division module, and the hybrid storage system, and is used to dynamically detect and update region changes.
The file system according to claim 4, wherein the cost calculation module is used to calculate the total cost of a request, the total cost being calculated as $T = T_s + T_c + T_2$, where $T_s$ represents the time for the two context switches between the FUSE module and the daemon module, $T_c$ represents the replication time, and $T_2$ represents the network and storage cost.
The file system according to claim 5, wherein the hybrid storage system includes a server SServer based on a solid state drive and a server HServer based on a hard disk drive;

The replication time is calculated as $T_c(r, h, s) \approx 3(mh + ns)\,t_c$, where $t_c$ represents the unit data replication time from kernel space to user space, $h$ represents the stripe size on HServer, $s$ represents the stripe size on SServer, $m$ represents the number of HServers, and $n$ represents the number of SServers;

The network and storage cost is calculated as $T_2 \approx T_e + \max\{h(t_h + t),\ s(t_s + t)\}$, where $t$ represents the unit data network transmission time, $t_h$ and $t_s$ represent the unit data transfer times on HServer and SServer, respectively, and $T_e$ represents the network connection time.
The file system according to claim 3, 5 or 6, wherein the area division module is used to obtain the minimum cost of dividing l events into k regions starting from event i, the minimum cost being calculated as:

[formula]

where, in the formula, a region of size f is defined, and the cost of the first region of size f is represented.
The file system according to claim 7, wherein the region is striped across the servers based on hard disk drives and the servers based on solid state drives with stripe sizes $h_i$ and $s_i$, respectively; $s_i$ is calculated as $s_i = \alpha h_i$, and $h_i$ is calculated as:

[formula]

In the formula, α ≥ 1 is the expansion factor of SServer relative to HServer, and B represents the block size in the configuration.
A data layout method, characterized in that it includes:

Step S1: Collect the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a trace file; direct the file system configuration file to building the cost model, and direct the I/O information to region division;

Step S2: Calculate or estimate the access cost of file requests to form a cost model;

Step S3: Generate a region distribution with a minimum total cost according to the cost model, and divide the file into different regions;

Step S4: Obtain the stripe size corresponding to each region.