CN109840247B - File system and data layout method - Google Patents

File system and data layout method

Info

Publication number
CN109840247B
CN109840247B (Application CN201811547400.9A)
Authority
CN
China
Prior art keywords
module
file system
cost
file
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811547400.9A
Other languages
Chinese (zh)
Other versions
CN109840247A (en)
Inventor
王洋
夏明辉
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201811547400.9A priority Critical patent/CN109840247B/en
Publication of CN109840247A publication Critical patent/CN109840247A/en
Priority to PCT/CN2019/121301 priority patent/WO2020125362A1/en
Application granted granted Critical
Publication of CN109840247B publication Critical patent/CN109840247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/13 - File access structures, e.g. distributed indices
    • G06F16/18 - File system types

Abstract

The invention provides a file system, which comprises a cost calculation module and an area division module, wherein the cost calculation module is used for calculating or estimating the access cost of a file request in the file system, and the cost calculation module can output a cost model to the area division module; the region dividing module is used for dividing the file into different regions so as to minimize the total cost of given access; the region dividing module is further configured to obtain a stripe size corresponding to the region. The invention also provides a data layout method. Compared with the prior art, the invention has the beneficial effects that: the file system supports the data layout of the region level by dividing the files into a group of optimal regions, and the optimal regions and the corresponding stripe sizes can be determined through the file system, so that the file system can optimize the data layout of the hybrid storage system, further flexibly adapt to the change of the working load and the heterogeneity of the storage system, and can remarkably accelerate the performance of the I/O system.

Description

File system and data layout method
Technical Field
The invention belongs to the technical field of data layout, and particularly relates to a file system and a data layout method.
Background
As large-scale data-intensive applications continue to increase across application domains, I/O (input/output) performance is becoming a bottleneck in storage systems. To address this problem, those skilled in the art have successively introduced Parallel File Systems (PFS) into high-performance storage systems. Such parallel file systems include OrangeFS, Lustre, GPFS, PanFS, PLFS, and the like; each is briefly introduced below:
1. OrangeFS is a branch of the Parallel Virtual File System (PVFS); like PVFS, it is a parallel file system designed for high-performance computing and high-performance data access. Compared with the conventional PVFS, OrangeFS aims to improve the performance of small-file processing, increase server fault tolerance, and provide secure access control.
2. Lustre is a parallel file system for Linux clusters developed by HP, Intel, and Cluster File Systems, Inc. together with the United States Department of Energy; it adopts a distributed lock management mechanism for concurrency control and manages the communication links of metadata and file data separately.
3. GPFS is an abbreviation for General Parallel File System. The GPFS from IBM corporation is an extensible, high-performance, shared disk-based, general-purpose parallel file system that provides parallel, high-speed, secure, reliable data access for all nodes in a storage system.
4. PanFS is a general-purpose parallel file system developed by Panasas; its main application field is similar to that of Lustre, and it is scalable and provides strong consistency through distributed locks.
5. PLFS is an open source parallel checkpoint storage file system.
In summary, these parallel file systems can distribute data files across multiple servers; a Parallel File System (PFS) therefore allows multiple tasks of a parallel application to access data files concurrently with aggregated I/O bandwidth.
However, existing Parallel File Systems (PFS) are not without defects; one defect is that they are poorly matched to hybrid storage systems based on new storage technologies. Before describing this mismatch, the situation of such hybrid storage systems should first be clarified. With the development of new storage technologies, flash-based Solid State Drives (SSD) are increasingly widely used. Compared with Hard Disk Drives (HDD), solid state drives offer higher storage efficiency and faster response, but at higher cost. A reasonable storage system is therefore not suited to being composed entirely of hard disk drives, whose read/write and response speeds are slow, nor entirely of costly solid state drives; in other words, solid state drives in a large cluster do not completely replace hard disk drives. Using a hybrid storage system that includes both solid-state-drive-based servers and hard-disk-drive-based servers is thus a preferred strategy, and it is especially practical for HPC (High Performance Computing) systems with limited cost budgets.
On the other hand, the efficiency of a Parallel File System (PFS) depends on an efficient data file layout, i.e., how data files are distributed over the available nodes. Most existing layout schemes use fixed-size stripes to split data files over multiple servers and to provide concurrent data access from those servers, which also yields uniform data placement on each server. Although the existing layout scheme is simple to implement and widely used, it is clearly suited to storage systems built from homogeneous servers and not to hybrid storage systems.
When an existing parallel file system is applied to a hybrid storage system, the performance gap between the solid-state-drive-based servers and the hard-disk-drive-based servers can significantly reduce the performance of the parallel file system: the solid-state-drive-based server always has higher performance than the hard-disk-drive-based server and thus needs less I/O time to complete the same amount of data access, so applying an existing layout scheme causes a severe load imbalance between the heterogeneous servers. In addition, the efficiency of the I/O system can be further compromised by complex I/O workloads.
Disclosure of Invention
In view of this, to solve the problem of unreasonable data distribution that arises when an existing Parallel File System (PFS) is matched with a hybrid storage system based on new storage technology, the present invention provides a file system including an I/O tracer, a cost calculation module, and a region division module that are electrically connected to each other. The I/O tracer is configured to provide, to the region division module, the I/O information it collects while the file system operates, and to provide, to the cost calculation module, the configuration file of the file system it collects. The cost calculation module is used for calculating or predicting the access cost of file requests in the file system so as to output a cost model to the region division module. The region division module is used for generating a distribution of regions that minimizes the total cost according to the cost model and dividing files into different regions; the region division module is also used for obtaining the stripe size corresponding to each region.
Preferably, the file system further comprises a kernel part, wherein the kernel part is used for executing the interaction of information or data among the metadata server, the hybrid storage system and the client side; the kernel portion includes a FUSE module.
Preferably, the file system further comprises a daemon module, wherein the daemon module is used for executing daemon in the background; the FUSE module is used as a proxy of the daemon.
Preferably, the file system further includes an update data layout module, the update data layout module is respectively connected to the daemon process module, the I/O tracer, the area division module, and the hybrid storage system, and the update data layout module is configured to dynamically detect and update an area change.
Preferably, the cost calculation module is configured to calculate the total cost of a request, and the total cost formula is: T = T_s + T_c + T_2, where T_s represents the time of the two context switches between the FUSE module and the daemon module, T_c denotes the copy time, and T_2 represents the network and storage cost.
Preferably, the hybrid storage system comprises a solid state drive-based server SServer and a hard disk drive-based server HServer;
the calculation formula of the replication time is as follows: t isc(r,h,s)≈3(mh+ns)tcIn the formula tcRepresenting the unit data copying time from kernel space to user space, h representing the size of a strip on HServer, s representing the size of a strip on SServer, m representing the number of HServers, and n representing the number of SServers;
the network and storage becomeThe calculation formula is as follows: t is2≈Te+max{h(th+t),s(ts+ t) }, where t denotes the data transmission network time, thAnd tsRespectively representing the cell data transmission time on HServer and the cell data transmission time on SServer, TeRepresenting the network connection time.
Preferably, the region division module is used to obtain the minimum cost C_{i,k} of dividing the first i events into k regions. The minimum cost C_{i,k} satisfies the recurrence:

C_{i,k} = min_{k-1 ≤ j < i} ( C_{j,k-1} + W_{j+1,i} ),

where W_{j+1,i} denotes the cost of the single region covering events j+1 through i; in particular, W_{1,f} denotes the cost of the first region of size f.
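The region-division step described above has the shape of a standard interval-partition dynamic program. The sketch below is one possible rendering under that assumption; the region-cost function (standing in for the cost model, since the exact formula images in the source are not legible) is supplied by the caller, and the 1-indexed event numbering is illustrative.

```python
import math

def min_partition_cost(n, k, region_cost):
    """Minimum total cost of splitting events 1..n into k contiguous regions.

    region_cost(j, i) returns the cost of one region covering events j..i
    (1-indexed, inclusive). Returns (min_cost, region_boundaries).
    """
    INF = math.inf
    # C[i][q] = minimum cost of covering the first i events with q regions
    C = [[INF] * (k + 1) for _ in range(n + 1)]
    C[0][0] = 0.0
    choice = [[0] * (k + 1) for _ in range(n + 1)]
    for q in range(1, k + 1):
        for i in range(q, n + 1):
            for j in range(q - 1, i):          # last region covers j+1..i
                cand = C[j][q - 1] + region_cost(j + 1, i)
                if cand < C[i][q]:
                    C[i][q] = cand
                    choice[i][q] = j
    # recover the region boundaries from the recorded choices
    cuts, i = [], n
    for q in range(k, 0, -1):
        cuts.append((choice[i][q] + 1, i))
        i = choice[i][q]
    return C[n][k], cuts[::-1]
```

For example, with a quadratic per-region cost and four events split into two regions, the program picks the balanced 2+2 split.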
Preferably, the solid-state-drive-based servers and the hard-disk-drive-based servers stripe each region i and obtain the stripe sizes h_i and s_i respectively. The calculation formula of s_i is s_i = α·h_i, and h_i is calculated as:

h_i ≈ f_i / (m + nα), rounded to a multiple of the block size B,

where α ≥ 1 is the SServer spreading factor relative to HServer, f_i denotes the size of region i, and B denotes the block size in the configuration.
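One plausible reading of the stripe-size step is sketched below; the exact formula is an image in the source, so the rounding rule (down to a multiple of the block size B) and the use of the region size are assumptions.

```python
def stripe_sizes(region_size, m, n, alpha, B):
    """Split one region of region_size bytes across m HServers and n SServers.

    Assumes s = alpha * h, so region_size ≈ (m + n*alpha) * h per full stripe
    round; h is rounded down to a multiple of the block size B.
    """
    h = (region_size // ((m + n * alpha) * B)) * B   # HServer stripe size
    s = alpha * h                                    # SServer stripe size
    return h, s
```

With α = 2, four HServers and two SServers, a 96 KiB region yields h = 12 KiB and s = 24 KiB, so one stripe round covers the region exactly.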
The invention also provides a data layout method, which comprises the following steps:
step S1, collecting the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a tracking file, wherein the file system configuration file is directed to establishing the cost model and the I/O information is used for region division;
step S2, calculating or estimating the access cost of the file request to form a cost model;
step S3, generating a distribution area with minimized total cost according to the cost model, and dividing the files into different areas;
step S4, obtaining the size of the strip corresponding to the area.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
the file system provided by the invention supports the data layout of the region level by dividing the file into a group of optimal regions, and the optimal regions and the stripe sizes thereof can be determined by the file system, so that the data layout of the hybrid storage system 100 can be optimized by the file system. The file system can flexibly adapt to the change of the working load and the server heterogeneity, thereby obviously accelerating the performance of the I/O system.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a diagram of a prior art data layout scheme using fixed-size stripes;
FIG. 2 is a schematic diagram of a data layout scheme based on region partitioning;
FIG. 3 is a schematic diagram of a file system based on a region data layout;
FIG. 4 is a schematic diagram of the operation of the cost calculation module according to an embodiment of the present invention;
FIG. 5 is a diagram of an exemplary application of a file system in an embodiment of the present invention.
Detailed Description
The above and further features and advantages of the present invention are described in more detail below with reference to the accompanying drawings.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two unless specifically defined otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Examples
In general, most PFSs employ three typical data layouts: 1-DH, 1-DV and 2-D. A 1-DH layout means that one client process can access data from all storage servers. In contrast, a 1-DV layout means that a client process can only access data from a single storage server. The 2-D layout lies between the two: a client process accesses data from a subset of all storage servers.
First, the differences between the data layout of most PFSs and the data layout of the region-level file system are compared, taking as an example two concurrent reads accessing one file of size 9x. The two concurrent reads are defined as a first read 71 and a second read 72, and proceed along a time axis 73.
When the data layout approach of a conventional file system is used, as shown in fig. 1, the file is partitioned evenly and stored on each storage server with a stripe size of 3x. Since the three servers complete at the same time, each request finishes in 3x time, so the two read requests together take 6x time. This data layout ignores the differences between storage media in the hybrid storage system, so the read/write efficiency of the higher-performance media cannot be fully exploited; in effect, the higher-performance media are forcibly degraded in use.
As shown in fig. 2, the data layout scheme of the region-level file system has obvious advantages. The region-level file system RLFS divides the file 7 into two regions, namely a first region 73 and a second region 74, and each region is striped across all servers with its own stripe size (x or 2x). The second read 72 is divided into two parts, and its total time still corresponds to reading 3x (3x = x + 2x), the same as under the fixed-stripe layout; however, the time for the first read 71 under the region-level file system is reduced to the time required for reading 2x.
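The timings in this comparison can be checked with simple arithmetic:

```python
# Fixed-stripe layout: a file of size 9x striped at 3x over three servers;
# each of the two reads finishes in 3x, one after the other -> 6x in total.
x = 1.0
fixed_total = 3 * x + 3 * x

# Region-level layout: the first region is striped at x, the second at 2x.
# The first read completes in 2x; the second read spans both regions,
# taking x + 2x = 3x.
region_first = 2 * x
region_second = x + 2 * x
```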
From this example, it can be seen that the region scheme in RLFS is a finer-grained, more adaptive data layout scheme than the conventional data layout, with different stripe sizes corresponding to all storage servers. Thus, the region scheme in RLFS can be seen as a variation of the 1-DH layout scheme, where RLFS can aggregate the bandwidth of all storage servers to maximize I/O performance. RLFS closely matches the hybrid storage system 100.
RLFS is intended to support region-based data layout by using file stripes of different sizes. To accommodate both the hybrid storage system 100 and complex I/O workloads, RLFS employs a partitioning approach to achieve optimal data placement. A cost model is generated in RLFS; according to the cost model, RLFS divides a large file into a group of regions, and each region independently stores its own stripe size. The optimal regions and their stripe sizes are those that minimize the total cost of all I/O requests of the application.
The storage system in this embodiment is a hybrid storage system 100, and the hybrid storage system 100 includes a server 102 based on a solid state drive and a server 101 based on a hard disk drive, where the server based on the solid state drive is referred to as SServer for short, and the server based on the hard disk drive is referred to as HServer.
The embodiment of the invention provides a file system, called a Region Level File System (RLFS). The file system can support region-level data layout and solve the data distribution problem in existing parallel file systems. RLFS relies on a defined cost model and a per-region heterogeneity-aware scheme to determine the optimal file stripe size for each server, and further leverages changing access patterns to adjust the region scheme at runtime.
More specifically, a cost model is first developed for RLFS to estimate the completion time of region accesses; the file is then divided into fine-grained regions using dynamic programming, and the optimal file stripe size selected for each region is allocated for the HDD and SSD servers. RLFS is aware of both the storage system and the application; in essence, it changes the stripe-size layout from the traditional one-dimensional fixed stripe size to a two-dimensional scheme, and can adapt well to changes in server performance and application behavior. In addition, RLFS updates the generated data layout according to detected changes in the access pattern, addressing the static-layout problem and making the generated layout better suited to file access at runtime.
As shown in fig. 3, an embodiment of the present invention provides a file system, called a Region Level File System, referred to as RLFS for short. The file system can support region-level data layout and solve the data distribution problem in existing parallel file systems.
The RLFS includes a kernel portion, which preferably includes the FUSE module 10, and a user-level daemon module 20. In other words, the file system RLFS provided by the embodiment of the present invention is preferably designed on the FUSE framework; FUSE is an abbreviation for Filesystem in Userspace, a user-space file system framework. The kernel portion is preferably a Linux kernel module and further comprises a VFS module 11, where VFS is an abbreviation for Virtual File System. The VFS module 11 is used to register RLFS; a block device is created in the kernel portion to serve as the interface between the daemon module 20 and the kernel portion, and the FUSE module 10 serves as the proxy of the daemon module 20 for the various file system requests issued by applications.
The application from client 1 may access RLFS by mounting it into its namespace, after which all file system calls to the mount point are forwarded to the FUSE module 10 via the VFS module 11. The FUSE module 10 then relays the call instructions in the request queue through the block device to the daemon module 20, where the appropriate service handler is invoked to serve the file system call by contacting the metadata server 200 and/or other storage servers. The response travels back along the reverse path through the kernel portion and ultimately to the application, which typically waits after issuing a request until the response arrives. The daemon and storage servers of RLFS must implement all of the semantics of a PFS. For example, the read handler should first identify which storage servers hold the requested data segments and where on each server the corresponding segment resides, and then issue sub-requests to access these servers in parallel. The kernel portion also includes a file logging module 12 for logging operations with respect to the metadata server 200.
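As an illustration of the sub-request mapping such a read handler performs, a generic round-robin striping calculation might look as follows; this is a sketch of conventional fixed-stripe PFS placement, not RLFS's actual handler, whose details are not given here.

```python
def split_read(offset, size, stripe, num_servers):
    """Map a read request onto per-server sub-requests under round-robin striping.

    Returns a list of (server_index, server_local_offset, length) tuples.
    """
    subs = []
    pos, end = offset, offset + size
    while pos < end:
        stripe_idx, in_stripe = divmod(pos, stripe)
        server = stripe_idx % num_servers                    # which server holds this stripe
        local = (stripe_idx // num_servers) * stripe + in_stripe  # offset within that server's file
        length = min(stripe - in_stripe, end - pos)          # stay inside the current stripe
        subs.append((server, local, length))
        pos += length
    return subs
```

For instance, a 10-byte read at offset 0 with a 4-byte stripe over two servers splits into two full stripes and one partial stripe.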
In addition to the general semantics of a PFS, RLFS also needs to implement region-based data layout functions. To achieve this, RLFS is equipped with three user-level components through which it completes a three-phase data layout cycle: an I/O tracer 3, a cost calculation module 4, and a region division module 5. The data layout cycle begins with a tracing phase, in which the I/O tracer 3 collects runtime statistics of data accesses and file system summaries (e.g., FUSE queue information) for cost modeling into a trace file during application execution. The I/O tracer 3 then feeds the read/write traces to the region division module 5; in the subsequent analysis phase, the I/O tracer 3 directs the file system configuration file to the cost calculation module 4, and the region division module 5 uses the updated cost model to generate regions, each region assigning its own stripe size to the two server types. Finally, in the placement phase, the files are placed on the underlying hybrid storage system 100 at runtime according to the layout scheme obtained in the previous phase, optimizing subsequent I/O requests. Through these three phases, RLFS can greatly improve the I/O performance of the application in subsequent runs. RLFS further includes an update data layout module 8, connected respectively to the daemon module 20, the I/O tracer 3, the region division module 5, and the hybrid storage system 100, which is configured to dynamically detect region changes and update the data layout accordingly. The specific functions of the I/O tracer 3, the cost calculation module 4, and the region division module 5 are set forth below:
I/O tracer 3
The I/O tracer 3 is used in RLFS both to collect runtime I/O information and to collect file system configuration files. While some techniques and tools for I/O data collection exist in the prior art, such as IOSIG [42], such tools cannot be directly adapted for RLFS, given that the file system provided by embodiments of the present invention is designed on the FUSE framework; this is determined by the inherent properties of the FUSE framework. The I/O tracer 3 in this embodiment therefore follows the N-1 logging pattern, in which all RLFS daemons write to a single shared file. The I/O tracer 3 thus collects all information for I/O operations, including file access type, operation time, and other process-related data.
After running the corresponding application with the I/O tracer 3, the process ID, file descriptor, operation type, offset, request size, and timestamp information can be obtained. To facilitate further region division and guide optimal data placement, all I/O requests for a file are sorted in ascending order of their offsets.
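That ordering step can be sketched as follows; the record field names are illustrative stand-ins for the information the tracer collects.

```python
def order_trace(records):
    """Sort collected trace records by file offset, ascending."""
    return sorted(records, key=lambda r: r["offset"])

trace = [
    {"pid": 1, "op": "read", "offset": 8192, "size": 4096, "ts": 2.0},
    {"pid": 2, "op": "read", "offset": 0,    "size": 4096, "ts": 1.0},
    {"pid": 1, "op": "read", "offset": 4096, "size": 4096, "ts": 3.0},
]
ordered = order_trace(trace)
```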
Runtime I/O information is collected in a particular environment, and some parameters are needed to fully interpret the collected I/O information. To this end, in addition to the I/O information, the I/O tracer 3 should also be allowed to collect runtime configuration files of the file system, in particular for a file system under the FUSE framework. These configuration files are directed into the cost calculation module 4 to assist in updating the cost model, and the region division module 5 then determines the optimal region partition from the minimum total cost obtained by the cost calculation module 4.
Second, cost calculating module 4
The cost calculation module 4 is able to generate a cost model and the cost model aims at finding the minimum total cost.
In order to obtain the optimal area partition and the stripe size of each server in the storage system, the file system provided in the embodiment of the present invention needs to rely on the cost calculation module 4. In the cost calculation module 4, the cost is defined as the I/O completion time of each file request. The cost calculation module 4 is used for calculating the cost of the file request access in the file system. The file system is matched to the hybrid storage system.
Since the total cost of access for a file request is related to the file system itself and the underlying network and storage servers, the total cost of access for a file request includes the system cost and the network and storage costs of the file system. Therefore, the calculation basis of the cost calculation module 4 should include the system cost and the network and storage cost of the file system. Correspondingly, cost calculation module 4 includes a system cost calculation module 41 and a network and storage cost calculation module 42.
Since the file system proposed by the embodiment of the present invention is built on the FUSE framework, the system cost of the file system mainly refers to the time overhead in the FUSE data path. Since the main objective of RLFS is to optimize read requests through optimal placement of data files on the hybrid storage system 100, only the system cost for read requests is defined in this embodiment; the cost for write requests can be derived with the same parameters.
As shown in FIG. 4, for each read request the service time is divided into three sub-parts: the waiting time in the FUSE module 10, the time for the two context switches between the FUSE module 10 and the daemon module 20, and the time for the three copy operations.
Data flows from a network system containing m HServers and n SServers to the daemon module 20, then flows from the daemon module 20 to the FUSE module 10, and finally is sent to the client 1 by the FUSE module 10.
The time a read request waits in the queue of the FUSE module 10 is closely related to the application running between the client 1 and RLFS; it depends not only on the I/O request pattern issued by the application but also on other factors within the file system, such as page buffers or interrupts, and is therefore difficult to estimate accurately. However, since the daemon module 20 of RLFS minimizes queue delay through multithreading support, it can safely be assumed that the waiting time of read requests in the queue of the FUSE module 10 is negligible, i.e., T_q = 0.
Further, the context switch time is system-dependent and independent of the data size, so it can be treated as a constant. The time T_s for the two context switches between the FUSE module 10 and the daemon module 20 is therefore: T_s = 2μ, where μ is the time of a single context switch.
The time of the three copy operations collected by the first copy is abbreviated as the copy time TcTime of replication TcIn proportion to the data size r of the file request, the calculation formula is as follows:
Tc(r,h,s)=3rtc
where r is the data size of the file request, calculated as:
r = m·sm + n·sn
sm and sn respectively represent the largest sub-request size on HServer and the largest sub-request size on SServer, with sm ≤ h and sn ≤ s, where h denotes the stripe size on HServer and s denotes the stripe size on SServer. Therefore, the copy time Tc can further be expressed as:
Tc(r, h, s) ≈ 3(mh + ns)·tc
tc is the unit data copy time from kernel space to user space. Thus, the first portion of the total cost calculated by the system cost calculation module 41 is denoted T1, with T1 = Ts + Tc.
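As an illustration, the first cost component can be sketched in Python under the formulas above; the function name and the concrete parameter values are illustrative assumptions, not part of the embodiment.

```python
def system_cost(m, n, h, s, mu, tc):
    """First cost component T1 = Ts + Tc on the FUSE data path.

    Ts = 2*mu covers the two context switches between the FUSE module and
    the daemon; Tc ~= 3*(m*h + n*s)*tc covers the three kernel/user data
    copies, bounding each sub-request size by its server's stripe size.
    """
    ts = 2 * mu                          # Ts: two context switches
    tc_total = 3 * (m * h + n * s) * tc  # Tc(r, h, s) ~= 3*(m*h + n*s)*tc
    return ts + tc_total

# Illustrative values (4 HServers, 2 SServers; times in arbitrary units):
t1 = system_cost(m=4, n=2, h=64, s=128, mu=5.0, tc=0.01)
```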
The network and storage costs calculated by the network and storage cost calculation module 42 include the network connection time Te, the storage access time Ta, and the network transmission time Tx. In PFS, a request is divided into a set of subtasks, each of which is forwarded to a separate storage server for parallel execution. Therefore, the request cost in the network and storage servers is determined by the maximum cost over all sub-requests. Given that each type of server (HServer or SServer) has the same configuration for network and storage, the network transmission time Tx can be determined from the data sizes (sm and sn) and the unit data transmission network time t, with the concrete formula:
Tx = max{sm·t, sn·t}
In the above formula, sm and sn represent the largest sub-request size on HServer and the largest sub-request size on SServer, respectively.
Similar to the network transmission time Tx, the storage access time Ta is determined by the largest sub-request, with the concrete formula:
Ta = max{sm·th, sn·ts}
In the above formula, sm and sn respectively represent the largest sub-request size on HServer and the largest sub-request size on SServer, and th and ts respectively represent the unit data transmission time on HServer and the unit data transmission time on SServer.
Unlike the storage access time Ta and the network transmission time Tx, the network connection time Te is constant and independent of data size. In summary, the network and storage cost T2 calculated by the network and storage cost calculation module 42 can be expressed as:
T2 = Te + Tx + Ta = Te + max{sm·(th + t), sn·(ts + t)}
Further, since sm ≤ h and sn ≤ s, the network and storage cost T2 can be approximated as:
T2 ≈ Te + max{h·(th + t), s·(ts + t)}
In the above formula, h represents the stripe size on HServer, and s represents the stripe size on SServer.
As can be seen from the cost calculation module 4, the total cost T of a request can be expressed as T = T1 + T2. The total cost of the request is a function of parameters describing the application, the file system, and the data layout; in particular, it is highly sensitive to the server stripe sizes h and s.
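The total cost can likewise be sketched in Python; the function names and the unit-cost values below are assumptions for illustration only.

```python
def network_storage_cost(h, s, te, th, ts_unit, t):
    """T2 ~= Te + max{h*(th + t), s*(ts + t)}: the slowest sub-request
    among the HServer and SServer sides dominates."""
    return te + max(h * (th + t), s * (ts_unit + t))

def total_cost(m, n, h, s, mu, tc, te, th, ts_unit, t):
    """Total request cost T = T1 + T2 from the cost calculation module."""
    t1 = 2 * mu + 3 * (m * h + n * s) * tc   # system cost on the FUSE path
    t2 = network_storage_cost(h, s, te, th, ts_unit, t)
    return t1 + t2

# Illustrative values (4 HServers, 2 SServers; times in arbitrary units):
cost = total_cost(m=4, n=2, h=64, s=128, mu=5.0, tc=0.01,
                  te=2.0, th=0.05, ts_unit=0.02, t=0.01)
```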
In addition, since reads and writes differ greatly on the SServers and a write request involves more operations than a read, in that case the time Ts for two context switches between the FUSE module 10 and the daemon module 20 must be extended with the time for write amplification, garbage collection, and wear leveling.
In order to facilitate the explanation of the working principle of the cost calculation module 4, the parameters of the cost analysis model involved in the cost calculation module 4 are presented in table form, as shown in Table 1.

Table 1. Parameters in the cost analysis model
m: number of HServers
n: number of SServers
h: stripe size on HServer
s: stripe size on SServer
sm: largest sub-request size on HServer
sn: largest sub-request size on SServer
r: data size of the file request
μ: context switch time
tc: unit data copy time from kernel space to user space
th: unit data transmission time on HServer
ts: unit data transmission time on SServer
t: data transmission network time
Te: network connection time
α: spreading factor of SServer relative to HServer
B: block size in the configuration
Third, region division module 5
Guided by the cost model generated by the cost calculation module 4, the region division module 5 is able to partition files into different regions, attempting to minimize the total cost of a given access set characteristic of parallel applications. Unlike HARL, which handles region division and stripe size determination in two separate stages, the layout strategy of RLFS is monolithic: it considers region division and stripe size determination in a unified manner, so that RLFS can determine both at one time. RLFS does not heuristically scan the trace file for logical regions as HARL does, but rather places logical regions and physical blocks together, targeting the minimum total cost. This consideration is readily understood because the smallest unit of file access is a block, e.g., 64MB or 128MB, and a logical region can naturally span a sequence of adjacent physical blocks.
A first algorithm can be executed in the region division module 5. It is a relatively fast offline algorithm, and it can be repeated periodically to adapt to the dynamics of the accesses. By "relatively fast" it is meant that the algorithm runs in pseudo-polynomial time; its essence is to first represent the shared file as a sequence of blocks, then partition the file in blocks according to a given set of access requests, and finally find the optimal region partition among these partitions using dynamic programming.
Depending on the I/O events given by the access pattern, e.g., the start or end of an I/O operation, consider an example file F of size L = 12 segments; sequences of adjacent segments are merged into regions, each region being separated by a red vertical dashed line in the figure. Each region has its data striped between HServer and SServer, so a logical I/O request may be served by multiple physical requests on the requested data. With this layout optimization, the total access cost is minimized according to the defined cost model, as compared to conventional strategies.
Let C(i, k) represent the minimum cost when a file with l request events starting from index i is divided into k regions, 0 ≤ i < l. C(i, k) is calculated by the following recursion:

C(i, k) = min over 1 ≤ f ≤ l − i of { W(i, f) + C(i + f, k − 1) }

where W(i, f) defines the cost of the first region, of size f, covering events i to i + f − 1; this region will be striped on HServer and SServer with stripe sizes hi and si, respectively.
The minimum cost C(i, k) of dividing the events into k regions starting from event i can thus be obtained from the recursion: as f ranges from 1 to l − i, the cost W(i, f) of a first region of size f is added to the remaining cost C(i + f, k − 1), and the minimum sum is taken. When the number of remaining segments is not sufficient to support the remaining k-region partitioning, C(i, k) is set to infinity; otherwise C(i, k) is computed by the recursion above.
Given the definition of C(i, k), the (sub-)requests of the region from event i to event i + f − 1 are further determined, and the cost W(i, f) of that region is then computed as:

W(i, f) = Σ over r in R(i, f) of T(r, hi, si)

where R(i, f) denotes the set of (sub-)requests falling in the region.
T(r, hi, si) in the above formula is defined in the cost model. Assuming si = α·hi, there is:

hi = B / (m + α·n)

Here, α ≥ 1 is the spreading factor of SServer relative to HServer, and B denotes the block size in the configuration, so that one block of size B is striped as m stripes of size hi on the HServers and n stripes of size si on the SServers, with B = m·hi + n·si.
By the equations above, an optimal region partition for the file layout can be obtained and the cost of a given set of requests minimized. Further, the file is partitioned on the underlying heterogeneous servers, where each region corresponds to a determined stripe size. Thereafter, for each request R, the corresponding region can be read according to its stripe size, thereby serving the request.
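The dynamic programming recursion above can be sketched as follows. Here `region_cost(i, f)` is a stand-in for the cost model's W(i, f) (the cost of a region covering events i to i + f − 1 with its own stripe sizes), and the toy cost function in the usage line is purely illustrative.

```python
import math

def partition_regions(l, k, region_cost):
    """Minimum-cost split of l request events into exactly k regions.

    Implements C(i, j) = min over f of { region_cost(i, f) + C(i+f, j-1) },
    with C(i, j) = infinity when the remaining events cannot fill j regions.
    Returns (minimum cost, list of (start, end) region boundaries).
    """
    INF = math.inf
    # C[i][j]: minimum cost of covering events i..l-1 with exactly j regions
    C = [[INF] * (k + 1) for _ in range(l + 1)]
    choice = [[0] * (k + 1) for _ in range(l + 1)]
    C[l][0] = 0.0  # no events left, no regions left: done
    for i in range(l - 1, -1, -1):
        for j in range(1, k + 1):
            for f in range(1, l - i + 1):  # size of the first region
                c = region_cost(i, f) + C[i + f][j - 1]
                if c < C[i][j]:
                    C[i][j], choice[i][j] = c, f
    # Backtrack the chosen first-region sizes to recover the boundaries.
    bounds, i, j = [], 0, k
    while j > 0 and i < l:
        f = choice[i][j]
        bounds.append((i, i + f))
        i, j = i + f, j - 1
    return C[0][k], bounds

# Toy cost (illustrative only): quadratic in region size favors even splits.
min_cost, bounds = partition_regions(4, 2, lambda i, f: f * f)
```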
FIG. 5 is a diagram of an exemplary application of the file system in an embodiment of the present invention. As shown in FIG. 5, the file client 1 issues requests on behalf of an application from the computing server 301, the hybrid storage system 100 is responsible for storing and managing the striped regions, and the metadata server 200 (MDS) contains the description information of the files stored in the RLFS. During file operations, client 1 first contacts the MDS to obtain file metadata and then uses it for data access with the hybrid storage system 100 through the RLFS daemon.
RLFS logically maps a large file into multiple small (region) files, each representing a region of the file with a similar I/O workload. The region file is further striped across all HServers and SServers, and each stripe is stored as a separate data file in each storage server. To this end, the MDS maintains a region stripe table (RST) for each physical file in the RLFS, as shown in Table 2 below, where each region of the file is recorded in terms of an offset and a stripe size on each server. When a file is written to the RLFS, its region stripe table (RST) is created by the region division module 5, and the RST is updated when the access pattern changes. To improve efficiency, the RSTs of the files to be read may be cached when mounting and unmounting the RLFS, in the same directory as the application.
Table 2. Data structure of the region stripe table (RST): each region of a file is recorded with its logical offset and its stripe size on each server.
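The RST described above can be sketched as a small data structure; the field and method names are illustrative, since the patent does not fix an exact in-memory layout.

```python
from dataclasses import dataclass, field

@dataclass
class RegionEntry:
    offset: int    # starting logical offset of the region in the file
    stripe_h: int  # stripe size used on each HServer for this region
    stripe_s: int  # stripe size used on each SServer for this region

@dataclass
class RegionStripeTable:
    """Per-file region stripe table (RST) as kept by the MDS."""
    filename: str
    regions: list = field(default_factory=list)

    def add_region(self, offset, stripe_h, stripe_s):
        # Regions are assumed to be appended in ascending offset order.
        self.regions.append(RegionEntry(offset, stripe_h, stripe_s))

    def region_for(self, file_offset):
        """Return the region entry covering a logical file offset."""
        hit = None
        for entry in self.regions:
            if entry.offset <= file_offset:
                hit = entry
        return hit

# Illustrative RST with two regions and different stripe sizes:
rst = RegionStripeTable("bigfile.dat")
rst.add_region(0, 32 * 1024, 160 * 1024)
rst.add_region(1 << 20, 36 * 1024, 148 * 1024)
```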
For the cost calculation module 4, one file server in the parallel file system is used to test the context switch time, the unit data copy time, and the unit data transmission times of HServer and SServer in read/write mode, which may vary with different I/O modes. In addition, a pair of nodes, i.e., a client node and a file server, is used to estimate the network transmission time; the network transmission test may be repeated thousands of times, and the average values are then taken as the parameter values for generating the cost model.
To perform optimal data placement for a particular file, RLFS first uses its region division module 5 to compute the optimal region partition for the file, and then uses the cost model and I/O trace data to determine the stripe size of each region. The optimal region information is calculated while the file is written on each server, and the region division module 5 creates an RST for the file in the MDS for subsequent reads. The MDS accommodates the namespace of RLFS, the RSTs, and other information about each file. Because the region partitioning algorithm produces a limited number of regions, the size of the MDS is well controlled and remains small.
In addition, to facilitate parallel reading of each file, the hybrid storage system 100 maintains a flat namespace, where each file can be identified by "filename_region#_stripe#" on the local disk. Note that "filename" may contain application-specified path information. A background I/O daemon receives incoming requests from the client 1, characterized by "filename", "region#", and "stripe#", and serves the requests by sending back the requested stripe files, which are combined with other stripe files to meet the needs of the application.
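The flat-namespace identifier can be sketched as follows; the exact separator layout is an assumption based on the "filename_region#_stripe#" pattern named in the text.

```python
def stripe_filename(filename, region_idx, stripe_idx):
    """Compose the flat-namespace identifier stored on a server's local disk."""
    return f"{filename}_region{region_idx}_stripe{stripe_idx}"

def parse_stripe_filename(name):
    """Recover (filename, region#, stripe#); splitting from the right keeps
    any application-specified path information inside 'filename' intact."""
    base, stripe = name.rsplit("_stripe", 1)
    filename, region = base.rsplit("_region", 1)
    return filename, int(region), int(stripe)
```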
The file system (RLFS) provided by the present invention supports data layout at the region level by dividing a file into a set of optimal regions, so that the optimal regions and their stripe sizes can be determined by the file system; in this way, the data layout of the hybrid storage system 100 can be optimized by the file system. Compared with a kernel-space approach, the use of the FUSE module 10 not only greatly simplifies development work but also allows an application program to access the RLFS transparently through a standard file system interface. Moreover, the variable stripe sizes of RLFS can reduce load imbalance among servers and flexibly adapt to workload changes and server heterogeneity, thereby significantly accelerating I/O system performance.
The file system (RLFS) provided by the embodiment of the invention is verified by experiments, is determined to be feasible and has excellent performance. Experimental results show that the RLFS can be well matched with the hybrid storage system 100 to run together, and the RLFS greatly improves the parallel I/O performance.
In the experiment, three data layout schemes were compared: scheme one uses fixed-size stripes; scheme two uses randomly selected stripes; scheme three is realized through RLFS. For reads and writes, RLFS uses optimal data layouts of {32KB, 160KB} and {36KB, 148KB} respectively, which improves I/O performance by 73.4% and 176.7% compared to the default layout with fixed-size stripes of 64KB. Compared with other layouts with different but fixed-size stripes, RLFS improves read performance to 138.6% and write performance to 177.6%. Compared with the randomly selected stripe strategy, RLFS improves read performance to 154.5% and write performance to 215.4%.
Experimental results based on representative benchmarks indicate that RLFS is a promising and feasible solution for hybrid parallel file systems, with parallel I/O performance gains ranging from 20.6% to 556.1% for reads and from 22.7% to 288.7% for writes.
The invention also provides a data layout method, which comprises the following steps:
step S1, collecting the runtime I/O information of data access and the file system configuration file used for cost modeling into a trace file, where the file system configuration file is used for establishing the cost model and the I/O information is used for region division;
step S2, calculating or estimating the access cost of the file request to form a cost model;
step S3, generating a distribution area with minimized total cost according to the cost model, and dividing the files into different areas;
step S4, obtaining the size of the strip corresponding to the area.
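Steps S1 to S4 can be sketched end-to-end as follows. For clarity, a brute-force search over region cut points replaces the embodiment's dynamic program, and `region_cost` and `stripe_size` are hypothetical stand-ins for the cost model and the stripe-size rule (hi, si = α·hi).

```python
from itertools import combinations

def layout_file(num_events, k, region_cost, stripe_size):
    """One-shot layout following steps S1-S4: given a trace of num_events
    access events (S1) and a cost model region_cost(i, f) (S2), choose the
    k-region split with minimum total cost (S3) and attach a stripe size
    to each region (S4). Returns (total cost, [(start, end, stripe), ...])."""
    best = None
    # Enumerate all k-region splits via k-1 cut points (S3).
    for cuts in combinations(range(1, num_events), k - 1):
        bounds = list(zip((0,) + cuts, cuts + (num_events,)))
        cost = sum(region_cost(i, j - i) for i, j in bounds)
        if best is None or cost < best[0]:
            best = (cost, bounds)
    cost, bounds = best
    # S4: attach a stripe size to each chosen region.
    return cost, [(i, j, stripe_size(i, j - i)) for i, j in bounds]

# Toy inputs (illustrative): quadratic region cost, stripe proportional to size.
best_cost, regions = layout_file(4, 2, lambda i, f: f * f, lambda i, f: 16 * f)
```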
The advantage of the method is that it supports region-level data layout by dividing a file into a set of optimal regions, so as to determine the optimal regions and their stripe sizes, thereby optimizing the data layout of the hybrid storage system.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A file system is characterized in that the file system comprises an I/O tracer, a cost calculation module and a region division module which are electrically connected with each other, wherein the I/O tracer is used for providing I/O information collected by the I/O tracer when the file system runs to the region division module; the I/O tracer is also used for providing the cost calculation module with the configuration file of the file system collected by the I/O tracer; the cost calculation module is used for calculating or predicting the access cost of the file request in the file system so as to output a cost model to the region division module; the area dividing module is used for generating a distribution area with the minimized total cost according to the cost model and dividing files into different areas, and the area dividing module is also used for obtaining the size of a strip corresponding to the area;
the cost calculation module is used for calculating the total cost of the request, and the total cost is calculated as: T = Ts + Tc + T2, where Ts represents the time for two context switches between the FUSE module and the daemon module, Tc denotes the copy time, and T2 represents the network and storage cost.
2. The file system of claim 1, wherein the file system further comprises a kernel portion for performing interaction of information or data between a metadata server, the hybrid storage system, and a client party; the kernel portion includes a FUSE module.
3. The file system of claim 2, wherein the file system further comprises a daemon module for executing a daemon process in the background; the FUSE module is used as a proxy of the daemon.
4. The file system of claim 3, wherein the file system further comprises an update data placement module, the update data placement module being respectively coupled to the daemon module, the I/O tracer, the zoning module, and the hybrid storage system, the update data placement module being configured to dynamically detect and update a change in a zone.
5. The file system of claim 4, wherein the hybrid storage system comprises a solid state drive-based server SServer and a hard disk drive-based server HServer;
the calculation formula of the replication time is as follows: t isc(r,h,s)≈3(mh+ns)tcIn the formula tcRepresenting the unit data copying time from a kernel space to a user space, r representing the data size of a file request, h representing the size of a stripe on an HServer, s representing the size of a stripe on an SServer, m representing the number of the HServers, and n representing the number of the SServers;
the calculation formula of the network and storage cost is as follows: t is2≈Te+max{h(th+t),s(ts+ t) }, where t denotes the data transmission network time, thAnd tsRespectively representing the cell data transmission time on HServer and the cell data transmission time on SServer, TeRepresenting the network connection time.
6. The file system of claim 5, wherein the region partitioning module is used to obtain the minimum cost C(i, k) of dividing the l events starting from event i into k regions; the minimum cost C(i, k) is calculated as:

C(i, k) = min over 1 ≤ f ≤ l − i of { W(i, f) + C(i + f, k − 1) }

where W(i, f) denotes the cost of the first region, of size f, covering events i to i + f − 1.
7. The file system of claim 6, wherein the solid state drive-based server and the hard disk drive-based server are capable of striping the region covering events i to i + f − 1, obtaining the stripe sizes hi and si respectively; si is calculated as si = α·hi, and hi is calculated as:

hi = B / (m + α·n)

where α ≥ 1 is the spreading factor of SServer relative to HServer, and B denotes the block size in the configuration.
8. A data layout method using the file system according to any one of claims 1 to 7, characterized by comprising:
step S1, collecting the runtime I/O information of data access and the file system configuration file used for cost modeling into a trace file, where the file system configuration file is used for establishing the cost model and the I/O information is used for region division;
step S2, calculating or estimating the access cost of the file request to form a cost model;
step S3, generating a distribution area with minimized total cost according to the cost model, and dividing the files into different areas;
step S4, obtaining the size of the strip corresponding to the area.