CN106874383B

CN106874383B - Decoupling distribution method of metadata of distributed file system

Info

Publication number: CN106874383B
Application number: CN201710016284.7A
Authority: CN
Inventors: 陆游游; 舒继武; 李思阳
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2017-01-10
Filing date: 2017-01-10
Publication date: 2019-12-20
Anticipated expiration: 2037-01-10
Also published as: CN106874383A

Abstract

The invention discloses a method for decoupled distribution of the metadata of a distributed file system, comprising the following steps: separating the metadata of the distributed file system into directory metadata, directory entry metadata, and file metadata; storing the directory metadata centrally at a directory inode, without pointers to directory entries, and executing directory operations according to the directory inode; and, after the directory entry metadata is divided, storing each directory entry on the same node as the related file metadata and establishing a reverse index pointing to the directory metadata. The invention has the following advantages: it reduces the information exchanged among nodes when the distributed file system accesses metadata and lowers metadata access latency; by separating the directory contents it also decouples the strong correlation between files and directories, so that high throughput can be achieved and the metadata processing efficiency of the distributed file system is improved.

Description

Decoupling distribution method of metadata of distributed file system

Technical Field

The invention relates to the field of computers, in particular to a decoupling distribution method of metadata of a distributed file system.

Background

The distributed file system is a novel storage system that supports mass data storage and is widely used in data centers, supercomputing centers, and public cloud platforms. Distributed file systems have several advantages over traditional centralized storage. First, storage can be scaled horizontally: capacity can be expanded dynamically by adding storage nodes, and access throughput improves along with it. Second, the distributed file system has more flexible fault-tolerance strategies than traditional centralized storage; replication and erasure codes can be used for distributed fault tolerance. Third, distributed file systems can use less expensive storage and computing devices to build a large-scale storage cluster that serves large amounts of data. However, constrained by the file system access standard (POSIX), metadata access in distributed file systems tends to become the performance bottleneck. Metadata access cannot meet the requirements of high throughput and low latency, yet in an actual system more than half of all data accesses must pass through a metadata node. To address the scalability of metadata in distributed file systems, the prior art mainly uses the following three techniques:

The first is a distributed metadata node scaling method based on a dynamic directory tree. The namespace of the distributed file system is divided into subtrees by subdirectory, each subtree is stored independently on one node, and the nodes to be accessed are adjusted dynamically according to load. The advantage of this method is that the access location can be adjusted dynamically as the load changes. However, it cannot solve the path traversal problem of file access: when a file is accessed, every directory along the whole path must be accessed, and because these directories are not stored on the same node, this often causes a large access latency.

The second is a metadata scaling method based on a hash algorithm, in which the metadata of the files in a directory is distributed to different nodes by hashing. Its advantage is that it reduces the file access load when a directory contains a large number of files. However, it fails to address the scalability of the directory itself.

The third stores file metadata in a key-value database, exploiting the fast access and low latency of key-value databases. However, it has the same path lookup problem as the first method and therefore still cannot achieve low access latency.

To reduce path lookup latency, these methods cache metadata on the client side, but doing so introduces considerable overhead for handling inconsistency, so the problem is not solved fundamentally.

Disclosure of Invention

The present invention is directed to solving at least one of the above problems.

Therefore, an object of the present invention is to provide a method for decoupled distribution of the metadata of a distributed file system, so as to address metadata scalability and meet the requirements of high throughput and low latency in the distributed file system.

In order to achieve the above object, an embodiment of the present invention discloses a method for decoupled distribution of the metadata of a distributed file system, including the following steps: S1: separating the metadata of the distributed file system into directory metadata, directory entry metadata, and file metadata; S2: placing the directory metadata at a directory inode; S3: dividing the directory entries according to the distribution of the files, storing each directory entry on the node that stores the related file, and establishing a reverse index pointing to the directory metadata.

Further, the directory operation includes creation of a directory, deletion of a directory, reading of a directory, obtaining of all metadata of a directory, changing of a user group in which a directory is located, and changing of a user to which a directory belongs.

Further, the method also includes: providing an identifier that globally and uniquely identifies a file; calculating the hash value of the global identifier of the file to be accessed; and locating the node on which the metadata is stored according to the hash value.

Further, the identifier is the complete path of the file.

Further, the method also includes: when a file or a directory is created, creating, on the node where the file or directory is created, all directory entries along the parent directory path of the file or directory; if some or all of these directory entries already exist on that node, only the remaining directory entries are created.

Further, the method also includes: when a file is deleted, deleting the file's metadata on the node where the file is located, together with the entry pointing to the file in the corresponding directory entry metadata on that node.

Further, the method also includes: when a read-directory or delete-directory operation is performed, accessing all metadata nodes to obtain all directory entries under the directory being read or deleted.

Further, the method also includes: providing a client cache, wherein the directory metadata cached by the client is used to determine whether the client has permission to create a file when the client creates the file; when the client accesses the metadata of a file, the client first accesses the metadata of the directory to check access permission; and when the client has access permission, the metadata of the file is accessed.

Further, the method also includes: changing the permissions of directory metadata and deleting a directory are carried out only after the clients' cached copies of that directory metadata have expired.

According to the method for decoupled distribution of the metadata of a distributed file system of the embodiment of the present invention, every metadata operation on a file accesses at most two nodes, and when the directory metadata is cached only one node needs to be accessed, so the latency is only the RTT of a single round trip. Because key-value storage is used, the latency of fetching the metadata itself is very low and is negligible compared with the RTT of the Ethernet. The method can therefore effectively reduce the information exchanged among nodes when the distributed file system accesses metadata and lower metadata access latency. At the same time, by separating the directory contents, the strong correlation between files and directories is decoupled, so that high throughput can be achieved and the metadata processing efficiency of the distributed file system is improved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram of a method for decoupled distribution of distributed file system metadata according to an embodiment of the present invention;

FIG. 2 is an overall system architecture diagram of one embodiment of the present invention;

FIG. 3 is a schematic diagram of directory partitioning decoupling according to one embodiment of the invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

The invention is described below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a method for decoupled distribution of distributed file system metadata according to an embodiment of the present invention. As shown in fig. 1, a method for decoupling and distributing metadata of a distributed file system according to an embodiment of the present invention includes the following steps:

S1: separating the metadata of the distributed file system into directory metadata, directory entry metadata, and file metadata.

S2: placing the directory metadata at a directory inode.

S3: dividing the directory entries according to the distribution of the files, storing each directory entry on the node that stores the related file, and establishing a reverse index pointing to the directory metadata.

In one embodiment of the invention, the directory metadata of the file system is stored centrally on one node. In this arrangement, the metadata of a directory inode does not contain the address of the directory entry metadata. Only basic directory data is retained, including but not limited to the creation time of the directory, its permission identifier, its group identifier, and its user identifier. On this basis, most metadata operations that involve the metadata of the directory inode but not the directory entry metadata are performed only on the node that stores the directory inode metadata. The directory operations comprise creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing the user group of a directory, and changing the user to which a directory belongs.

In one embodiment of the invention, a hash-based distributed storage mechanism for file metadata is also included. The mechanism allows the storage and access of file metadata to scale across a plurality of nodes, thereby balancing the system load. The algorithm uses an identifier that uniquely determines a file globally: when a client operates on the metadata of a file, it locates the node where the file is stored by calculating the hash value of the globally unique identifier of the file, and operates on the metadata at that node. This approach ensures that any operation on file metadata modifies the information of at most one metadata node.

In one embodiment of the invention, the identification is a complete path of the file.
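The lookup described above can be illustrated with a minimal Python sketch. It assumes SHA-256 as the hash function and a plain modulo over a fixed number of metadata nodes; the source specifies neither, and `locate_node` is a hypothetical name used only for illustration.

```python
import hashlib


def locate_node(full_path: str, num_nodes: int) -> int:
    """Map a file's globally unique identifier (here its full path) to a node index."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    # Assumption: SHA-256 stands in for the unspecified hash function.
    digest = hashlib.sha256(full_path.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the node count.
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Because the result depends only on the identifier and the node count, every client computes the same node for the same file without consulting any other node.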

In one embodiment of the invention, the method also comprises reverse storage of directory entries. The directory entry metadata is divided and distributed across a plurality of nodes, which ensures that creating and deleting a file incurs no extra node accesses. The allocation works as follows: when a file or directory is created, no metadata on any node other than the node creating it needs to be modified, and all directory entries along the parent directory path of the file or directory are created on the creating node. If some or all of these directory entries already exist on this node, only the remaining directory entries are created. Storing directory entries this way ensures that creating or deleting a file does not modify multiple metadata nodes, which reduces the distributed synchronization overhead caused by accessing multiple nodes.

In one embodiment of the present invention, when a file is deleted, the file's metadata on the node where the file is located is deleted, together with the entry pointing to the file in the corresponding directory entry metadata on that node.
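The node-local handling of directory entries on creation and deletion can be sketched as follows. This is an in-memory model under stated assumptions, not the patented implementation: `FileMetadataNode`, its two dictionaries, and the use of normalized absolute paths as keys are all hypothetical. The dictionary keyed by parent directory path plays the role of the reverse index toward the directory.

```python
import posixpath


class FileMetadataNode:
    """Hypothetical in-memory model of one file metadata node."""

    def __init__(self):
        self.inodes = {}    # full path -> file metadata
        self.dentries = {}  # directory path -> names of children stored on this node

    @staticmethod
    def _check(path):
        # Assumption: callers pass normalized absolute paths such as "/a/b/c.txt".
        if (not path.startswith("/") or path.startswith("//")
                or path == "/" or path != posixpath.normpath(path)):
            raise ValueError(f"not a normalized absolute path: {path!r}")

    def create(self, path, metadata=None):
        self._check(path)
        if path in self.inodes:
            raise FileExistsError(path)
        self.inodes[path] = dict(metadata or {})
        # Record every directory entry along the parent path on this node only.
        # Adding to a set is idempotent, so entries that already exist are left alone.
        child = path
        while child != "/":
            parent, name = posixpath.split(child)
            self.dentries.setdefault(parent, set()).add(name)
            child = parent

    def delete(self, path):
        self._check(path)
        del self.inodes[path]  # raises KeyError if the file is not on this node
        parent, name = posixpath.split(path)
        self.dentries[parent].discard(name)
```

Both operations touch only the local dictionaries, which mirrors the claim that creating or deleting a file modifies a single metadata node.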

In one embodiment of the present invention, the method further comprises: when a read-directory or delete-directory operation is performed, accessing all metadata nodes to obtain all directory entries under the directory being read or deleted.

In one embodiment of the present invention, the method further comprises: providing a client cache, wherein the directory metadata cached by the client is used to determine whether the client has permission to create a file when the client creates the file; when accessing the metadata of a file, the client first accesses the metadata of the directory to check access permission; and when the client has access permission, the metadata of the file is accessed.

In one embodiment of the present invention, changing the permissions of directory metadata and deleting a directory are carried out only after the clients' cached copies of that directory metadata have expired.

In order that those skilled in the art will further understand the present invention, the following examples are given for illustration and description.

As shown in FIG. 2, a metadata node for storing directories is provided, which handles all requests for directory metadata in the distributed file system. In implementation, it provides a standard POSIX access interface and receives requests from clients over an Ethernet-based RPC protocol. It is divided into four modules, including an access interface that executes metadata request operations and a key-value store that persists metadata to disk. In addition, a module for storing directory contents caches the directory contents of each node. When the metadata node starts, it scans every value in the key-value store to build the directory contents.

The directory metadata includes the creation time of the directory, the permission identifier of the directory, the group identifier of the directory, the user identifier of the directory, and the globally unique identifier of the directory. The path of the directory is a variable-length character string, and each of the remaining directory metadata fields is a fixed-length 8-byte identifier. The metadata node supports the directory operations, including creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing the user group of a directory, and changing the user to which a directory belongs.

To create a directory, the node receives a directory creation request from a client. The request contains the path of the directory to create, the permission mode for the new directory, and the user group identifier and user identifier of the creating client. The permission mode of a directory comprises the read-write permission of the directory owner, the read-write permission of the user group, and the read-write permission of other users. After receiving the request, the node first checks and confirms the permissions on the parent directory of the directory to be created, and when the creation conditions are met, it writes the metadata of the directory into the storage of the directory metadata node. The creation conditions are that the specified parent directory exists and that the client has access permission on the parent directory.
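The creation check can be sketched in Python as follows. This is a minimal illustration under assumptions: a plain dictionary stands in for the key-value store, POSIX-style write bits stand in for the "access permission" the text mentions, and the names `may_write` and `mkdir` are hypothetical.

```python
import posixpath
import time


def may_write(meta, uid, gid):
    """Check the owner, group, or other write bit of a directory's mode."""
    if uid == meta["uid"]:
        return bool(meta["mode"] & 0o200)
    if gid == meta["gid"]:
        return bool(meta["mode"] & 0o020)
    return bool(meta["mode"] & 0o002)


def mkdir(store, path, mode, uid, gid):
    """Create a directory in `store`, a dict standing in for the key-value store."""
    parent = posixpath.dirname(path)
    # Condition 1: the specified parent directory must exist.
    if parent not in store:
        raise FileNotFoundError(parent)
    # Condition 2: the client must have access permission on the parent.
    if not may_write(store[parent], uid, gid):
        raise PermissionError(parent)
    if path in store:
        raise FileExistsError(path)
    # Key: directory path. Value: the remaining directory metadata.
    store[path] = {"ctime": int(time.time()), "mode": mode, "uid": uid, "gid": gid}
```

Only the directory metadata node's own store is consulted, so the whole check completes without contacting any file metadata node.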

Deleting a directory, reading all metadata of a directory, changing the user group of a directory, and changing the attributes of a directory are handled in the same way: once the two conditions are confirmed, the corresponding operation is performed on the directory. In addition, when a directory is deleted, it must be determined that its subdirectories and the files in them have already been deleted.

In the overall architecture, there is logically only one directory metadata node in the whole system, and all directory metadata is stored on it. To ensure its reliability, synchronous backup may be used rather than distributed scaling.

FIG. 2 also depicts a node storing file metadata, which processes all file-related metadata requests of the distributed file system and stores that metadata. It comprises a standard POSIX access interface supporting file operations, a metadata management layer that classifies and stores metadata, and a key-value store that persists the metadata to disk.

The metadata of each file comprises the globally unique identifier of its parent directory, the file name, the access time of the file, the file permissions, the user group identifier of the file, the file modification time, the access time of the file content, the file size, the file block size, the global identifier of the metadata node that created the file, and the identifier assigned to the file by that metadata node. The globally unique identifier of the parent directory and the file name are combined into a variable-length string. Each of the other metadata fields is 8 bytes long. The node supports the file metadata operations, including file creation, file deletion, obtaining file metadata, changing the file's user group, changing file permissions, reading and writing the file, and changing the file size.

File creation is divided into two stages. In the first stage, the client accesses the directory to which the new file belongs: it contacts the directory metadata node to determine whether it has permission to create a file in that directory. In the second stage, the file metadata node checks for the file, and when the file is determined not to exist on the node, the file is created successfully.

For other file metadata operations, the file must first be found, then it is determined whether the client has access permission, and if so the corresponding modification is made. Among these operations, reading and writing a file can complete the metadata modification only after the file data has been written to, or successfully read from, the backend storage system, and this process needs to be defined as a transaction.

The metadata node is a network node based on RPC communication, and a corresponding service port needs to be opened during initialization, and a database for storing directory metadata is started at the same time. For the processing of metadata operation requests, the requests are processed concurrently in a multi-thread manner.

As shown in FIG. 2, there are logically multiple file metadata nodes in the whole system. The node on which a file's metadata is stored is determined by a computation over the string formed from the unique identifier of its parent directory and its file name. The computation uses a consistent hashing algorithm, which provides good scalability and supports dynamically adding file metadata nodes.
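A minimal consistent hashing ring illustrates the placement rule. Everything beyond "consistent hashing over the parent identifier plus file name" is an assumption of this sketch: SHA-256 as the hash, 64 virtual points per node, the `/` separator in the key string, and the names `HashRing`, `add_node`, and `locate`.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    # Assumption: SHA-256 truncated to 64 bits stands in for the unspecified hash.
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hashing ring over file metadata nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted (hash, node) pairs on the ring
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node contributes several virtual points to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def locate(self, parent_id: int, filename: str) -> str:
        if not self._points:
            raise LookupError("ring has no nodes")
        key = _hash(f"{parent_id}/{filename}")
        # First ring point at or after the key, wrapping around at the end.
        index = bisect.bisect_left(self._points, (key,)) % len(self._points)
        return self._points[index][1]
```

When a node is added, a key either stays on its previous node or moves to the new one, which is the property that makes dynamic expansion cheap compared with a plain modulo scheme.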

As shown in FIG. 2, the architecture further includes a distributed file system client with a cache. On this client, an application may access the distributed storage directly through the library provided by the distributed file system. On Linux, a user can instead call the library through the user-space file system framework (FUSE) and mount the file system directly at the client. When the client starts, a directory cache is established at the client, and it is placed in the access library.

When the client accesses metadata, it uses a different access strategy for each kind of operation over the RPC protocol. Directory metadata operations include mkdir (create a directory), getattr (get the attributes of a file or directory; here only the metadata of a directory), chmod (set file or directory permissions), and chown (change the owner of a file or directory). For these, the client accesses the directory metadata node directly, passing different parameters for different operations; the node processes the request and returns the result to the client. File metadata operations include open, read, write, getattr, truncate, and so on. For these, the client first consults the directory cache to check whether it has access permission on the parent directory. If the directory cache has no relevant information, the client first accesses the directory metadata node, caches the metadata of the parent directory in its directory cache, and then checks the permission; once permission is confirmed, the client sends the metadata operation request to the file metadata node. For readdir (read directory contents), the client sends a read-directory-contents request to all file metadata nodes and the directory metadata node; each node returns the directory contents it stores, and the client merges them into an ordered list and returns it to the user. For rmdir, the client sends a request for the directory contents to all file metadata nodes and the directory metadata node; when all returned results are empty, deleting the directory is determined to be legal, and a request to delete the directory can then be sent to the file metadata nodes.
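The client-side aggregation for readdir and the emptiness check for rmdir can be sketched as below. Each node is modeled as a dictionary from directory path to the set of names it stores, standing in for the RPC reply; `readdir` and `can_rmdir` are hypothetical helper names.

```python
def readdir(directory, nodes):
    """Merge the partial directory contents returned by every metadata node."""
    names = set()
    for node in nodes:
        # Each node returns only the entries it stores locally (possibly none).
        names.update(node.get(directory, ()))
    # The client returns one ordered list to the user.
    return sorted(names)


def can_rmdir(directory, nodes):
    """A directory may be deleted only if every node reports it as empty."""
    return not readdir(directory, nodes)
```

For example, with three nodes holding `{"x.txt"}`, `{"b", "y.txt"}`, and nothing for `/a`, `readdir("/a", nodes)` yields `["b", "x.txt", "y.txt"]`. The cost is one request per node, which is the price paid for keeping create and delete local to a single node.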

Client caching in this example must also handle the following situation. Because some clients modify directories while running, cached directory metadata must be given a cache expiry time, and changing the permissions of a directory or deleting a directory must wait until the cached copies of that directory's metadata have expired on all clients before proceeding. Therefore, when a client sends either of these two requests, the request is not executed immediately; the update or deletion of the directory is carried out only after the cache expiry time set by the directory metadata node has elapsed.
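One way to model this deferred update is a lease: every cached copy carries an expiry time, and a permission change or deletion is applied only after one full lease period has passed. The sketch below is an assumption-laden illustration (explicit `now` timestamps instead of a real clock, one pending change per directory, hypothetical class and method names), not the mechanism as implemented in the patent.

```python
class DirectoryMetadataNode:
    """Hypothetical model of lease-based deferral of chmod and rmdir."""

    def __init__(self, lease_seconds):
        self.lease = lease_seconds
        self.modes = {}    # directory path -> permission mode
        self.pending = {}  # directory path -> (ready_at, new_mode); None means delete

    def mkdir(self, path, mode):
        self.modes[path] = mode

    def _apply_due(self, now):
        for path, (ready_at, new_mode) in list(self.pending.items()):
            if now >= ready_at:
                del self.pending[path]
                if new_mode is None:
                    self.modes.pop(path, None)
                else:
                    self.modes[path] = new_mode

    def fetch_for_cache(self, path, now):
        """Return (mode, expires_at) for a client to cache."""
        self._apply_due(now)
        expires_at = now + self.lease
        if path in self.pending:
            # Never hand out a copy that outlives a scheduled change.
            expires_at = min(expires_at, self.pending[path][0])
        return self.modes[path], expires_at

    def request_chmod(self, path, new_mode, now):
        self._apply_due(now)
        # Copies handed out before `now` expire by now + lease at the latest.
        self.pending[path] = (now + self.lease, new_mode)

    def request_rmdir(self, path, now):
        self._apply_due(now)
        self.pending[path] = (now + self.lease, None)
```

With a 10-second lease, a chmod requested at time 5 takes effect at time 15; a client fetching at time 12 still sees the old mode, but its copy expires at 15 rather than 22.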

The client shown in FIG. 2 also needs a configuration file that stores the global map. The global map contains the IP addresses of the directory metadata node and the file metadata nodes, as well as the globally unique number of each file metadata node. Using the hash algorithm together with the global map and these numbers, the client computes which metadata node to contact, and can then exchange information with that node over the network. During initialization, the client reads in the global map and establishes heartbeat links with the directory metadata node and the file metadata nodes, checking at regular intervals whether they are still working normally.

As shown in FIG. 3, organizing the metadata requires partitioning it. This example defines four structures: the metadata of a directory (d-inode), the content of a directory (d-entry), the metadata of a file (f-inode), and the content of a file (f-content). In a conventional file system, a d-inode indexes f-inodes through d-entries, and an f-inode indexes the specific content of a file. This example instead proposes a new way of organizing metadata, in which each node organizes its own share of the directory contents. As shown in FIG. 2, the original d-entry is decoupled and distributed across the nodes, which removes the strong coupling that the d-entry imposes between files and directories. Specifically, when a file is created, the client first reads the parent directory to determine that the file can be created; as described above, the metadata of the parent directory can be obtained from the client's directory cache. The client then assigns the file to a file metadata node for storage, using consistent hashing. When the file is stored, it is added to the directory content cache of the file metadata node on which it resides, which avoids interaction with other nodes and reduces access latency. When the directory content is needed, each node is accessed and the decoupled pieces are aggregated into the complete directory content, which is sent to the application.

As shown in FIG. 2, metadata must be persisted in a key-value database. In this example, the name or the globally unique identifier of a file or directory is used as the key, and the metadata of the file or directory is stored as the value in the key-value database.

In the directory metadata node, the key-value database uses the path of the directory as the lookup key and stores the remaining metadata as the value.

In the file metadata node, a character string composed of a globally unique identifier of a parent directory and a file name is used as a key for searching, and other metadata is used as a value and stored in the file metadata node.

The client of the distributed file system caches metadata using the same key-value scheme as the directory metadata node. Specifically, when the metadata of a directory is stored, the path of the directory is used as the key and the other metadata of the directory is stored as the value; a globally unique identifier of the directory, which is managed by the directory metadata node, is appended to the end of the value. When a file is created, the unique identifier of its parent directory and the file name are used as the key, and the rest of the file's metadata is stored as the value. Because files are distributed across the nodes, having a single node create and manage unique identifiers for all files would inevitably make that node a performance bottleneck. Therefore, this example forms a globally unique file identifier from the identifier of the node that created the file and a node-local unique identifier that this metadata node generates for the file. Once assigned, this identifier never changes: neither renaming the file nor moving its path alters it. The identifier is stored at the end of the value when the metadata of the file is saved.
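The two-part identifier can be sketched with a per-node counter. The tuple representation and the name `FileIdAllocator` are assumptions for illustration; the source states only that the identifier combines the creating node's identifier with an identifier unique on that node.

```python
import itertools


class FileIdAllocator:
    """Hypothetical per-node allocator of globally unique file identifiers."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        # Node-local counter: no coordination with other nodes is needed.
        self._local = itertools.count(1)

    def allocate(self):
        # (creating node id, node-local id) is unique as long as node ids are unique.
        return (self.node_id, next(self._local))
```

Two nodes can hand out identifiers concurrently, for example `(1, 1)` and `(2, 1)`, without ever colliding, so no single node becomes an allocation bottleneck.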

In the internal data structure of the key-value store, this example uses different key-value stores for different data services. The directory metadata node uses a key value storage database based on a B tree, the file metadata node uses a key value storage database based on a hash, and a client of the distributed file system uses a database based on a memory to store metadata cache of a directory.

To optimize metadata storage, this example uses variable-length keys and fixed-length values. This requires the set of fields in each metadata record to be fixed. Every field of the directory metadata has a fixed length, and so does every field of the file metadata. During storage, the fixed-length fields are written directly into the value of the key-value store without serialization or deserialization. In addition, no extra in-memory data structure is needed to cache the metadata; it is cached directly in the cache of the key-value database.
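The fixed-length value layout can be illustrated with Python's `struct` module. The field order, little-endian byte order, and function names are assumptions; the source states only that each directory metadata field other than the path is 8 bytes long. In Python, packing is itself a conversion step, so this sketch only emulates the raw fixed layout that a lower-level implementation would write directly.

```python
import struct

# Five unsigned 64-bit fields: creation time, permission, group id, user id, directory id.
DIR_VALUE = struct.Struct("<5Q")


def pack_dir_value(ctime, mode, gid, uid, dir_id):
    """Build the 40-byte value stored under the directory's path key."""
    return DIR_VALUE.pack(ctime, mode, gid, uid, dir_id)


def unpack_dir_value(raw):
    """Read the five fields back from a 40-byte value."""
    return DIR_VALUE.unpack(raw)
```

Because every value has the same size, a field can be located at a fixed byte offset, for example the user identifier at bytes 24 to 31, without parsing the rest of the record.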

As for the results achieved by the invention, every metadata operation on a file accesses at most two nodes, and when the directory metadata is cached only one node needs to be accessed, so the latency is only the RTT of a single round trip. Because key-value storage is used, the latency of fetching the metadata is very low and negligible compared with the Ethernet RTT. The method can therefore effectively reduce the information exchanged among nodes when the distributed file system accesses metadata and lower metadata access latency.

In addition, other configurations and functions of the method for decoupling and distributing metadata of a distributed file system according to the embodiment of the present invention are known to those skilled in the art, and are not described in detail for reducing redundancy.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for decoupling and distributing metadata of a distributed file system is characterized by comprising the following steps:

S1: separating metadata of the distributed file system into directory metadata, directory entry metadata, and file metadata;

S2: placing the directory metadata at a directory inode;

S3: dividing the directory entries according to the distribution of the files, storing each directory entry on the node that stores the related file, and establishing a reverse index pointing to the directory metadata;

further comprising: when a file or a directory is created, creating, on the node where the file or directory is created, all directory entries along the parent directory path of the file or directory; if some or all of these directory entries already exist on that node, only the remaining directory entries are created.

2. The method of claim 1, wherein the directory operations include creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing a user group in which the directory is located, and changing a user to which the directory belongs.

3. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

providing an identifier that globally and uniquely identifies a file;

calculating the hash value of the global identifier of the file to be accessed;

and locating the node on which the metadata is stored according to the hash value.

4. The method of claim 3, wherein the identifier is a complete path of the file.

5. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

when a file is deleted, deleting the file's metadata on the node where the file is located, together with the entry pointing to the file in the corresponding directory entry metadata on that node.

6. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

when a read directory or delete directory operation is performed, all metadata nodes are accessed to obtain all directory entries under the read directory or delete directory.

7. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

providing a client cache, wherein the directory metadata cached by the client is used for determining whether the client has the authority of creating the file when the client creates the file;

when the client accesses the metadata of the file, the client accesses the metadata of the directory to acquire access authority;

when the client has access rights, metadata of the file is accessed.

8. The method for decoupled distribution of distributed file system metadata according to claim 7, further comprising:

changing the permissions of directory metadata and deleting a directory are carried out only after the clients' cached copies of that directory metadata have expired.