WO2020118955A1 - Distributed metadata management consistency guarantee method, device, system and application - Google Patents

Distributed metadata management consistency guarantee method, device, system and application Download PDF

Info

Publication number
WO2020118955A1
WO2020118955A1 (application PCT/CN2019/078843, CN2019078843W)
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
client
consistency
update
node
Prior art date
Application number
PCT/CN2019/078843
Other languages
English (en)
French (fr)
Inventor
卢宇彤
陈志广
肖侬
刘芳
Original Assignee
中山大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学 filed Critical 中山大学
Priority to US17/274,434 priority Critical patent/US20220050809A1/en
Publication of WO2020118955A1 publication Critical patent/WO2020118955A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164File meta data generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • G06F16/122File system administration, e.g. details of archiving or snapshots using management policies
    • G06F16/125File system administration, e.g. details of archiving or snapshots using management policies characterised by the use of retention policies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162Delete operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/1734Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/178Techniques for file synchronisation in file systems

Definitions

  • the present invention relates to the field of file systems for large-scale data storage, and in particular to a method, device, system, and application for guaranteeing consistency in distributed metadata management, which introduce consistency nodes into the metadata cluster of a distributed file system and thereby ensure metadata consistency at very low overhead.
  • GlusterFS uses a decentralized architecture, and any node in the file system cluster can be used as a metadata server.
  • the above methods provide solutions for implementing scalable metadata services, but pay little attention to the problem of data consistency among multiple metadata servers.
  • the consistency of file system metadata has long been a topic of concern in academia. Most metadata update operations involve multiple disk writes, and a system failure can occur at any point while they execute. If only some of the writes have been persisted to disk when a failure occurs, the data will be inconsistent after the system restarts.
  • current file systems generally use a journaling mechanism to ensure metadata consistency: when an upper-layer application asks the file system to update metadata, all writes corresponding to that update are first assembled into a transaction, and the transaction is encapsulated into a single write operation that can be executed atomically.
  • if that single write is successfully persisted to the storage device, the corresponding metadata update is eventually reflected in the file system; if a system failure occurs while it executes, the update ultimately fails and makes no change to the file system.
  • the journaling mechanism thus ensures that a metadata update either succeeds completely or fails without any effect on the file system, and never leaves the file system metadata inconsistent.
  • in a distributed file system, metadata update operations are submitted by the client and executed by a remote metadata server.
  • the client generally first sends the update operation to the metadata server.
  • the metadata server persists the update log to its storage device and then sends a confirmation message to the client.
  • after receiving the confirmation message, the client can conclude that the update operation will execute successfully.
  • when the distributed file system has only one metadata server, the above mechanism achieves a consistent metadata update after two network interactions.
  • however, current distributed file systems prefer to use multiple metadata servers in order to improve scalability.
  • in that setting, the "two-phase commit" protocol is widely used to ensure the consistency of distributed transactions.
  • concretely, the client sends an update request to each involved metadata server; each server persists the update operation to its storage device as a log and then notifies the client that log persistence is complete on that server. After the client receives log-persistence-success messages from all metadata servers, it decides that the update operation can be executed on all involved servers.
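The message pattern of the two-phase commit flow described above can be sketched as a minimal in-memory simulation. All class and function names here are illustrative, not from the patent, and failure handling and real network I/O are omitted:

```python
# Minimal sketch of the two-phase commit flow described above.
# "Persistence" is modeled by appending to an in-memory list.

class MetadataServer:
    def __init__(self, name):
        self.name = name
        self.log = []        # update logs "persisted" in phase 1
        self.metadata = {}   # state actually applied in phase 2

    def prepare(self, update):
        # Phase 1: persist the update as a log entry, then acknowledge.
        self.log.append(update)
        return True

    def commit(self, update):
        # Phase 2: really execute the logged metadata update.
        key, value = update
        self.metadata[key] = value
        return True

def two_phase_commit(servers, update):
    # Phase 1: the client waits for a log-persisted ack from every server.
    if not all(s.prepare(update) for s in servers):
        return False
    # Phase 2: only then does it tell every server to apply the update;
    # each server acknowledges again, so the slowest server bounds the
    # latency of the whole operation.
    return all(s.commit(update) for s in servers)

servers = [MetadataServer(f"mds{i}") for i in range(3)]
assert two_phase_commit(servers, ("/home/a.txt", {"size": 4096}))
```

The sketch makes the cost visible: every involved server sits on the critical path twice, which is exactly the latency problem the patent's consistency node is designed to avoid.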
  • a distributed metadata management consistency guarantee method, device, system, and application are provided.
  • the present invention can guarantee the consistency of distributed metadata management while improving metadata access performance as much as possible.
  • the implementation steps of the distributed metadata management consistency guarantee method include: 1) intercepting a metadata operation request from a client, jumping to step 2) if it is a metadata update operation, and to step 3) if it is a metadata read operation targeting metadata already marked invalid; 2) returning an update-success message to the client, sequentially recording the update request, marking the old metadata on the metadata server holding that metadata as invalid, and asynchronously synchronizing the recorded updates to that server while deleting the invalid marks of the synchronized metadata; 3) returning to the client the metadata that has not yet completed asynchronous synchronization.
  • the detailed steps of step 2) include: 2.1) returning an update-success message to the client; 2.2) encapsulating the update into a log and persisting it to the storage device with one atomic write; 2.3) sending an invalidation message to the metadata server holding the metadata; 2.4) periodically and asynchronously synchronizing the recorded updates to that server and deleting the invalid marks of the synchronized metadata.
  • the invention also provides a distributed metadata management consistency guarantee device, including:
  • an operation request judgment program unit for intercepting metadata operation requests from the client: if the request is a metadata update operation, it jumps to the update operation handler unit; otherwise, if it is a metadata read operation, it jumps to the read operation handler unit;
  • an update operation handler unit for returning an update-success message to the client and sequentially recording the metadata update request, marking the old metadata stored on the metadata server holding that metadata as invalid, asynchronously synchronizing the sequentially recorded update requests to that metadata server, and deleting the invalid marks of the synchronized metadata;
  • a read operation handler unit for returning to the client the metadata that has not yet completed asynchronous synchronization.
  • the present invention also provides a distributed metadata management consistency guarantee device comprising a consistency guarantee node formed by at least one computer device, the node being programmed to perform the steps of the distributed metadata management consistency guarantee method of the present invention.
  • the invention also provides a distributed metadata management consistency guarantee system comprising a client and at least one metadata server, with the aforementioned consistency guarantee node of the invention connected between the client and the metadata server.
  • the invention also provides an application method of the foregoing distributed metadata management consistency guarantee system, whose implementation steps include:
  • S1) the client determines the type of the metadata operation request it is about to issue: if it is a metadata update operation, it jumps to step S2); otherwise, if it is a metadata read operation, it jumps to step S3);
  • S2) the client selects a consistency guarantee node and sends the metadata update operation to it; after receiving the update-success message from the node, it finishes and exits;
  • S3) the client sends the metadata read operation to its target metadata server;
  • S4) the target metadata server checks whether the target metadata of the read operation carries an invalid mark: if not, it returns the target metadata to the client, which finishes and exits upon receiving it; if the mark is present, the server returns a target-metadata-invalid message, and upon receiving it the client jumps to step S5);
  • S5) the client selects a consistency guarantee node and sends it a metadata read operation for the invalidated metadata; after receiving from the node the metadata that has not yet completed asynchronous synchronization, it finishes and exits.
  • when the client selects a consistency guarantee node in steps S2) and S5), it selects the corresponding consistency node according to the file name of the metadata update or read operation.
  • concretely, the file name filename of the metadata update or read operation is fed to a hash function h(x), and the consistency node numbered h(filename)%N is selected, where % is the modulo operator and N is the number of consistency nodes in the metadata cluster.
  • the present invention has the following advantages:
  • the present invention separates metadata reads from writes, reducing burst load on the metadata server to some extent and fully exploiting the performance of solid-state disks.
  • metadata update operations are first sent to a consistency node and later applied asynchronously to the metadata server, while metadata read operations are sent directly to the metadata server.
  • when the read load on the metadata server is heavy, applying the latest data from the consistency node to the metadata server can be deferred; this strategy achieves a degree of read/write separation for metadata.
  • read/write separation brings at least two advantages: (1) write load is postponed while read load is heavy, lowering the metadata server's peak load; (2) separating reads and writes on a solid-state disk fully exploits its concurrency, which helps improve the performance of metadata servers that use solid-state disks as storage devices.
  • FIG. 1 is a schematic structural diagram of a metadata cluster in an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a metadata update process in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a metadata reading process in an embodiment of the present invention.
  • the traditional large-scale file system is composed of three types of nodes: client, metadata server, and data server.
  • the client is the initiator of all read and write requests
  • the metadata server is responsible for storing file metadata and responding to metadata requests sent by the client
  • the data server is responsible for storing file data and responding to data requests sent by the client.
  • on top of the above basic architecture of a traditional large-scale file system, this embodiment introduces a new type of node, the consistency node (Consistent Node), which enforces metadata management consistency.
  • the specific architecture is shown in FIG. 1.
  • Figure 1 shows a file system organization comprising 4 clients, 3 metadata servers, and 2 consistency nodes, where "dirty" denotes dirty data; the data servers are omitted because they are not involved in the technology of the present invention.
  • the functions of the client and the metadata server are the same as in a traditional file system.
  • the consistency nodes and the metadata servers together form the metadata cluster, in which the consistency nodes are mainly used to guarantee metadata consistency at low overhead during metadata updates.
  • the implementation steps of the distributed metadata management consistency guarantee method of this embodiment include: 1) intercepting a metadata operation request from a client, jumping to step 2) for a metadata update operation and to step 3) for a read operation targeting metadata already marked invalid; 2) acknowledging the update to the client, sequentially recording the update request, marking the old metadata on the metadata server as invalid, and asynchronously synchronizing the recorded updates to that server while clearing the invalid marks; 3) returning to the client the metadata that has not yet completed asynchronous synchronization.
  • the distributed metadata management consistency guarantee method of this embodiment first proposes the above large-scale file system architecture with consistency nodes, and on that basis designs a new set of metadata read and write processes that balance consistency with high performance.
  • as shown in Figure 3, the detailed steps of step 2) include: 2.1) returning an update-success message to the client; 2.2) encapsulating the update into a log and persisting it to the storage device with one atomic write; 2.3) sending an invalidation message to the metadata server holding the metadata, marking the old copy as invalid; 2.4) periodically and asynchronously synchronizing the recorded updates to that server and deleting the invalid marks of the synchronized metadata.
  • this embodiment also provides a distributed metadata management consistency guarantee device, comprising:
  • an operation request judgment program unit for intercepting metadata operation requests from the client: if the request is a metadata update operation, it jumps to the update operation handler unit; otherwise, if it is a metadata read operation, it jumps to the read operation handler unit;
  • an update operation handler unit for returning an update-success message to the client and sequentially recording the metadata update request, marking the old metadata stored on the metadata server holding that metadata as invalid, asynchronously synchronizing the sequentially recorded update requests to that metadata server, and deleting the invalid marks of the synchronized metadata;
  • a read operation handler unit for returning to the client the metadata that has not yet completed asynchronous synchronization.
  • this embodiment also provides a distributed metadata management consistency guarantee device comprising a consistency guarantee node formed by at least one computer device, the node being programmed to perform the steps of the foregoing distributed metadata management consistency guarantee method of this embodiment.
  • this embodiment also provides an application method of the foregoing distributed metadata management consistency guarantee system, whose implementation steps include:
  • S1) the client determines the type of the metadata operation request it is about to issue: if it is a metadata update operation, it jumps to step S2); otherwise, if it is a metadata read operation, it jumps to step S3);
  • S2) the client selects a consistency guarantee node and sends the metadata update operation to it; after receiving the update-success message from the node, it finishes and exits;
  • S3) the client sends the metadata read operation to its target metadata server;
  • S4) the target metadata server checks whether the target metadata of the read operation carries an invalid mark: if not, it returns the target metadata to the client, which finishes and exits upon receiving it; if the mark is present, the server returns a target-metadata-invalid message, and upon receiving it the client jumps to step S5);
  • S5) the client selects a consistency guarantee node and sends it a metadata read operation for the invalidated metadata; after receiving from the node the metadata that has not yet completed asynchronous synchronization, it finishes and exits.
  • when the client selects a consistency guarantee node in steps S2) and S5), it selects the corresponding consistency node according to the file name of the metadata update or read operation, which achieves load balancing across multiple consistency nodes.
  • concretely, the file name filename of the metadata update or read operation is fed to a hash function h(x), and the consistency node numbered h(filename)%N is selected, where % is the modulo operator and N is the number of consistency nodes in the metadata cluster.
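The h(filename)%N selection rule can be sketched in a few lines. The patent leaves the hash function h(x) unspecified, so CRC32 is used here only as a stable, deterministic stand-in:

```python
import zlib

def pick_consistency_node(filename: str, n_nodes: int) -> int:
    """Select the consistency node numbered h(filename) % N.

    zlib.crc32 stands in for the unspecified hash function h(x);
    any stable hash over the file name works the same way.
    """
    return zlib.crc32(filename.encode("utf-8")) % n_nodes

# The same file name always maps to the same node, so a client that
# later reads metadata not yet synchronized to a metadata server can
# recompute the hash and find the node holding the newest update log.
node = pick_consistency_node("/home/user/a.txt", 2)
assert 0 <= node < 2
assert node == pick_consistency_node("/home/user/a.txt", 2)
```

Because the node is derived purely from the file name, locating the latest metadata requires no extra lookup table, storage, or IO.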
  • the client application issues a metadata update request through a system call, which is embedded into the client of the distributed file system designed by the present invention through the virtual file system.
  • when the client issues a metadata update request, the update log is first sent to a consistency node (by contrast, in a traditional distributed file system the client sends the update request directly to the metadata server).
  • the consistency node quickly persists the log to its local storage device; once persistence succeeds, it can return an update-success confirmation message to the client.
  • sequential writes achieve good performance on all storage devices, further reducing the latency of persisting metadata updates. At this point, the update issued by the client is reflected only on the consistency node and has not been committed to the metadata server. From the perspective of the whole metadata cluster, however, the update has been recorded with consistency and durability guaranteed, and the remaining work is to synchronize data between the consistency node and the metadata server inside the metadata cluster. Since the consistency node has told the client that the updated metadata is persisted in the metadata cluster, the client need not wait for the update to be synchronized from the consistency node to the metadata server and can switch directly to other tasks.
  • the data synchronization from the consistency node to the metadata server is therefore off the IO critical path and can be performed asynchronously.
  • the update log on the consistency node is committed to the metadata server when the server's load is relatively light. Because this synchronization is asynchronous, the metadata server's state lags slightly behind the consistency node's; during an unsynchronized period, a client reading the metadata server cannot obtain the latest metadata.
  • to mitigate this, immediately after persisting an update log the consistency node notifies the metadata server that the update has been persisted on the consistency node but will be synchronized later.
  • this notification completes in a single network interaction, and the metadata server only needs to record the information in memory upon receiving it, without reading or writing its storage device, so it adds little overhead.
  • since the sequentially recorded metadata update requests are synchronized asynchronously with the metadata server holding the metadata, they can be pushed to the server when it is relatively idle.
  • the metadata cluster designed in this embodiment comprises consistency nodes and metadata servers, and data between the two types of nodes may be out of sync; but this can happen only for the most recently updated metadata.
  • the latest versions of most metadata are still stored on the metadata server, so when the client initiates a metadata read request it still interacts with the metadata server first.
  • in the metadata server's memory, tags indicate which metadata has been persisted to a consistency node but not yet synchronized to the metadata server.
  • if the requested metadata falls into this category, the metadata server actively fetches the latest version from the consistency node, returns it to the client, and at the same time updates its own copy to the latest state; if the metadata the server holds is already the latest version (i.e., no in-memory tag says the latest version is on a consistency node), it can return that version directly to the client.
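The read path just described, with its in-memory tags, can be sketched as follows. Names are illustrative and network calls are reduced to method calls; this models the embodiment variant in which the metadata server itself fetches from the consistency node (the application-method claims instead have the client re-contact the node):

```python
# Sketch of the read path: in-memory tags on the metadata server mark
# entries whose newest version still sits on a consistency node.

class ConsistencyNode:
    def __init__(self):
        self.updates = {}                 # newest metadata, by file name

class MetadataServer:
    def __init__(self, cnode):
        self.store = {}                   # persisted metadata copies
        self.stale = set()                # in-memory "not yet synced" tags
        self.cnode = cnode

    def read(self, filename):
        if filename in self.stale:
            # Actively fetch the latest version from the consistency
            # node, refresh the local copy, and drop the tag.
            latest = self.cnode.updates[filename]
            self.store[filename] = latest
            self.stale.discard(filename)
            return latest
        # The local copy is already the latest version.
        return self.store.get(filename)

cnode = ConsistencyNode()
server = MetadataServer(cnode)
server.store["/a"] = {"size": 1}
cnode.updates["/a"] = {"size": 2}        # newer, not-yet-synced update
server.stale.add("/a")                   # tag recorded in memory only
assert server.read("/a") == {"size": 2}  # served via the consistency node
assert server.read("/a") == {"size": 2}  # now served from the local copy
assert "/a" not in server.stale
```

Note that the tag set lives only in memory, matching the patent's point that recording the notification requires no reads or writes to the storage device.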

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A distributed metadata management consistency guarantee method, device, system, and application. The device deploys consistency nodes in the metadata cluster. A client sends a metadata update request to a consistency node; the consistency node returns an update-success message to the client, sequentially records the update request, marks the old metadata as invalid, and removes the invalid mark after asynchronously synchronizing the data with the metadata server. The client sends metadata read operations to the metadata server; if the target of the read operation is marked invalid, the consistency node returns the read data that has not yet completed asynchronous synchronization; otherwise, the metadata server holding the metadata returns the read data directly. The method, device, system, and application can guarantee the consistency of distributed metadata management while improving metadata access performance as much as possible.

Description

Distributed metadata management consistency guarantee method, device, system and application [Technical Field]
The present invention relates to the field of file systems for large-scale data storage, and in particular to a distributed metadata management consistency guarantee method, device, system, and application that introduce consistency nodes into the metadata cluster of a distributed file system, thereby guaranteeing metadata consistency at very low overhead.
[Background Art]
As the volume of data accumulated by human society keeps growing, storage systems become ever larger, and building scalable storage systems has attracted increasing attention from academia and industry. Distributed file systems, an important means of achieving scalable storage, have drawn wide interest. Traditional distributed file systems such as GFS, Lustre, GPFS, and PVFS all use a single metadata server; this centralized metadata management has become the main obstacle to performance scaling of distributed file systems. In recent years, several file systems have introduced distributed metadata management. For example, newer versions of Lustre support multiple metadata servers and distribute the file system directory tree across them by subtree partitioning; CephFS builds on subtree partitioning and can also dynamically migrate subtrees between metadata servers; GlusterFS adopts a decentralized architecture in which any node of the file system cluster can act as a metadata server. These approaches provide solutions for scalable metadata services, but pay little attention to data consistency among multiple metadata servers.
The consistency of file system metadata has long been a topic of concern in academia. Most metadata update operations involve multiple disk writes, and a system failure can occur at any point during their execution. If only some of the writes have been persisted to disk when a failure occurs, the data will be inconsistent after the system restarts. To address this, current file systems generally use a journaling mechanism to guarantee metadata consistency: when an upper-layer application asks the file system to update metadata, all writes corresponding to that update are first assembled into a transaction, and the transaction is then encapsulated into a single write operation that executes atomically. If this single write is successfully persisted to the storage device, the corresponding metadata update is eventually reflected in the file system; if a system failure occurs while it is executing, the update ultimately fails and makes no change to the file system. In short, journaling ensures that a metadata update either succeeds completely or fails without any effect on the file system, and never leaves the file system metadata inconsistent.
In a distributed file system, metadata update operations are submitted by the client and executed by a remote metadata server, and a failure of either the client or the metadata server during this process can cause metadata inconsistency. In this scenario, the client typically first sends the update operation to the metadata server; the metadata server persists an update log to its storage device and then sends a confirmation message to the client, upon which the client can conclude that the update will execute successfully. When the distributed file system has only one metadata server, this mechanism achieves a consistent metadata update after two network interactions. However, to improve scalability, current distributed file systems prefer multiple metadata servers, and the "two-phase commit" protocol is then generally used to keep metadata updates consistent. Two-phase commit is widely used to guarantee the consistency of distributed transactions; in a distributed file system, a client's metadata update request may need to update file metadata on several metadata servers. Concretely, the client sends an update request to each involved metadata server; each server persists the update operation to its storage device as a log and then notifies the client that log persistence is complete; after receiving success messages from all servers, the client decides that the update can be executed on all of them and notifies them to actually perform the update; finally, each metadata server must acknowledge to the client that it received the execute instruction. This process keeps metadata updates consistent across multiple metadata servers, but its performance is hard to guarantee. First, two-phase commit involves several network interactions, which significantly increases update latency. Second, an update may involve several metadata servers, and the whole operation completes only after log persistence finishes on all of them, so a single server responding slowly under heavy load delays the entire update. In short, under distributed metadata management, existing techniques struggle to achieve both consistency and high performance for metadata updates.
[Summary of the Invention]
The technical problem to be solved by the present invention: in view of the above problems in the prior art, provide a distributed metadata management consistency guarantee method, device, system, and application. The present invention can guarantee the consistency of distributed metadata management while improving metadata access performance as much as possible.
To solve the above technical problem, the present invention adopts the following technical solution:
A distributed metadata management consistency guarantee method, whose implementation steps include:
1) Intercept a metadata operation request from a client. If the request is a metadata update operation, jump to step 2); otherwise, if the request is a metadata read operation targeting metadata already marked invalid, jump to step 3);
2) Return an update-success message to the client and sequentially record the metadata update request; mark the old metadata stored on the metadata server holding that metadata as invalid; asynchronously synchronize the sequentially recorded update requests to that metadata server and delete the invalid marks of the synchronized metadata; exit;
3) Return to the client the metadata that has not yet completed asynchronous synchronization; exit.
Preferably, the detailed steps of step 2) include:
2.1) Return an update-success message to the client;
2.2) Encapsulate the metadata update operation into a log and persist the log to the storage device with a single atomic write, so that the update request is now persisted into the metadata cluster with consistency guaranteed;
2.3) Send an invalidation message to the metadata server holding the metadata, marking the old metadata stored there as invalid;
2.4) Periodically synchronize the sequentially recorded update requests asynchronously to the metadata server holding the metadata, and delete the invalid marks of the synchronized metadata; exit.
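The update-side steps 2.1) to 2.4) can be condensed into a small sketch. Persistence is modeled by appending to an in-memory list, network messages become method calls, and all names are illustrative:

```python
# Minimal sketch of steps 2.1)-2.4): ack the client first, append the
# update to a sequential log, mark the server copy invalid, sync later.

class MetadataServerStub:
    def __init__(self):
        self.store = {}      # metadata copies held by the server
        self.stale = set()   # in-memory invalidation marks (step 2.3)

class ConsistencyNode:
    def __init__(self, server):
        self.log = []        # sequential, append-only update log
        self.server = server

    def update(self, filename, meta):
        # 2.2) one atomic, sequential log append "persists" the update;
        # 2.1) the success message to the client is the return value.
        self.log.append((filename, meta))
        # 2.3) a single notification marks the server's old copy invalid.
        self.server.stale.add(filename)
        return "update ok"

    def sync(self):
        # 2.4) periodically replay the log to the metadata server and
        # delete the invalidation marks of the synchronized metadata.
        for filename, meta in self.log:
            self.server.store[filename] = meta
            self.server.stale.discard(filename)
        self.log.clear()

server = MetadataServerStub()
cnode = ConsistencyNode(server)
assert cnode.update("/a", {"mtime": 1}) == "update ok"
assert "/a" in server.stale              # old copy now marked invalid
cnode.sync()                             # later, off the IO critical path
assert server.store["/a"] == {"mtime": 1} and not server.stale
```

The point of the structure is that only the log append sits on the client's critical path; `sync()` can run whenever the metadata server is relatively idle.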
The present invention also provides a distributed metadata management consistency guarantee device, comprising:
an operation request judgment program unit for intercepting metadata operation requests from a client: if the request is a metadata update operation, jump to the update operation handler unit; otherwise, if it is a metadata read operation, jump to the read operation handler unit;
an update operation handler unit for returning an update-success message to the client and sequentially recording the metadata update request, marking the old metadata stored on the metadata server holding that metadata as invalid, asynchronously synchronizing the sequentially recorded update requests to that metadata server, and deleting the invalid marks of the synchronized metadata;
a read operation handler unit for returning to the client the metadata that has not yet completed asynchronous synchronization.
The present invention also provides a distributed metadata management consistency guarantee device comprising a consistency guarantee node formed by at least one computer device, the node being programmed to execute the steps of the aforementioned distributed metadata management consistency guarantee method of the present invention.
The present invention also provides a distributed metadata management consistency guarantee system comprising a client and at least one metadata server, with the aforementioned consistency guarantee node of the present invention connected between the client and the metadata server.
The present invention also provides an application method of the foregoing distributed metadata management consistency guarantee system, whose implementation steps include:
S1) The client determines the type of the metadata operation request it is about to issue: if it is a metadata update operation, jump to step S2); otherwise, if it is a metadata read operation, jump to step S3);
S2) The client selects a consistency guarantee node and sends the metadata update operation to it; after receiving the update-success message from the node, finish and exit;
S3) The client sends the metadata read operation to its target metadata server;
S4) The target metadata server checks whether the target metadata of the read operation carries an invalid mark. If not, it returns the target metadata to the client, which finishes and exits upon receiving it; if the mark is present, the server returns a target-metadata-invalid message, and the client jumps to step S5) upon receiving it;
S5) The client selects a consistency guarantee node and sends it a metadata read operation for the invalidated metadata; after receiving from the node the metadata that has not yet completed asynchronous synchronization, finish and exit.
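The client-side flow S1) to S5) can be reduced to a single routing function. Consistency nodes and the metadata server are reduced to dictionaries, CRC32 stands in for the unspecified hash h(x), and all names are illustrative:

```python
import zlib

# Sketch of the client-side flow S1)-S5). A single metadata server is
# used for simplicity; real systems would also route among servers.

def client_op(op, filename, cnodes, server, meta=None):
    # Node choice follows the preferred h(filename) % N rule.
    cnode = cnodes[zlib.crc32(filename.encode()) % len(cnodes)]
    if op == "update":                     # S1) -> S2)
        cnode["log"][filename] = meta      # persisted on the node
        server["stale"].add(filename)      # invalidation notification
        return "update ok"
    # S1) -> S3): reads go to the target metadata server first.
    if filename in server["stale"]:        # S4): copy marked invalid
        return cnode["log"][filename]      # S5): read the consistency node
    return server["store"].get(filename)   # S4): fresh local copy

cnodes = [{"log": {}}, {"log": {}}]
server = {"store": {"/a": 1}, "stale": set()}
assert client_op("read", "/a", cnodes, server) == 1
assert client_op("update", "/a", cnodes, server, meta=2) == "update ok"
assert client_op("read", "/a", cnodes, server) == 2  # via the node
```

Because the same file name always hashes to the same node, the read in S5) lands on exactly the node that recorded the update in S2).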
Preferably, when the client selects a consistency guarantee node in steps S2) and S5), it selects the corresponding consistency node according to the file name of the metadata update or read operation.
Preferably, selecting the consistency node by file name specifically means applying a hash function h(x) to the file name filename of the metadata update or read operation and selecting the consistency node numbered h(filename)%N, where % is the modulo operator and N is the number of consistency nodes in the metadata cluster.
Compared with the prior art, the present invention has the following advantages:
1. There is only one network interaction on the critical path of a metadata update, keeping update latency low. Under the technical solution of the present invention, no matter how many metadata servers an update involves, the client needs only one network interaction with a single consistency node on the critical path; once that interaction completes, the latest metadata is persisted on the consistency node and the client can turn to other tasks. By contrast, in a traditional file system an update involving multiple metadata servers requires the client to interact with each of them and wait for all interactions to finish before the update counts as complete, so any lagging metadata server increases update latency.
2. There is only one write operation on the critical path of a metadata update, keeping update latency low. Under the technical solution of the present invention, regardless of how many metadata servers are involved, the client issues only one write request to one consistency node: the node encapsulates the update into a log and writes it to the storage device in a single atomic operation, and that write is submitted sequentially, which extracts the best performance from the storage device. By contrast, in a traditional file system the client must submit separate write requests to each involved metadata server, and the update cannot be confirmed as complete until every one of those writes finishes.
3. The present invention separates metadata reads from writes, reducing burst load on the metadata servers to some extent while fully exploiting the performance of solid-state disks. Update operations go first to the consistency nodes and are asynchronously applied to the metadata servers later, while read operations go directly to the metadata servers. When the read load on a metadata server is heavy, applying the latest data from the consistency nodes can be deferred, achieving a degree of read/write separation for metadata. This separation brings at least two advantages: (1) write load is postponed while read load is heavy, lowering the metadata servers' peak load; (2) separating reads and writes on a solid-state disk is what fully exploits its concurrency, so the present invention helps improve the performance of metadata servers that use solid-state disks as storage devices.
[Brief Description of the Drawings]
FIG. 1 is a schematic structural diagram of a metadata cluster in an embodiment of the present invention.
FIG. 2 is a schematic flowchart of the method of an embodiment of the present invention.
FIG. 3 is a schematic diagram of the metadata update process in an embodiment of the present invention.
FIG. 4 is a schematic diagram of the metadata reading process in an embodiment of the present invention.
[Detailed Description of the Embodiments]
A traditional large-scale file system consists of three types of nodes: clients, metadata servers, and data servers. The client initiates all read and write requests; the metadata server stores file metadata and answers the client's metadata requests; the data server stores file data and answers the client's data requests. On top of this basic architecture, this embodiment introduces a new type of node, the consistency node (Consistent Node), which enforces metadata management consistency; the architecture is shown in FIG. 1. FIG. 1 shows a file system organization with 4 clients, 3 metadata servers, and 2 consistency nodes, where "dirty" denotes dirty data; the data servers are omitted because they are not involved in the technology of the present invention. In this architecture, the clients and metadata servers function as in a traditional file system; the consistency nodes and metadata servers together form the metadata cluster, with the consistency nodes mainly used to guarantee metadata consistency at low overhead during metadata updates.
As shown in FIG. 2, the implementation steps of the distributed metadata management consistency guarantee method of this embodiment include:
1) Intercept a metadata operation request from a client. If the request is a metadata update operation, jump to step 2); otherwise, if the request is a metadata read operation targeting metadata already marked invalid, jump to step 3);
2) Return an update-success message to the client and sequentially record the metadata update request; mark the old metadata stored on the metadata server holding that metadata as invalid; asynchronously synchronize the sequentially recorded update requests to that metadata server and delete the invalid marks of the synchronized metadata; exit;
3) Return to the client the metadata that has not yet completed asynchronous synchronization; exit. Note that in this embodiment a client's metadata read operation goes to the metadata server first; only after the metadata server confirms that the metadata is marked invalid does the client resend the read operation to a consistency node.
The distributed metadata management consistency guarantee method of this embodiment first proposes the above large-scale file system architecture with consistency nodes, and on that basis designs a new set of metadata read and write processes that balance consistency with high performance.
As shown in FIG. 3, the detailed steps of step 2) include:
2.1) Return an update-success message to the client;
2.2) Encapsulate the metadata update operation into a log and persist the log to the storage device with a single atomic write, so that the update request is now persisted into the metadata cluster with consistency guaranteed;
2.3) Send an invalidation message to the metadata server holding the metadata, marking the old metadata stored there as invalid (this process requires no reads or writes to the storage device);
2.4) Periodically synchronize the sequentially recorded update requests asynchronously to the metadata server holding the metadata, and delete the invalid marks of the synchronized metadata; exit.
Correspondingly, this embodiment also provides a distributed metadata management consistency guarantee device, comprising:
an operation request judgment program unit for intercepting metadata operation requests from a client: if the request is a metadata update operation, jump to the update operation handler unit; otherwise, if it is a metadata read operation, jump to the read operation handler unit;
an update operation handler unit for returning an update-success message to the client and sequentially recording the metadata update request, marking the old metadata stored on the metadata server holding that metadata as invalid, asynchronously synchronizing the sequentially recorded update requests to that metadata server, and deleting the invalid marks of the synchronized metadata;
a read operation handler unit for returning to the client the metadata that has not yet completed asynchronous synchronization.
As shown in FIG. 2, this embodiment also provides a distributed metadata management consistency guarantee device comprising a consistency guarantee node formed by at least one computer device, the node being programmed to execute the steps of the foregoing distributed metadata management consistency guarantee method of this embodiment.
As shown in FIG. 3 and FIG. 4, this embodiment further provides an application method of the foregoing distributed metadata management consistency guarantee system, whose implementation steps include:
S1) The client determines the type of the metadata operation request it is about to issue: if it is a metadata update operation, jump to step S2); otherwise, if it is a metadata read operation, jump to step S3);
S2) The client selects a consistency guarantee node and sends the metadata update operation to it; after receiving the update-success message from the node, finish and exit;
S3) The client sends the metadata read operation to its target metadata server;
S4) The target metadata server checks whether the target metadata of the read operation carries an invalid mark. If not, it returns the target metadata to the client, which finishes and exits upon receiving it; if the mark is present, the server returns a target-metadata-invalid message, and the client jumps to step S5) upon receiving it;
S5) The client selects a consistency guarantee node and sends it a metadata read operation for the invalidated metadata; after receiving from the node the metadata that has not yet completed asynchronous synchronization, finish and exit.
In this embodiment, when the client selects a consistency guarantee node in steps S2) and S5), it selects the corresponding consistency node according to the file name of the metadata update or read operation; this achieves load balancing across multiple consistency nodes.
In this embodiment, selecting the consistency node by file name specifically means applying a hash function h(x) to the file name filename of the metadata update or read operation and selecting the consistency node numbered h(filename)%N, where % is the modulo operator and N is the number of consistency nodes in the metadata cluster. The advantage of this scheme is that when the client reads the metadata cluster and discovers that the requested metadata is on a consistency node but not yet synchronized to a metadata server, it can likewise hash the file name to locate the consistency node holding the most recent update and fetch the latest metadata. Locating the latest metadata purely by file name incurs no extra storage or IO overhead.
In this embodiment, the client application issues a metadata update request through a system call, which is embedded into the client of the distributed file system designed by the present invention through the virtual file system.
In the distributed file system of this embodiment, which contains consistency nodes, a client issuing a metadata update request first sends the update log to a consistency node (by contrast, in a traditional distributed file system the client sends the update request directly to the metadata server). The consistency node quickly persists the log to its local storage device, and as soon as persistence succeeds it returns an update-success confirmation to the client. Two properties of this update process give it high performance. First, no matter how many metadata servers the update involves, the client interacts with only one consistency node, and only once, so network latency drops significantly. Second, persistence on the consistency node is a sequential log write, and sequential writes perform well on all storage devices, further reducing persistence latency. At this point the client's update is reflected only on the consistency node and has not been committed to the metadata servers; from the perspective of the whole metadata cluster, however, the update has been recorded with consistency and durability guaranteed, and the remaining work is to synchronize data between the consistency nodes and the metadata servers inside the cluster. Since the consistency node has told the client that the updated metadata is persisted in the metadata cluster, the client need not wait for it to be synchronized to the metadata servers and can turn directly to other tasks. The synchronization from consistency node to metadata server is therefore off the IO critical path and can be performed entirely asynchronously. This embodiment chooses to commit the update logs on the consistency nodes to the metadata servers when the servers' load is relatively light. Because this synchronization is asynchronous, the metadata servers' state lags slightly behind the consistency nodes', and during an unsynchronized period a client reading a metadata server cannot obtain the latest metadata. To reduce this negative effect, in this embodiment the consistency node, immediately after persisting an update log, notifies the metadata server that the update has been persisted on the consistency node but will be synchronized later. This notification completes in a single network interaction, and the metadata server only needs to record the information in memory upon receiving it, without reading or writing its storage device, so the overhead is small. Since the sequentially recorded update requests are synchronized asynchronously with the metadata server holding the metadata, they can be pushed to the server when it is relatively idle.
The metadata cluster designed in this embodiment contains consistency nodes and metadata servers, and data between the two types of nodes may be out of sync; but this can happen only for the most recently updated metadata, and the latest versions of most metadata still reside on the metadata servers. A client issuing a metadata read request therefore still interacts with the metadata server first. The metadata server's memory holds marks indicating which metadata has been persisted to a consistency node but not yet synchronized to the server. If the requested metadata falls into this category, the metadata server actively fetches the latest version from the consistency node, returns it to the client, and at the same time updates its own copy to the latest state; if the metadata the server holds is already the latest version (i.e., no in-memory mark says the latest version is on a consistency node), it returns that version to the client directly.
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiment; all technical solutions under the idea of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

  1. A distributed metadata management consistency guarantee method, characterized in that its implementation steps comprise:
    1) intercepting a metadata operation request from a client: if the request is a metadata update operation, jumping to step 2); otherwise, if the request is a metadata read operation targeting metadata already marked invalid, jumping to step 3);
    2) returning an update-success message to the client and sequentially recording the metadata update request; marking the old metadata stored on the metadata server holding that metadata as invalid; asynchronously synchronizing the sequentially recorded update requests to that metadata server and deleting the invalid marks of the synchronized metadata; exiting;
    3) returning to the client the metadata that has not yet completed asynchronous synchronization; exiting.
  2. The distributed metadata management consistency guarantee method according to claim 1, characterized in that the detailed steps of step 2) comprise:
    2.1) returning an update-success message to the client;
    2.2) encapsulating the metadata update operation into a log and persisting the log to the storage device with a single atomic write, so that the update request is now persisted into the metadata cluster with consistency guaranteed;
    2.3) sending an invalidation message to the metadata server holding the metadata, marking the old metadata stored there as invalid;
    2.4) periodically synchronizing the sequentially recorded update requests asynchronously to the metadata server holding the metadata and deleting the invalid marks of the synchronized metadata; exiting.
  3. A distributed metadata management consistency guarantee device, characterized by comprising:
    an operation request judgment program unit for intercepting metadata operation requests from a client: if the request is a metadata update operation, jumping to the update operation handler unit; otherwise, if it is a metadata read operation, jumping to the read operation handler unit;
    an update operation handler unit for returning an update-success message to the client and sequentially recording the metadata update request, marking the old metadata stored on the metadata server holding that metadata as invalid, asynchronously synchronizing the sequentially recorded update requests to that metadata server, and deleting the invalid marks of the synchronized metadata;
    a read operation handler unit for returning to the client the metadata that has not yet completed asynchronous synchronization.
  4. A distributed metadata management consistency guarantee device comprising a consistency guarantee node formed by at least one computer device, characterized in that the consistency guarantee node is programmed to perform the steps of the distributed metadata management consistency guarantee method according to claim 1 or 2.
  5. A distributed metadata management consistency guarantee system comprising a client and at least one metadata server, characterized in that the consistency guarantee node according to claim 4 is connected between the client and the metadata server.
  6. An application method of the distributed metadata management consistency guarantee system according to claim 5, characterized in that its implementation steps comprise:
    S1) the client determines the type of the metadata operation request it is about to issue: if it is a metadata update operation, jumping to step S2); otherwise, if it is a metadata read operation, jumping to step S3);
    S2) the client selects a consistency guarantee node and sends the metadata update operation to it; after receiving the update-success message from the node, finishing and exiting;
    S3) the client sends the metadata read operation to its target metadata server;
    S4) the target metadata server checks whether the target metadata of the read operation carries an invalid mark: if not, it returns the target metadata to the client, which finishes and exits upon receiving it; if the mark is present, the server returns a target-metadata-invalid message, and the client jumps to step S5) upon receiving it;
    S5) the client selects a consistency guarantee node and sends it a metadata read operation for the invalidated metadata; after receiving from the node the metadata that has not yet completed asynchronous synchronization, finishing and exiting.
  7. The application method of the distributed metadata management consistency guarantee system according to claim 6, characterized in that when the client selects a consistency guarantee node in steps S2) and S5), it selects the corresponding consistency node according to the file name of the metadata update operation or metadata read operation.
  8. The application method according to claim 7, characterized in that selecting the corresponding consistency node according to the file name of the metadata update or read operation specifically means applying a hash function h(x) to the file name filename of the metadata update or read operation and selecting the consistency node numbered h(filename)%N, where % is the modulo operator and N is the number of consistency nodes in the metadata cluster.
PCT/CN2019/078843 2018-12-13 2019-03-20 Distributed metadata management consistency assurance method, device, system and application WO2020118955A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/274,434 US20220050809A1 (en) 2018-12-13 2019-03-20 Distributed metadata management consistency assurance method, device, system and application

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811525115.7A CN109582686B (zh) 2018-12-13 2018-12-13 Distributed metadata management consistency assurance method, device, system and application
CN201811525115.7 2018-12-13

Publications (1)

Publication Number Publication Date
WO2020118955A1 true WO2020118955A1 (zh) 2020-06-18

Family

ID=65928446

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/078843 WO2020118955A1 (zh) 2018-12-13 2019-03-20 Distributed metadata management consistency assurance method, device, system and application

Country Status (3)

Country Link
US (1) US20220050809A1 (zh)
CN (1) CN109582686B (zh)
WO (1) WO2020118955A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553308A (zh) * 2021-06-21 2021-10-26 深圳市证通云计算有限公司 一种基于CephFS分布式文件系统的多集群多活实现方法及系统

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347481A (zh) * 2019-07-17 2019-10-18 北京搜狐新媒体信息技术有限公司 一种实现分布式事务的方法及系统
CN112799835A (zh) * 2021-01-27 2021-05-14 浪潮云信息技术股份公司 一种分布式数据库系统元数据的处理方法及系统
CN113687921A (zh) * 2021-10-25 2021-11-23 北京金山云网络技术有限公司 事务处理方法、装置、分布式数据库系统及电子设备
CN115297130B (zh) * 2022-10-09 2022-12-09 四川蜀天信息技术有限公司 一种基于区块链的分布式存储网络数据同步方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101188566A (zh) * 2007-12-13 2008-05-28 沈阳东软软件股份有限公司 一种集群环境下数据缓存同步的方法及系统
CN103152398A (zh) * 2013-02-07 2013-06-12 中国科学院计算技术研究所 一种多终端文件间的文件同步方法及系统
US20130291066A1 (en) * 2012-04-25 2013-10-31 Hitachi, Ltd. Method and apparatus to keep consistency of acls among a meta data server and data servers
CN106021468A (zh) * 2016-05-17 2016-10-12 上海携程商务有限公司 分布式缓存和本地缓存的更新方法和系统

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130218934A1 (en) * 2012-02-17 2013-08-22 Hitachi, Ltd. Method for directory entries split and merge in distributed file system
CN104113597B (zh) * 2014-07-18 2016-06-08 西安交通大学 一种多数据中心的hdfs数据读写方法
CN104484470B (zh) * 2014-12-31 2018-06-08 天津南大通用数据技术股份有限公司 一种数据库集群元数据管理方法
CN106034160B (zh) * 2015-03-19 2019-06-11 阿里巴巴集团控股有限公司 分布式计算系统和方法
CN105187517A (zh) * 2015-08-20 2015-12-23 浪潮(北京)电子信息产业有限公司 一种实现元数据集群的方法及元数据集群
CN107786650B (zh) * 2017-10-20 2021-06-18 郑州云海信息技术有限公司 一种存储介质和管理策略的同步方法、装置及系统
CN108848180A (zh) * 2018-06-27 2018-11-20 郑州云海信息技术有限公司 一种元数据同步方法、装置、设备及可读存储介质


Also Published As

Publication number Publication date
CN109582686B (zh) 2021-01-15
US20220050809A1 (en) 2022-02-17
CN109582686A (zh) 2019-04-05

Similar Documents

Publication Publication Date Title
WO2020118955A1 (zh) Distributed metadata management consistency assurance method, device, system and application
US7653668B1 (en) Fault tolerant multi-stage data replication with relaxed coherency guarantees
US20230012697A1 (en) Methods, devices and systems for maintaining consistency of metadata and data across data centers
Borthakur The hadoop distributed file system: Architecture and design
Borthakur HDFS architecture guide
US8793531B2 (en) Recovery and replication of a flash memory-based object store
JP6404907B2 (ja) 効率的な読み取り用レプリカ
JP5714571B2 (ja) キャッシュクラスタを構成可能モードで用いるキャッシュデータ処理
US7487311B2 (en) System and method for asynchronous backup of virtual disks in a distributed storage array
JP2019036353A (ja) 索引更新パイプライン
US7440977B2 (en) Recovery method using extendible hashing-based cluster logs in shared-nothing spatial database cluster
Borthakur HDFS architecture
US20110283045A1 (en) Event processing in a flash memory-based object store
US20120005154A1 (en) Efficient recovery of transactional data stores
JP6225262B2 (ja) 分散データグリッドにおいてデータを同期させるためにパーティションレベルジャーナリングをサポートするためのシステムおよび方法
CN103516549B (zh) 一种基于共享对象存储的文件系统元数据日志机制
JP2016524750A5 (zh)
KR20150130496A (ko) 분산 데이터베이스 시스템들을 위한 고속 장애 복구
CN112052230B (zh) 多机房数据同步方法、计算设备及存储介质
CN110807039A (zh) 一种云计算环境下数据一致性维护系统及方法
US20230205638A1 (en) Active-active storage system and data processing method thereof
CN105323271B (zh) 一种云计算系统以及云计算系统的处理方法和装置
CN104991739A (zh) 元数据服务器失效接替中精确一次执行语义的方法及系统
CN115658245B (zh) 一种基于分布式数据库系统的事务提交系统、方法及装置
WO2024051027A1 (zh) 一种大数据的数据配置方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19896756

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19896756

Country of ref document: EP

Kind code of ref document: A1