WO2018058949A1 - Data storage method, device and system - Google Patents

Data storage method, device and system

Info

Publication number
WO2018058949A1
Authority
WO
WIPO (PCT)
Prior art keywords
subtree
node
path
data node
data
Prior art date
Application number
PCT/CN2017/082141
Other languages
French (fr)
Chinese (zh)
Inventor
任永强
谢晓芹
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2018058949A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a data storage method, apparatus, and system.
  • a distributed file system is a file system network formed by extending a file system fixed to one node to any number of nodes or file systems, which are connected through a plurality of nodes.
  • the system usually includes the following three types of nodes: Protocol Server, Meta Data Server (MDS), and Data Server (DS).
  • the distributed file system has various ways of dividing metadata services, such as a dynamic subtree based approach. In the dynamic subtree based approach, each node maintains one or more subtrees and is responsible for a portion of the metadata service.
  • the receiver (the node that takes over) collects the subtree path information and the affiliation relationships from all the nodes in the system, and deduplicates the collected subtree information to regenerate the subtree information of the faulty MDS, that is, to reconstruct the subtrees of the faulty MDS.
  • the receiver needs to send a request to every MDS in the system. In a large cluster, the receiver therefore sends and receives a large number of messages during takeover, which easily creates a single-point bottleneck and results in a long takeover time.
  • the method requires that the sub-tree information of each MDS has a high consistency. If the sub-tree information of each MDS is inconsistent, the receiver's reconstruction algorithm is likely to be complicated.
  • the embodiment of the invention provides a data storage method, device and system, which can solve the problem that subtree takeover takes a long time when a data node fails, and can reduce the consistency requirement on the subtree information of the data nodes.
  • an embodiment of the present invention provides a data storage method, including:
  • determining a subtree path in the distributed file system; obtaining node attribution information on the subtree path; and storing the node attribution information on the subtree path.
  • the subtree path may include a root node (a subtree root node, referred to as a "subtree root") of a subtree in each data node in the distributed file system, and the subtree nodes on the path from the root node of the subtree of each data node to the root node of the system.
  • the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree. Therefore, operations such as subtree migration or subtree takeover can be performed based on the stored node attribution information, thereby reducing consistency requirements for subtree information in the system.
  • the storing of the node attribution information on the subtree path may specifically be: generating a subtree path attribution table including the node attribution information on the subtree path, and storing the subtree path attribution table. For example, the subtree path attribution table can be written to disk for persistent storage.
  • connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path, and an inode number of a previous subtree node of each subtree node.
  • the subtree path attribution table may further include subtree root node information and the like on the subtree path.
  • the index node number of the target subtree may be obtained; an index node number identical to the index node number of the target subtree is looked up in the subtree path attribution table; and the home data node corresponding to the found index node number in the subtree path attribution table is updated to the second data node.
  • the index node number of the target subtree may specifically refer to the target subtree.
  • a subtree takeover request indicating that the third data node is faulty may also be received; in response to the subtree takeover request, the home data node of each subtree belonging to the third data node in the subtree path attribution table is updated to a fourth data node; and the subtree information that belonged to the third data node in the subtree path attribution table before the update is returned to the fourth data node, so that the fourth data node performs subtree reconstruction in the data cache of the fourth data node.
  • the subtree takeover can be performed based on the stored node attribution information, thereby solving the problem that subtree takeover takes a long time when a data node such as an MDS fails.
  • the embodiment of the present invention further provides a data storage system, including: a first data node and a central node; wherein
  • a first data node configured to provide a data service for a subtree of the first data node
  • a central node configured to determine a subtree path in the distributed file system where the first data node is located; acquire node attribution information on the subtree path, and store node attribution information on the subtree path.
  • the subtree path may include a root node of a subtree in each data node in the distributed file system and a subtree node on a path of a root node of the subtree of each data node to a root node of the system.
  • the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree.
  • the central node may be any data node in the distributed file system, such as an MDS (that is, an additional identity is superimposed on the data node, which still provides a metadata service), or may be a node that is separately set in the system, which is not limited in the embodiment of the present invention.
  • the central node may be further configured to generate a subtree path attribution table including node attribution information on the subtree path, and store the subtree path attribution table.
  • connection information between the subtrees on the subtree path may include an inode number of each subtree node on the subtree path and an inode number of a previous subtree node of each subtree node.
  • the subtree path attribution table may further include subtree root node information on the subtree path, a subtree node name associated with the inode number, and the like.
  • system may further include: a second data node; wherein
  • the first data node may be further configured to send a migration notification message to the central node, where the migration notification message indicates an inode number of the target subtree to be migrated and a second data node to be migrated to;
  • the central node is further configured to: find, from the subtree path attribution table, an index node number that is the same as the index node number of the target subtree; and update the home data node corresponding to the found index node number in the subtree path attribution table to the second data node.
  • the first data node is a data node that needs to perform subtree migration
  • the second data node is the determined data node to which the subtree is to be migrated.
  • the second data node to be migrated to may be selected according to a preset rule, for example the data node with the lowest heat, or may be selected randomly, which is not limited in the embodiment of the present invention.
  • subtree migration can be implemented based on the stored node attribution information.
  • the second data node may be further configured to construct a data cache, and update a home data node of the target subtree in the cache to the second data node;
  • the first data node is further configured to update a home data node of the target subtree in a data cache of the first data node to the second data node.
  • the second data node may also construct a data cache. After the subtree path attribution table is updated, the second data node may change the home data node of the target subtree in its cache to itself, and the first data node may change the home data node of the target subtree in its cache to the second data node, so that subtree information can be extracted quickly.
  • the system may include: a third data node, which is a failed data node, and a fourth data node, which is the data node determined to take over the subtree of the third data node;
  • the fourth data node is configured to send, to the central node, a subtree takeover request for indicating that the third data node is faulty;
  • the central node is configured to receive the subtree takeover request, update the home data node of the subtree belonging to the third data node in the subtree path attribution table to a fourth data node, and return to the fourth data node Subtree information attributed to the third data node in the subtree path attribution table before the update;
  • the fourth data node is further configured to perform subtree reconstruction in a data cache of the fourth data node based on the subtree information of the third data node.
  • the central node, the first data node, the second data node, the third data node, the fourth data node, and the like may be the same data node or different nodes, which is not limited in the embodiment of the present invention. Therefore, subtree migration and subtree takeover can be performed based on the stored node attribution information, thereby solving the problem that subtree takeover takes a long time when a data node such as an MDS fails, and reducing the consistency requirement on the subtree information in the system.
  • the embodiment of the present invention further provides a data storage device, including: a path determining module, a first obtaining module, and a storage module, wherein the data storage device can implement some or all of the steps of the data storage method of the first aspect by using the foregoing modules.
  • an embodiment of the present invention further provides a computer storage medium, where the computer storage medium stores a program, and the program includes some or all of the steps of the data storage method of the first aspect.
  • an embodiment of the present invention further provides a data server, including: a memory and a processor, where the memory is connected to the processor;
  • the memory is used to store driver software
  • the processor is configured to read the driver software from the memory and execute the function of the driver software:
  • obtaining node attribution information on the subtree path, where the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree;
  • storing the node attribution information on the subtree path.
  • the processor is further configured to read the driver software from the memory and perform some or all of the steps of the data storage method of the first aspect described above by the driver software.
  • storing the node attribution information allows migration between subtrees, and subtree takeover on failure, to be performed based on the stored node attribution information, thereby solving the problem that subtree takeover takes a long time when a data node fails, and reducing the consistency requirement on the subtree information of the data nodes.
  • FIG. 1 is a structural diagram of a distributed file system according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a data storage method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a subtree grouping according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a subtree path according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of interaction of a data storage method according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a subtree migration according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of another subtree path according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of interaction of another data storage method according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of another subtree path according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a data storage device according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a data storage system according to an embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a data server according to an embodiment of the present invention.
  • a distributed file system usually includes three types of nodes: a protocol server (Protocol Server), an MDS, and a DS. A plurality of nodes are connected to form a cluster.
  • the protocol server is responsible for providing the user with a standard Network File System (NFS) and SMB (Server Message Block) network file service function.
  • MDS is responsible for providing metadata related services such as parsing paths, finding files, and creating files.
  • the DS is responsible for storing the data of the file.
  • the user program can access the file system through a protocol client.
  • the protocol client and the protocol server are connected through a front-end network (FE), and each node in the file system is connected through a back-end network (BE).
  • the front-end network is used for request and data interaction between the user service and the distributed file system
  • the back-end network is used for request and data interaction between the various node devices in the distributed file system.
  • the three types of nodes may be logical nodes, and may be deployed on the same physical device or may be deployed separately, which is not limited in the embodiment of the present invention.
  • the entire system's metadata service can be distributed to each MDS (or DS), and each MDS (or DS) is responsible for a part of the metadata service.
  • the system can continue to provide metadata services when one or several MDSs (or DSs) fail, thereby improving system reliability.
  • the embodiment of the present invention is described by taking a data node as an MDS as an example.
  • the distributed file system can divide the metadata service among the MDSs in various ways, for example based on dynamic subtrees.
  • the system can count the heat of each file, directory (or directory fragment) and directory tree in the cache according to the access type of the service (such as getattr, setattr, readdir, etc.) and the access frequency. Different access types accumulate different amounts of heat, and the higher the access frequency, the higher the heat. As the heat increases, the entire tree can be divided into multiple subtrees that are migrated to different MDSs. Each MDS is responsible for providing the metadata service for the subtrees belonging to it.
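  • The heat accounting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `HeatCounter` class, the weight values in `HEAT_WEIGHTS`, and the choice to propagate heat to every ancestor directory are all assumptions made for the example; the source only states that different access types accumulate different heat and that more frequent access means higher heat.

```python
from collections import defaultdict

# Hypothetical per-access-type weights (assumption: the source gives no values).
HEAT_WEIGHTS = {"getattr": 1.0, "setattr": 2.0, "readdir": 3.0}


class HeatCounter:
    """Accumulates heat per path; each access adds the weight of its type."""

    def __init__(self):
        self.heat = defaultdict(float)

    def record(self, path, access_type):
        # Assumption: an access heats the entry itself and all its ancestors,
        # so a directory's heat reflects the whole tree below it.
        weight = HEAT_WEIGHTS.get(access_type, 1.0)
        self.heat["/"] += weight
        current = ""
        for part in (p for p in path.split("/") if p):
            current += "/" + part
            self.heat[current] += weight


counter = HeatCounter()
counter.record("/usr/share/vim", "readdir")
counter.record("/usr/share", "getattr")
```

  • With such counters, a directory whose accumulated heat grows past some threshold becomes a candidate for being split off as a subtree and migrated to another MDS.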
  • each MDS only needs to cache the metadata related to itself, including: the part of the subtree metadata belonging to itself (used to provide the metadata service); the boundary points between its subtree and subordinate subtrees (so that, when access proceeds downward, it knows from which point the metadata belongs to another MDS); and the path between the root of its subtree and the root directory (so that the full path can still be parsed when a node fails).
  • the subtrees can move freely between MDSs and can be split into smaller subtrees or merged into large subtrees.
  • the division and attribution of the subtrees are not static but are adjusted dynamically according to service access (that is, the heat, which reflects the load of the nodes).
  • the present application introduces a central node to centrally manage subtree information in the entire system through the central node.
  • the central node may be any MDS (or DS) node in the system (that is, an additional identity is superimposed on the MDS, which still provides the metadata service as an MDS), or may be an additionally set node, which is not limited in the embodiment of the present invention. Therefore, the problems of a large single-point message volume and a long takeover time during subtree takeover in a large cluster are solved, and the takeover logic is also simplified.
  • the embodiment of the invention discloses a data storage method, device, data server and system, which can solve the problem that subtree takeover takes a long time when a data node fails, and can reduce the consistency requirement on the subtree information of the data nodes. The details are explained below.
  • FIG. 2 is a schematic flowchart of a data storage method according to an embodiment of the present invention. Specifically, the method of the embodiment of the present invention may be specifically applied to a central node such as an MDS or a DS. As shown in FIG. 2, the method of the embodiment of the present invention may include the following steps:
  • the subtree path of a subtree may refer to the path from the root of the subtree to the root of the system.
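  • the subtree path of each subtree in the distributed file system (hereinafter referred to as the "system") may be determined first; that is, the subtree paths in the system may include the root node of the subtree in each data node in the system and the subtree nodes on the path from that root node to the root node of the system.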
  • the connection information between the subtrees on the subtree path may include information such as an inode number of each subtree node and an inode number of a previous subtree node of each subtree node.
  • the index node number is identification information of the subtree node, and each index node number uniquely identifies a subtree node (such as a directory, a file, etc.) in the distributed file system.
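  • As an illustration of the definitions above, a subtree path can be obtained by following parent inode numbers from a subtree root up to the system root. The sketch below assumes a plain dictionary mapping each inode number to its parent's inode number; the `PARENT` values and the function name `subtree_path` are hypothetical and chosen only for the example.

```python
# Hypothetical mapping: inode number -> inode number of the previous (parent)
# subtree node. The system root has no parent, represented here as None.
PARENT = {1: None, 3: 1, 5: 3, 6: 5, 7: 6}


def subtree_path(subtree_root_ino, parent):
    """Return the inode numbers from a subtree root up to the system root."""
    path = []
    ino = subtree_root_ino
    while ino is not None:
        path.append(ino)
        ino = parent[ino]
    return path
```

  • For example, `subtree_path(7, PARENT)` walks 7, 6, 5, 3, 1, which is the kind of chain the central node records for every subtree root.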
  • the storing of the node attribution information on the subtree path may specifically be: generating a subtree path attribution table including the node attribution information on the subtree path, and storing the subtree path attribution table. That is, the subtree path attribution table includes the subtree paths from all the subtrees in the system to the system root node.
  • the subtree path attribution table may further include subtree root node information on the subtree path, the name of the subtree node associated with the inode number, and so on.
  • FIG. 3 is a schematic diagram of a subtree grouping according to an embodiment of the present invention.
  • the distributed file system includes three data nodes, namely, MDS1, MDS2, and MDS3, and is divided into four subtrees, including subtrees A, B, C, and D.
  • the subtree nodes (/, var, lib, etc, usr, lib, lib.so.6) in subtree A belong to MDS1, subtree nodes in subtree B (log, messages, news, news.
  • the subtree nodes in subtree D (such as vim72, plugin) belong to MDS2, and the subtree nodes (share, bash, helpfiles, kill, cd, vim, site) in subtree C belong to MDS3;
  • / is the root node in the distributed file system, that is, the system root directory;
  • the subtree nodes on the subtree path in the distributed file system are /, var, usr, log, share, vim, vim72, which are assumed to be recorded as ino1, ino2, ino3, ino4, ino5, ino6, ino7 respectively (other forms of identification are of course possible; this is only an example),
  • each subtree node on the subtree path is associated with its corresponding identification information ino (for example, "/" is associated with ino1, "var" with ino2, "usr" with ino3, "log" with ino4, "share" with ino5, "vim" with ino6, and "vim72" with ino7).
  • ino is the subtree node on the subtree path
  • dirino is the previous subtree node (parent node) of the subtree node
  • subtree flag indicates the subtree root node (referred to as "subtree root") information: "1" indicates that the subtree node is the subtree root node of a data node, and "0" indicates that the subtree node is not the subtree root node of a data node; auth represents the home data node of the subtree node.
  • the subtree flag can be used to indicate whether a node on the path is a subtree root (that is, a boundary at which the attribution of the upper and lower nodes changes).
  • the subtree root information may not be stored in the subtree path attribution table.
  • the subtree root information can also be deduced from the attribution information, that is, determined according to ino, dirino, and auth.
  • for example, the home data node of ino3 is MDS1
  • and the home data node of ino4 is MDS2. Because the attribution changes at ino4, it can be determined that a subtree starts from ino4 and that ino4 is the subtree root of that subtree.
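  • The derivation described above can be sketched as follows: a node is treated as a subtree root when its home data node differs from that of its parent. The `Entry` record, the function name, and the sample rows are assumptions for illustration. The names and owners follow the FIG. 3 description (/, usr on MDS1; share, vim on MDS3; vim72 on MDS2), but the parent links are inferred from an ordinary /usr/share/vim/vim72 layout and are not taken from the patent's Table 1. Treating the system root as a subtree root is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    ino: int                # inode number of the subtree node
    dirino: Optional[int]   # inode number of the parent node (None for "/")
    auth: str               # home data node


def derive_subtree_roots(entries):
    """Return the inode numbers at which the home data node changes."""
    by_ino = {e.ino: e for e in entries}
    roots = set()
    for e in entries:
        parent = by_ino.get(e.dirino)
        # No parent (system root) or a different owner marks a boundary.
        if parent is None or parent.auth != e.auth:
            roots.add(e.ino)
    return roots


ENTRIES = [
    Entry(1, None, "MDS1"),  # "/"
    Entry(3, 1, "MDS1"),     # "usr"
    Entry(5, 3, "MDS3"),     # "share"
    Entry(6, 5, "MDS3"),     # "vim"
    Entry(7, 6, "MDS2"),     # "vim72"
]
```

  • Storing the flag explicitly trades a little space for not having to recompute it; deriving it keeps the table smaller and avoids a field that could drift out of sync with auth.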
  • each ino recorded in the attribution table is associated with one of /, var, usr, log, share, vim, vim72. The ino is part of the attributes of a file or directory and is unique in the system, which makes it equivalent to the ID of the file, directory, etc. Therefore, the ino cannot be replaced by the name of the subtree node in the attribution table, because there may be multiple files (or directories) with the same name in the system. If there are multiple directories named "var", for example, the name cannot uniquely determine the subtree node.
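  • A small illustration of why the name cannot serve as the key: in the sketch below two hypothetical directories are both named "var" (the second one, ino 8 under ino 5, is invented for the example), so an index keyed by name silently loses one of them while an index keyed by ino keeps both.

```python
# Two hypothetical rows that share the name "var" but have different inodes.
rows = [
    {"ino": 2, "dirino": 1, "name": "var"},
    {"ino": 8, "dirino": 5, "name": "var"},
]

by_name = {row["name"]: row for row in rows}  # later row overwrites the first
by_ino = {row["ino"]: row for row in rows}    # every row keeps its own key
```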
  • the subtree path attribution table may be written to the disk for persistent storage. Further, the content in the subtree path attribution table may be cached in the form of a tree in the memory to quickly access the subtree path and obtain the subtree node information.
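  • One way to picture the persistence and the in-memory tree cache is sketched below. The JSON file format, the write-then-rename step, and the function names are assumptions made for the example; the source only says that the table is written to disk and that its content may be cached as a tree in memory.

```python
import json
import os
import tempfile
from collections import defaultdict


def save_table(rows, path):
    """Persist the attribution table; write a temp file, then rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(rows, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)     # swap the new file in place of the old one


def load_table(path):
    """Read the persisted attribution table back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_tree(rows):
    """Cache the table as a tree: parent inode -> list of child inodes."""
    children = defaultdict(list)
    for row in rows:
        children[row["dirino"]].append(row["ino"])
    return dict(children)
```

  • A node that becomes the new central node could call `load_table` and `build_tree` to rebuild the cached view without contacting the other MDSs.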
  • the subtree migration may be performed based on the stored node attribution information.
  • the central node may acquire the index node number of the target subtree; find, from the subtree path attribution table, an index node number that is the same as the index node number of the target subtree; and update the home data node corresponding to the found index node number in the subtree path attribution table to the second data node.
  • the subtree migration process may be triggered when the node is too hot, or when the heat distribution of the system needs to be adjusted, and so on. The second data node to be migrated to may be any data node in the system; it may be selected according to a preset rule, for example the data node with the lowest heat, or may be selected randomly, which is not limited in the embodiment of the present invention.
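  • The central node's part of a migration, as described above, amounts to a lookup by inode number followed by an update of the home data node. A minimal sketch, assuming the table is a list of dictionaries with hypothetical keys "ino" and "auth":

```python
def migrate_subtree(table, target_ino, new_owner):
    """Set the home data node of the row whose ino matches the target subtree.

    Returns True if a matching row was found and updated, False otherwise.
    """
    for row in table:
        if row["ino"] == target_ino:
            row["auth"] = new_owner
            return True
    return False


table = [
    {"ino": 6, "auth": "MDS3"},
    {"ino": 7, "auth": "MDS2"},
]
found = migrate_subtree(table, 7, "MDS1")
```

  • In a real system the updated table would then be persisted again; that step is omitted here.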
  • the subtree takeover may be performed based on the stored node attribution information. Specifically: a subtree takeover request indicating that the third data node is faulty is received; in response to the subtree takeover request, the home data node of each subtree belonging to the third data node in the subtree path attribution table is updated to a fourth data node; and the subtree information that belonged to the third data node in the subtree path attribution table before the update is returned to the fourth data node, so that the fourth data node performs subtree reconstruction in its cache.
  • the third data node is the faulty node
  • the fourth data node is the determined data node for taking over the subtree of the third data node.
  • the fourth data node of the takeover may be selected according to a preset rule, such as a data node with the lowest heat, or may be randomly selected, which is not limited in the embodiment of the present invention.
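  • The takeover handling on the central node can be sketched in the same style: collect the rows owned by the failed node as they were before the update, reassign them, and return the collected rows to the node that takes over. The row layout and the function name are assumptions for illustration.

```python
import copy


def take_over(table, failed, successor):
    """Reassign every row owned by `failed` and return the pre-update rows."""
    previous = [copy.copy(row) for row in table if row["auth"] == failed]
    for row in table:
        if row["auth"] == failed:
            row["auth"] = successor
    return previous


table = [
    {"ino": 3, "auth": "MDS1"},
    {"ino": 4, "auth": "MDS2"},
    {"ino": 7, "auth": "MDS2"},
]
returned = take_over(table, "MDS2", "MDS1")
```

  • Because only the central node's table is consulted, the successor receives everything it needs from a single reply rather than from every MDS in the cluster.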
  • storing the node attribution information allows migration between subtrees, and subtree takeover on failure, to be performed based on the stored node attribution information, thereby solving the problem that subtree takeover takes a long time when a data node such as an MDS fails, and reducing the consistency requirement on the subtree information of the MDSs.
  • FIG. 5 is a schematic diagram of interaction of a data storage method according to an embodiment of the present invention.
  • the embodiment of the present invention is applied to a subtree migration scenario between data nodes, that is, the target subtree belonging to the first data node needs to be migrated to the second data node.
  • the subtree D (ie, the target subtree) of the MDS 2 (ie, the first data node) needs to be migrated to the MDS 1 (ie, the second data node), as shown in FIG. 6.
  • the method of the embodiment of the present invention may include the following steps:
  • the MDS2 sends a migration request to the MDS1.
  • the migration request may be triggered when the traffic access of the MDS2 is too hot, such as exceeding a preset heat threshold.
  • the MDS to be migrated to may be determined by any MDS or by the central node in the distributed file system in which the MDS2 is located. For example, according to the heat of each MDS, the MDS1 with the lowest heat is determined as the MDS to be migrated to.
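  • The lowest-heat rule mentioned above can be sketched as a one-line selection. The function name, the `exclude` parameter, and the heat values are hypothetical; the source also allows other preset rules or a random choice.

```python
def pick_target(heat_by_mds, exclude=()):
    """Return the name of the MDS with the lowest heat, skipping `exclude`."""
    candidates = {m: h for m, h in heat_by_mds.items() if m not in exclude}
    # min() raises ValueError if no candidate is left.
    return min(candidates, key=candidates.get)


# Hypothetical heat values; MDS2 is the overloaded source node.
target = pick_target({"MDS1": 10, "MDS2": 90, "MDS3": 40}, exclude={"MDS2"})
```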
  • the migration request may carry metadata information of the subtree D in the MDS2.
  • the metadata information includes information such as a file name, an attribute, and a size.
  • the MDS1 constructs metadata in the subtree D in the cache.
  • the MDS1 replies to the MDS2 with a response message that the migration preparation is completed.
  • the MDS2 notifies the MDS1 to start migrating the subtree D by sending a migration request to the MDS1.
  • MDS1 can construct the metadata in subtree D in the cache according to the metadata information in the request, and can reply MDS2 with a migration preparation completion message.
  • the MDS2 sends a migration notification message to the central node.
  • a migration notification message may also be sent to the central node to notify the central node that the attribution of the subtree D becomes MDS1.
  • the central node updates the subtree path attribution table.
  • the central node may modify the subtree path attribution table, update the home data node of the subtree D from MDS2 to MDS1 to obtain a new subtree path attribution table, and persist the new subtree path attribution table to disk.
  • the new subtree path attribution table can be as shown in Table 2 below:
  • the subtree path structure in the cache may be updated, that is, the new subtree path attribution table is cached in the form of a tree in the memory, as shown in FIG. 7, which is a schematic diagram of the updated subtree path structure.
  • the central node returns a response message of successful update to the MDS2.
  • MDS2 notifies MDS1 that the migration is successful.
  • a response message of successful update may be returned to the MDS2 to notify the MDS2 that the attribution table is changed.
  • MDS2 can send a message to MDS1 to notify it that the subtree has been migrated successfully.
  • the MDS1 changes the attribution information of the subtree D in the cache.
  • the MDS1 returns a response message that the migration succeeds to the MDS2.
  • the MDS1 can change the subtree attribution information in the cache to itself, and can return a response message to the MDS2.
  • the MDS2 can change the subtree ownership information in the cache to MDS1.
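  • The message exchange above can be summarized in a small in-process simulation. It is only a sketch under stated assumptions: the class names, method names, and reply strings are invented for the example, method calls stand in for network messages, and persistence is omitted.

```python
class CentralNode:
    def __init__(self, table):
        self.table = table  # ino -> name of the home MDS

    def on_migration_notice(self, ino, new_owner):
        self.table[ino] = new_owner  # a real node would also persist this
        return "update-ok"


class MDS:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # ino -> {"auth": owner name, "meta": metadata}

    def on_migration_request(self, ino, metadata, owner):
        # Build the subtree metadata in the cache; owner is still the source.
        self.cache[ino] = {"auth": owner, "meta": metadata}
        return "prepare-ok"

    def on_migration_success(self, ino):
        self.cache[ino]["auth"] = self.name  # target now owns the subtree
        return "migrate-ok"


def migrate(src, dst, central, ino):
    """Run the steps in order and return the replies that were received."""
    replies = [dst.on_migration_request(ino, src.cache[ino]["meta"], src.name)]
    replies.append(central.on_migration_notice(ino, dst.name))
    replies.append(dst.on_migration_success(ino))
    src.cache[ino]["auth"] = dst.name  # source records the new owner last
    return replies


mds1, mds2 = MDS("MDS1"), MDS("MDS2")
mds2.cache[7] = {"auth": "MDS2", "meta": {"name": "vim72"}}
central = CentralNode({7: "MDS2"})
replies = migrate(mds2, mds1, central, 7)
```

  • Updating the central table before either cache changes ownership means the persisted table is the reference if a node fails partway through.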
  • the MDS that takes over the subtree of the MDS2 can be determined.
  • the MDS2 fault condition may be detected by any MDS or a central node other than the MDS2 in the distributed file system where the MDS2 is located.
  • the MDS that is to take over may be determined by any MDS or by the central node in the distributed file system in which the MDS2 is located. For example, according to the heat of each MDS, the MDS1 with the lowest heat is determined as the MDS that takes over.
  • the MDS1 can send a takeover request to the central node (if the data node determined to take over is the central node itself, the takeover request need not be sent). The takeover request may carry the identification information of the MDS2 to inform the central node that the MDS2 has failed and that MDS1 needs to take over the subtree of MDS2.
  • the central node updates the subtree path attribution table.
  • the central node returns the subtree information of the MDS2 to the MDS1.
  • the central node may search the subtree path attribution table, change the home information of all subtrees belonging to the MDS2 to the MDS1 to obtain a new subtree path attribution table, and persist the new subtree path attribution table to disk.
  • the new subtree path attribution table can be as shown in Table 3 below:
  • the central node may also return all subtree information attributed to the MDS2 to the MDS1.
  • a new central node may also be determined, for example, the determined new central node is MDS1.
  • MDS1 can read the subtree path attribution table from the underlying storage (because the subtree path attribution table is stored on disk) and reconstruct the subtree information of the entire system in its cache.
  • the fault condition of the central node may be detected by any MDS in the system.
  • the MDS that needs to be taken over may also be determined by any MDS in the system, and is not described here.
  • the MDS1 can also detect whether there is a subtree on the central node of the fault that needs to be taken over.
  • the affiliation relationship between the subtrees and the MDSs and the subtree path information are managed by the central node and persisted to disk. When a data node in the system (an MDS or the central node) fails, the subtree path attribution table can be reconstructed by reading the persisted information, without sending requests to all the MDSs to obtain subtree information for subtree reconstruction. This reduces the burst of messages on the takeover node and shortens the subtree takeover time when an MDS fails, so that the takeover time is not affected by the cluster size and the switching time can be kept within 5 s.
  • the subtree path attribution table on the central node is taken as the standard, so that the complexity of reconstructing the subtree information is reduced, and the consistency requirement for the subtree information on each MDS is reduced.
  • the path determining module 11 is configured to determine a subtree path in the distributed file system.
  • the storage module 13 is configured to store node attribution information on the subtree path.
  • connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path, and an inode number of a previous subtree node of each subtree node.
  • the subtree path attribution table may further include subtree root node information and the like on the subtree path.
  • the device may further include:
  • a second acquiring module configured to acquire an index node number of the target subtree when a target subtree belonging to the first data node needs to be migrated to the second data node;
  • the first data node is a data node that needs to perform subtree migration
  • the second data node is the determined data node to which the target subtree is to be migrated.
  • the second data node may be selected according to a preset rule, for example the data node with the lowest heat, or may be selected randomly; this is not limited in the embodiments of the present invention.
  • subtree migration can be implemented based on the stored node attribution information.
  • a receiving module configured to receive a subtree takeover request indicating that the third data node is faulty
  • a second update module configured to update a home data node of the subtree belonging to the third data node in the subtree path attribution table to a fourth data node in response to the subtree takeover request;
  • a sending module configured to return, to the fourth data node, subtree information attributed to the third data node, so that the fourth data node performs subtree reconstruction in a data cache of the fourth data node.
  • the stored node attribution information allows subtree migration and subtree takeover on failure to be performed based on it, which solves the problem of long subtree takeover time when a data node fails and lowers the consistency requirement on the subtree information of the data nodes.
  • FIG. 11 is a schematic structural diagram of a data storage system according to an embodiment of the present invention.
  • the data storage system of the embodiment of the present invention may include a first data node 2 and a central node 1;
  • the first data node 2 is configured to provide a data service for a subtree of the first data node 2;
  • the central node 1 is configured to determine a subtree path in the distributed file system where the first data node 2 is located, acquire node attribution information on the subtree path, and store the node attribution information on the subtree path.
  • the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree.
  • the central node 1 is further configured to generate a subtree path attribution table including node attribution information on the subtree path, and store the subtree path attribution table;
  • connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path, and an inode number of a previous subtree node of each subtree node.
  • the subtree path attribution table may further include subtree root node information on the subtree path.
  • system further includes: a second data node 3; wherein
  • the central node 1 is configured to find, in the subtree path attribution table, the inode number that is the same as the inode number of the target subtree, and to update the home data node corresponding to the found inode number in the subtree path attribution table to the second data node 3.
  • the first data node 2 is configured to provide a data service for a subtree of the first data node
  • the second data node 3 is configured to provide a data service for a subtree of the second data node.
  • the first data node 2 is further configured to update the home data node of the target subtree in the data cache of the first data node 2 to the second data node 3.
  • the second data node 3 can also construct a data cache. After the subtree path attribution table is updated, the second data node 3 changes the home data node of the target subtree in its cache to itself, and the first data node 2 changes the home data node of the target subtree in its cache to the second data node 3, so that subtree information can be retrieved quickly.
  • the fourth data node 5 is configured to send, to the central node 1, a subtree takeover request for indicating that the third data node 4 is faulty;
  • the fourth data node 5 is further configured to perform subtree reconstruction in a data cache of the fourth data node based on the subtree information of the third data node 4.
  • FIG. 12 is a schematic structural diagram of a data server according to an embodiment of the present invention.
  • the data server in the embodiment of the present invention includes: a communication interface 300, a memory 200, and a processor 100, and the processor 100 is respectively connected to the communication interface 300 and the memory 200.
  • the memory 200 may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • the communication interface 300, the memory 200, and the processor 100 may be connected to each other through a bus, or may be connected by other means. In the present embodiment, a bus connection will be described.
  • the data server in the embodiment of the present invention may correspond to the central node in the embodiments corresponding to FIG. 1 to FIG. 11, and may specifically be a data node in the distributed file system, such as an MDS or a DS. For details, refer to the related descriptions in those embodiments.
  • the memory 200 is configured to store driver software
  • the processor 100 reads the driver software from the memory and, under the control of the driver software, performs the following:
  • the node attribution information on the subtree path is stored.
  • when storing the node attribution information on the subtree path under the control of the driver software, the processor 100 specifically performs the following steps:
  • connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path, and an inode number of a previous subtree node of each subtree node.
  • the subtree path attribution table may further include subtree root node information and the like on the subtree path.
  • the processor 100 is further configured to perform the following steps by using the driver software:
  • returning, to the fourth data node, the subtree information that belonged to the third data node in the subtree path attribution table before the update, so that the fourth data node performs subtree reconstruction in the data cache of the fourth data node.
  • the stored node attribution information allows subtree migration and subtree takeover on failure to be performed based on it, which solves the problem of long subtree takeover time when a data node such as an MDS fails and lowers the consistency requirement on the subtree information of the MDSs.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the modules is only a logical function division.
  • in actual implementation there may be other division manners; for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or module, and may be electrical, mechanical or otherwise.
  • the modules described as separate components may or may not be physically separated.
  • the components displayed as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the above-described integrated modules implemented in the form of software function modules can be stored in a computer readable storage medium.
  • the software function modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A data storage method, device and system, the method comprising: determining a subtree path in a distributed file system (101); obtaining node attribution information on the subtree path, the node attribution information indicating connection information between subtrees on the subtree path and the home data node of each subtree (102); and storing the node attribution information on the subtree path (103). The technical solution can resolve the problem of long subtree takeover time when an MDS fails, and lowers the consistency requirement on the subtree information of the MDSs.

Description

Data storage method, device and system
Technical field
The present invention relates to the field of communications technologies, and in particular, to a data storage method, device, and system.
Background
A distributed file system is a file system network formed by extending a file system fixed on one node to any number of nodes or file systems, with the many nodes connected to form a cluster. The system usually includes three types of nodes: protocol servers (Protocol Server), metadata servers (Meta Data Server, MDS), and data servers (Data Server, DS). The distributed file system can divide metadata services in several ways, for example a dynamic-subtree-based approach, in which each node maintains one or more subtrees, that is, is responsible for a part of the metadata service.
Currently, when an MDS in a distributed file system fails, a healthy MDS (the takeover node) must be selected to take over the subtrees belonging to the failed MDS so that metadata service continues. This process is called subtree takeover or subtree recovery. It proceeds as follows: the takeover node collects subtree path information and ownership relationships from all nodes in the system, deduplicates the collected subtree information, and regenerates the subtree information of the failed MDS, that is, rebuilds the failed MDS. Because the takeover node must send requests to every MDS in the system, when the cluster is large and the node sends and receives many messages during takeover, a single-point bottleneck easily arises and the takeover takes a long time. In addition, this approach requires the subtree information on the MDSs to be highly consistent; if it is inconsistent, the takeover node's reconstruction algorithm becomes complex.
Summary of the invention
Embodiments of the present invention provide a data storage method, device, and system that can solve the problem of long subtree takeover time when a data node fails and lower the consistency requirement on the subtree information of the data nodes.
In a first aspect, an embodiment of the present invention provides a data storage method, including:
determining a subtree path in a distributed file system, acquiring node attribution information on the subtree path, and storing the node attribution information on the subtree path.
The subtree path may include the root node of each subtree on each data node in the distributed file system (the subtree root node, or "subtree root" for short) and the subtree nodes on the path from each such subtree root to the root node of the system. The node attribution information indicates the connection information between subtrees on the subtree path and the home data node of each subtree. Operations such as subtree migration or subtree takeover can therefore be performed based on the stored node attribution information, which lowers the consistency requirement on subtree information in the system.
In an optional embodiment, storing the node attribution information on the subtree path may specifically be: generating a subtree path attribution table that includes the node attribution information on the subtree path, and storing the table. For example, the subtree path attribution table may be written to disk for persistent storage.
The connection information between subtrees on the subtree path includes the inode number of each subtree node on the subtree path and the inode number of the previous subtree node of each subtree node. Optionally, the subtree path attribution table may further include subtree root node information on the subtree path, and so on.
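The table described above can be pictured with a minimal Python sketch. It is an illustration only: the class name `SubtreeEntry`, the field names, the JSON file format, and the helper functions are all assumptions, since the text only specifies that each row carries an inode number, the inode number of the previous subtree node, and a home data node.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SubtreeEntry:
    """One row of a hypothetical subtree path attribution table."""
    inode: int                    # inode number of this subtree node
    parent_inode: Optional[int]   # inode number of the previous subtree node (None at the system root)
    home_node: str                # data node (for example an MDS) that the subtree belongs to
    is_subtree_root: bool = True  # optional extra column: subtree root marker


def build_table(entries):
    """Index the rows by inode number."""
    return {e.inode: e for e in entries}


def save_table(table, path):
    """Persist the table to disk; JSON is an assumed format."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(e) for e in table.values()], f)


def load_table(path):
    """Rebuild the in-memory table from the persisted copy."""
    with open(path, encoding="utf-8") as f:
        return build_table(SubtreeEntry(**row) for row in json.load(f))


def path_to_root(table, inode):
    """Follow the parent links from a subtree node up to the system root."""
    chain = []
    current = inode
    while current is not None:
        chain.append(current)
        current = table[current].parent_inode
    return chain
```

Because every row stores its parent's inode number, the connection between subtrees can be recovered from the persisted table alone, without asking any other node.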
In an optional embodiment, when a target subtree belonging to a first data node needs to be migrated to a second data node, the inode number of the target subtree may be acquired; the inode number that is the same as the inode number of the target subtree is found in the subtree path attribution table; and the home data node corresponding to the found inode number in the subtree path attribution table is updated to the second data node. The inode number of the target subtree may specifically be the inode number of the subtree root node of the target subtree. Subtree migration can thus be implemented based on the stored node attribution information.
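As a rough illustration of the migration step just described, the following Python sketch looks up the target subtree by inode number and rewrites its home data node. The dictionary layout and the function name `migrate_subtree` are assumptions made for the example, not part of the described method.

```python
def migrate_subtree(table, target_inode, second_node):
    """Update the home data node of the target subtree in the table.

    `table` is assumed to map inode number -> {"parent": ..., "home": ...}.
    Returns the previous home data node.
    """
    entry = table.get(target_inode)
    if entry is None:
        # The target subtree's inode number is not recorded in the table.
        raise KeyError(f"inode {target_inode} not found in attribution table")
    previous_home = entry["home"]
    entry["home"] = second_node
    return previous_home
```

Only the matching row changes; the rows of all other subtrees keep their home data nodes.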
In an optional embodiment, a subtree takeover request indicating that a third data node has failed may also be received; in response to the request, the home data node of each subtree belonging to the third data node in the subtree path attribution table is updated to a fourth data node; and the subtree information that belonged to the third data node in the table before the update is returned to the fourth data node, so that the fourth data node performs subtree reconstruction in its data cache. Subtree takeover can thus be performed based on the stored node attribution information, which solves the problem of long subtree takeover time when a data node such as an MDS fails.
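The takeover handling can be sketched in Python as follows. This is a minimal sketch under assumptions: the table is a plain dictionary keyed by inode number, and `take_over` is a hypothetical name for the central node's handler.

```python
def take_over(table, failed_node, takeover_node):
    """Reassign every subtree of `failed_node` to `takeover_node`.

    `table` is assumed to map inode number -> {"parent": ..., "home": ...}.
    Returns a copy of the affected rows as they were before the update,
    which is what the takeover node needs to rebuild its cache.
    """
    before = {
        inode: dict(entry)  # copy, so the snapshot keeps the old home node
        for inode, entry in table.items()
        if entry["home"] == failed_node
    }
    for inode in before:
        table[inode]["home"] = takeover_node
    return before
```

The takeover node receives one reply from the central node instead of collecting and deduplicating answers from every MDS in the cluster.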
In a second aspect, an embodiment of the present invention further provides a data storage system, including a first data node and a central node, where:
the first data node is configured to provide a data service for a subtree of the first data node;
the central node is configured to determine a subtree path in the distributed file system where the first data node is located, acquire node attribution information on the subtree path, and store the node attribution information on the subtree path.
The subtree path may include the root node of each subtree on each data node in the distributed file system and the subtree nodes on the path from each such subtree root to the root node of the system. The node attribution information indicates the connection information between subtrees on the subtree path and the home data node of each subtree. Optionally, the central node may be any data node in the distributed file system, such as an MDS, that is, an identity superimposed on a data node that still provides metadata service; or it may be a node set up separately in the system. This is not limited in the embodiments of the present invention.
In an optional embodiment, the central node may be further configured to generate a subtree path attribution table that includes the node attribution information on the subtree path, and store the subtree path attribution table.
The connection information between subtrees on the subtree path may include the inode number of each subtree node on the subtree path and the inode number of the previous subtree node of each subtree node. Optionally, the subtree path attribution table may further include subtree root node information on the subtree path, the subtree node name associated with each inode number, and so on.
Further, the system may also include a second data node, where:
the first data node may be further configured to send a migration notification message to the central node, the message indicating the inode number of the target subtree to be migrated and the second data node to be migrated to;
the central node is further configured to find, in the subtree path attribution table, the inode number that is the same as the inode number of the target subtree, and to update the home data node corresponding to the found inode number in the table to the second data node.
The first data node is the data node that needs to perform subtree migration, and the second data node is the determined data node to be migrated to. Optionally, the second data node may be selected according to a preset rule, for example the data node with the lowest heat, or may be selected randomly; this is not limited in the embodiments of the present invention. Subtree migration can thus be implemented based on the stored node attribution information.
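The choice of the second data node and the content of the migration notification message might look like the Python sketch below. The rule names, the per-node heat map, and the message fields are assumptions for illustration; the text only says the target may be the lowest-heat node or a random one, and that the message carries the target inode number and the destination node.

```python
import random


def choose_second_node(heat_by_node, first_node, rule="lowest_heat", rng=None):
    """Pick the data node to migrate to, excluding the migrating node itself."""
    candidates = {n: h for n, h in heat_by_node.items() if n != first_node}
    if not candidates:
        raise ValueError("no other data node available")
    if rule == "lowest_heat":
        return min(candidates, key=candidates.get)
    if rule == "random":
        # Sort first so a seeded generator gives a reproducible choice.
        return (rng or random).choice(sorted(candidates))
    raise ValueError(f"unknown rule: {rule}")


def build_migration_notice(target_inode, second_node):
    """Assumed shape of the migration notification sent to the central node."""
    return {"target_inode": target_inode, "second_node": second_node}
```

For example, with heats `{"MDS1": 90, "MDS2": 10, "MDS3": 40}` and `MDS1` as the migrating node, the lowest-heat rule selects `MDS2`.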
Further, the second data node may also be configured to construct a data cache and update the home data node of the target subtree in the cache to the second data node;
the first data node is further configured to update the home data node of the target subtree in the data cache of the first data node to the second data node.
In other words, the second data node can also construct a data cache. After the subtree path attribution table is updated, the second data node changes the home data node of the target subtree in its cache to itself, and the first data node changes the home data node of the target subtree in its cache to the second data node, so that subtree information can be retrieved quickly.
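The ordering described above (central table first, then the two caches) can be sketched in Python. The `DataNode` class, its `cache` dictionary, and `apply_migration` are hypothetical names introduced only for this example.

```python
class DataNode:
    """A data node with a local cache of inode number -> home data node name."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def set_home(self, inode, home):
        self.cache[inode] = home


def apply_migration(central_table, first, second, target_inode):
    """Update the central table, then both nodes' caches.

    `central_table` is assumed to map inode number -> home data node name.
    """
    central_table[target_inode] = second.name    # central node's table is updated first
    second.set_home(target_inode, second.name)   # second node now records itself as home
    first.set_home(target_inode, second.name)    # first node records the new home
```

After the call, both caches agree with the central table, so either node can answer where the target subtree lives without a further lookup.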
In an optional embodiment, the system may include a third data node and a fourth data node, where the third data node is the failed data node and the fourth data node is the determined data node that takes over the subtrees of the third data node, and where:
the fourth data node is configured to send, to the central node, a subtree takeover request indicating that the third data node has failed;
the central node is configured to receive the subtree takeover request, update the home data node of each subtree belonging to the third data node in the subtree path attribution table to the fourth data node, and return to the fourth data node the subtree information that belonged to the third data node in the table before the update;
the fourth data node is further configured to perform subtree reconstruction in the data cache of the fourth data node based on the subtree information of the third data node.
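On the fourth data node's side, the reconstruction step could be sketched as below. The shape of the returned subtree information and of the cache (a home map plus a parent-to-children index) is an assumption; the text does not specify how the cache is organized.

```python
from collections import defaultdict


def rebuild_cache(subtree_info, self_name):
    """Rebuild a takeover node's cache from the central node's reply.

    `subtree_info` is assumed to map inode number -> {"parent": ..., "home": ...}
    for the subtrees that belonged to the failed node.
    """
    cache = {"home": {}, "children": defaultdict(list)}
    for inode, record in subtree_info.items():
        cache["home"][inode] = self_name                   # the takeover node is the new home
        cache["children"][record["parent"]].append(inode)  # keep the link to the previous subtree node
    return cache
```

The parent links in the reply are enough to reattach each recovered subtree under its previous subtree node.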
Any two of the central node, the first data node, the second data node, the third data node, and the fourth data node may be the same data node or different nodes; this is not limited in the embodiments of the present invention. Subtree migration and subtree takeover can thus be performed based on the stored node attribution information, which solves the problem of long subtree takeover time when a data node such as an MDS fails and lowers the consistency requirement on subtree information in the system.
In a third aspect, an embodiment of the present invention further provides a data storage device, including a path determining module, a first acquiring module, and a storage module, where the data storage device can implement some or all of the steps of the data storage method of the first aspect through these modules.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium that stores a program, where the program, when executed, performs some or all of the steps of the data storage method of the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a data server, including a memory and a processor, where the memory is connected to the processor, and where:
the memory is configured to store driver software;
the processor is configured to read the driver software from the memory and, under the control of the driver software, perform the following:
determining a subtree path in a distributed file system;
acquiring node attribution information on the subtree path, where the node attribution information indicates connection information between subtrees on the subtree path and the home data node of each subtree;
storing the node attribution information on the subtree path.
Optionally, the processor is further configured to read the driver software from the memory and, under the control of the driver software, perform some or all of the steps of the data storage method of the first aspect.
Implementing the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, a subtree path in the distributed file system is detected, node attribution information that includes the connection information between subtrees on the subtree path and the home data node of each subtree is acquired, and the node attribution information is stored. Subtree migration and subtree takeover on failure can then be performed based on the stored node attribution information, which solves the problem of long subtree takeover time when a data node fails and lowers the consistency requirement on the subtree information of the data nodes.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is an architecture diagram of a distributed file system according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a data storage method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of subtree grouping according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a subtree path according to an embodiment of the present invention;
FIG. 5 is a schematic interaction diagram of a data storage method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of subtree migration according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of another subtree path according to an embodiment of the present invention;
FIG. 8 is a schematic interaction diagram of another data storage method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of yet another subtree path according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a data storage device according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a data storage system according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a data server according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
A distributed file system usually includes three types of nodes, protocol servers (Protocol Server), MDSs, and DSs, and multiple nodes are connected to form a cluster. The protocol server provides users with standard Network File System (NFS) and Server Message Block (SMB) network file services. The MDS provides metadata-related services such as path resolution, file lookup, and file creation. The DS stores file data. User programs access the file system through a protocol client (Protocol client). The protocol client and the protocol server are connected through a front-end network (FE), and the nodes inside the file system are connected through a back-end network (BE). The front-end network carries requests and data between user services and the distributed file system, and the back-end network carries requests and data between node devices inside the distributed file system. Further, the three types of nodes may be logical nodes; in actual deployment they may be placed on the same physical device or deployed separately, which is not limited in the embodiments of the present invention.
In a distributed file system, to achieve better performance, the metadata service of the whole system can be spread across the MDSs (or DSs), with each MDS (or DS) responsible for a part of it. The system can then also continue to provide metadata service when one or several MDSs (or DSs) fail, which improves reliability. The embodiments of the present invention are described with an MDS as the data node. The distributed file system can divide the metadata service, that is, divide it among MDSs, in several ways, for example a dynamic-subtree (Subtree) based approach. Specifically, the system can track in its cache the heat of each file, directory (or directory fragment), and directory tree according to factors such as the access type (for example getattr, setattr, readdir) and the access frequency. Different access types accumulate different amounts of heat, and the higher the access frequency, the higher the heat. As heat increases, the whole directory tree can be divided into multiple subtrees that are migrated to different MDSs, and each MDS is responsible for providing the metadata service for the subtrees that belong to it. Each MDS only needs to cache the metadata related to itself: the metadata of the subtrees belonging to it (used to provide metadata service), the boundary points between its subtrees and lower-level subtrees (so that a downward access knows where the metadata starts to belong to another MDS), and the path from the root of each of its subtrees to the root directory (so that the full path can still be resolved and traversed when a node fails).
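The heat statistics described above can be illustrated with a small Python sketch. The weights per access type and the split threshold are invented for the example; the text only states that different access types accumulate different heat and that heat grows with access frequency.

```python
from collections import Counter

# Assumed weights: the text does not give concrete values per access type.
HEAT_WEIGHT = {"getattr": 1, "setattr": 2, "readdir": 3}


def accumulate_heat(access_log):
    """Sum heat per path from (path, access_type) records."""
    heat = Counter()
    for path, access_type in access_log:
        heat[path] += HEAT_WEIGHT.get(access_type, 1)
    return heat


def split_candidates(heat, threshold):
    """Paths hot enough to be carved out as their own subtree (assumed rule)."""
    return sorted(p for p, h in heat.items() if h >= threshold)
```

With the log `[("/a", "getattr"), ("/a", "readdir"), ("/b", "setattr"), ("/a", "readdir")]`, the directory `/a` accumulates a heat of 7 and `/b` a heat of 2, so a threshold of 5 marks only `/a` as a candidate.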
Further, as the load of each MDS and the heat of each subtree change, subtrees can migrate freely between MDSs, and can be split into smaller subtrees or merged into larger ones. The division and ownership of subtrees are adjusted dynamically with service access (that is, with heat, which is also the load of the nodes), rather than being static.
The present application introduces a central node that centrally manages the subtree information of the entire system. The central node may be any MDS (or DS) node in the system (that is, an additional role is superimposed on an MDS, which still provides the metadata service as an MDS), or may be a separately provided node; this is not limited in the embodiments of the present invention. This solves the problems of a large single-point message volume and a long takeover time when subtrees are taken over in a large cluster, and also simplifies the takeover logic.
The embodiments of the present invention disclose a data storage method, apparatus, data server, and system, which can solve the problem of a long subtree takeover time when a data node fails and reduce the consistency requirement on the subtree information of the data nodes. Details are described below.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a data storage method according to an embodiment of the present invention. Specifically, the method may be applied to a central node such as an MDS or a DS. As shown in FIG. 2, the method may include the following steps:
101. Determine a subtree path in the distributed file system.
The subtree path of a subtree may refer to the path from the root of that subtree to the system root directory. In this embodiment, the subtree path of each subtree in the distributed file system (hereinafter "the system") may be determined first. That is, the subtree paths in the system may include the root node of each subtree on each data node, and the subtree nodes on the path from each such root node to the root node of the system (the system root directory).
102. Acquire node attribution information on the subtree path, where the node attribution information indicates connection information between subtrees on the subtree path and the home data node of each subtree.
The connection information between subtrees on the subtree path may include information such as the inode number of each subtree node and the inode number of the previous subtree node of each subtree node. The inode number is the identification information of a subtree node; each inode number uniquely identifies one subtree node (such as a directory or a file) in the distributed file system.
103. Store the node attribution information on the subtree path.
Optionally, storing the node attribution information on the subtree path may specifically be: generating a subtree path attribution table that includes the node attribution information on the subtree path, and storing the subtree path attribution table. That is, the subtree path attribution table includes the subtree paths from all subtrees in the system to the system root node. Optionally, the subtree path attribution table may further include subtree root node information on the subtree path, the name of the subtree node associated with each inode number, and so on.
Referring to FIG. 3, FIG. 3 is a schematic diagram of subtree grouping according to an embodiment of the present invention. As shown in FIG. 3, the distributed file system includes three data nodes, MDS1, MDS2, and MDS3, and is divided into four subtrees, A, B, C, and D. The subtree nodes in subtree A (/, var, lib, etc, usr, lib, lib.so.6) belong to MDS1; the subtree nodes in subtree B (log, messages, news, news.err) and in subtree D (such as vim72, plugin) belong to MDS2; and the subtree nodes in subtree C (share, bash, helpfiles, kill, cd, vim, site) belong to MDS3. "/" is the root node of the distributed file system, that is, the system root directory. The subtree nodes on the subtree paths of the distributed file system are /, var, usr, log, share, vim, and vim72, which are assumed to be denoted ino1, ino2, ino3, ino4, ino5, ino6, and ino7 respectively (other forms of identifier are possible; these are only examples). Each subtree node on a subtree path is associated with its identification information ino ("/" with ino1, "var" with ino2, "usr" with ino3, "log" with ino4, "share" with ino5, "vim" with ino6, and "vim72" with ino7), as shown in FIG. 4. Further, based on the node attribution information of the subtree nodes on the subtree paths, a subtree path attribution table may be generated, as shown in Table 1 below:
Table 1
Figure PCTCN2017082141-appb-000001
Here, ino identifies a subtree node on the subtree path, and dirino identifies the previous subtree node (parent node) of that subtree node. The subtree flag carries subtree root node ("subtree root" for short) information: "1" indicates that the subtree node is the root node of a subtree on a data node, and "0" indicates that it is not. auth indicates the home data node of the subtree node. The subtree flag in effect serves as a flag bit that indicates whether a node on the path is a subtree root (that is, a boundary at which the ownership of upper-level and lower-level nodes changes). Optionally, the subtree path attribution table may omit the subtree root information, in which case it can be derived from the attribution information, that is, determined from ino, dirino, and auth. For example, the home data node of ino2 (the parent of ino4) is MDS1 and the home data node of ino4 is MDS2; because the ownership changes, it can be determined that a subtree starts at ino4 and that ino4 is the root of that subtree. Further, each ino recorded in the attribution table is associated with one of /, var, usr, log, share, vim, and vim72. The ino is part of the attributes of a file or directory and is unique in the system, so it serves as the ID of the file or directory. For this reason, the name of a subtree node cannot be used in place of ino in the attribution table: the system may contain multiple files (or directories) with the same name, for example several directories called "var", and a name would therefore not uniquely identify the subtree node.
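As an illustration of deriving the subtree flag from ino, dirino, and auth, the sketch below marks a node as a subtree root when it has no parent or when its home data node differs from that of its parent. The Row class and is_subtree_root function are hypothetical names, and the rows are inferred from the FIG. 3 description rather than copied from Table 1, which is only available as an image.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Row:
    ino: str                # inode number of the subtree node
    dirino: Optional[str]   # inode number of the parent; None for the system root
    auth: str               # home data node of the subtree node


# Rows inferred from the FIG. 3 description (an assumption, not Table 1 itself).
TABLE: Dict[str, Row] = {
    r.ino: r
    for r in [
        Row("ino1", None, "MDS1"),    # /
        Row("ino2", "ino1", "MDS1"),  # var
        Row("ino3", "ino1", "MDS1"),  # usr
        Row("ino4", "ino2", "MDS2"),  # log
        Row("ino5", "ino3", "MDS3"),  # share
        Row("ino6", "ino5", "MDS3"),  # vim
        Row("ino7", "ino6", "MDS2"),  # vim72
    ]
}


def is_subtree_root(table: Dict[str, Row], ino: str) -> bool:
    """A node is a subtree root if ownership changes between it and its parent."""
    row = table[ino]
    if row.dirino is None:
        return True  # the system root always starts a subtree
    return table[row.dirino].auth != row.auth


flags = {ino: int(is_subtree_root(TABLE, ino)) for ino in TABLE}
```

This derivation only recovers boundaries where ownership actually changes, which is one reason a table might still store the flag explicitly.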
Optionally, the subtree path attribution table may be written to disk for persistent storage. Further, the content of the subtree path attribution table may also be cached in memory in the form of a tree, so that a subtree path can be accessed quickly and subtree node information obtained.
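Caching the table as a tree can be sketched as below. The row values follow the FIG. 3 description and are illustrative only; build_tree and path_to_root are hypothetical helper names, and a real cache would also carry the attribution data on each node.

```python
# (ino, dirino, auth) rows; values follow the FIG. 3 description and are illustrative.
ROWS = [
    ("ino1", None, "MDS1"),
    ("ino2", "ino1", "MDS1"),
    ("ino3", "ino1", "MDS1"),
    ("ino4", "ino2", "MDS2"),
    ("ino5", "ino3", "MDS3"),
    ("ino6", "ino5", "MDS3"),
    ("ino7", "ino6", "MDS2"),
]


def build_tree(rows):
    """Return (children, parent) maps keyed by inode number."""
    children = {ino: [] for ino, _, _ in rows}
    parent = {}
    for ino, dirino, _ in rows:
        parent[ino] = dirino
        if dirino is not None:
            children[dirino].append(ino)
    return children, parent


def path_to_root(parent, ino):
    """Walk parent links from a subtree node up to the system root."""
    path = [ino]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


children, parent = build_tree(ROWS)
vim72_path = path_to_root(parent, "ino7")
```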
Further optionally, after the node attribution information on the subtree path is stored, subtree migration may be performed based on the stored node attribution information. Specifically, when a target subtree belonging to a first data node needs to be migrated to a second data node, the central node may acquire the inode number of the target subtree, search the subtree path attribution table for an inode number identical to the inode number of the target subtree, and update the home data node corresponding to the found inode number in the subtree path attribution table to the second data node. Optionally, the subtree migration procedure may be triggered when the heat of a node is too high, when the distribution of heat in the system needs to be adjusted, and so on. The second data node to migrate to may be selected by any data node in the system according to a preset rule, for example the data node with the lowest heat, or may be selected at random; this is not limited in the embodiments of the present invention.
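A minimal sketch of this table update follows, assuming the table is held as a dictionary keyed by inode number. The function names pick_coolest and migrate_subtree and the heat values are hypothetical; choosing the lowest-heat node is just one of the preset rules the text allows.

```python
# ino -> row; illustrative values following the FIG. 3 description.
table = {
    "ino1": {"dirino": None, "auth": "MDS1"},
    "ino2": {"dirino": "ino1", "auth": "MDS1"},
    "ino3": {"dirino": "ino1", "auth": "MDS1"},
    "ino4": {"dirino": "ino2", "auth": "MDS2"},
    "ino5": {"dirino": "ino3", "auth": "MDS3"},
    "ino6": {"dirino": "ino5", "auth": "MDS3"},
    "ino7": {"dirino": "ino6", "auth": "MDS2"},
}


def pick_coolest(heat_by_node, exclude):
    """One possible preset rule: choose the data node with the lowest heat."""
    candidates = {n: h for n, h in heat_by_node.items() if n != exclude}
    return min(candidates, key=candidates.get)


def migrate_subtree(table, target_ino, new_owner):
    """Look up the target subtree's inode number and update its home data node."""
    row = table[target_ino]  # KeyError if the inode is not on a subtree path
    old_owner = row["auth"]
    row["auth"] = new_owner
    return old_owner


heat = {"MDS1": 10, "MDS2": 90, "MDS3": 40}  # made-up heat values
target = pick_coolest(heat, exclude="MDS2")
previous = migrate_subtree(table, "ino7", target)
```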
Further optionally, after the node attribution information on the subtree path is stored, when a data node fails, subtree takeover (subtree recovery) may be performed based on the stored node attribution information. Specifically, the central node receives a subtree takeover request indicating that a third data node has failed; in response to the subtree takeover request, updates the home data node of the subtrees belonging to the third data node in the subtree path attribution table to a fourth data node; and returns to the fourth data node the subtree information that belonged to the third data node in the subtree path attribution table before the update, so that the fourth data node performs subtree reconstruction in its cache. The third data node is the failed node, and the fourth data node is the data node determined to take over the subtrees of the third data node. Optionally, the fourth data node may be selected according to a preset rule, for example the data node with the lowest heat, or may be selected at random; this is not limited in the embodiments of the present invention.
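The takeover update can be sketched in the same style. The take_over function is a hypothetical name and the rows are illustrative; the point shown is that the rows returned to the successor are the ones captured before the update, as the paragraph above describes.

```python
# ino -> row; illustrative values following the FIG. 3 description.
table = {
    "ino1": {"dirino": None, "auth": "MDS1"},
    "ino2": {"dirino": "ino1", "auth": "MDS1"},
    "ino3": {"dirino": "ino1", "auth": "MDS1"},
    "ino4": {"dirino": "ino2", "auth": "MDS2"},
    "ino5": {"dirino": "ino3", "auth": "MDS3"},
    "ino6": {"dirino": "ino5", "auth": "MDS3"},
    "ino7": {"dirino": "ino6", "auth": "MDS2"},
}


def take_over(table, failed, successor):
    """Reassign every row owned by `failed`; return copies of the pre-update rows."""
    taken = [dict(row, ino=ino) for ino, row in table.items() if row["auth"] == failed]
    for entry in taken:
        table[entry["ino"]]["auth"] = successor
    return taken


# MDS2 has failed and MDS1 takes over its subtrees.
taken = take_over(table, failed="MDS2", successor="MDS1")
```

The successor can then rebuild the returned subtrees in its own cache without querying the other data nodes.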
In this embodiment of the present invention, the subtree paths in the distributed file system are detected, node attribution information comprising the connection information between subtrees on the subtree paths and the home data node of each subtree is acquired, and the node attribution information is stored. Subtree migration and takeover upon failure can then be performed based on the stored node attribution information, which solves the problem of a long subtree takeover time when a data node such as an MDS fails and reduces the consistency requirement on the subtree information of the MDSs.
Further, referring to FIG. 5, FIG. 5 is a schematic interaction diagram of a data storage method according to an embodiment of the present invention. This embodiment applies to a subtree migration scenario between data nodes, in which a target subtree belonging to a first data node needs to be migrated to a second data node. In this embodiment, it is assumed that subtree D (the target subtree) of MDS2 (the first data node) needs to be migrated to MDS1 (the second data node), as shown in FIG. 6. With reference to FIG. 5 and FIG. 6 together, the method may include the following steps:
201. MDS2 sends a migration request to MDS1.
Optionally, the migration request may be triggered when the service access volume of MDS2, that is, its heat, is too high, for example when it exceeds a preset heat threshold. The MDS to migrate to may be determined by any MDS in the distributed file system in which MDS2 is located or by the central node; for example, based on the heat of each MDS, MDS1, which has the lowest heat, is determined as the MDS to migrate to. Specifically, the migration request may carry the metadata information of subtree D in MDS2, which includes information such as file name, attributes, and size.
202. MDS1 constructs the metadata of subtree D in its cache.
203. MDS1 replies to MDS2 with a response message indicating that migration preparation is complete.
Specifically, MDS2 notifies MDS1 to start migrating subtree D by sending the migration request to MDS1. MDS1 can then construct the metadata of subtree D in its cache according to the metadata information in the request, and may reply to MDS2 with a migration-preparation-complete message.
204. MDS2 sends a migration notification message to the central node.
In addition, when it is determined that subtree D in MDS2 needs to be migrated, MDS2 may also send a migration notification message to the central node, to notify the central node that the ownership of subtree D changes to MDS1.
205. The central node updates the subtree path attribution table.
Specifically, after receiving the migration notification message, the central node may modify the subtree path attribution table by updating the home data node of subtree D from MDS2 to MDS1, obtain a new subtree path attribution table, and persist the new subtree path attribution table to disk. The new subtree path attribution table may be as shown in Table 2 below:
Table 2
Figure PCTCN2017082141-appb-000002
Figure PCTCN2017082141-appb-000003
Further, the subtree path structure in the cache may also be updated, that is, the new subtree path attribution table is cached in memory in the form of a tree. FIG. 7 is a schematic diagram of the updated subtree path structure.
206. The central node returns an update-success response message to MDS2.
207. MDS2 notifies MDS1 that the migration has succeeded.
Specifically, after successfully updating the subtree path attribution table, the central node may return an update-success response message to MDS2, to inform MDS2 that the attribution table has been changed. After receiving the response message from the central node, MDS2 may send a message to MDS1 to notify it that the subtree migration has succeeded.
208. MDS1 changes the attribution information of subtree D in its cache.
209. MDS1 returns a migration-success response message to MDS2.
210. MDS2 changes the attribution information of subtree D in its cache.
Specifically, after receiving the migration-success notification message, MDS1 may change the subtree attribution information in its cache to itself, and may return a response message to MDS2. After receiving the response message, MDS2 may change the subtree attribution information in its cache to MDS1.
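Steps 201 to 210 can be traced with a small in-process simulation. This is a sketch only: the Central and MDS classes, their method names, and the reply strings are hypothetical, messages are modelled as direct method calls, and persistence in step 205 is reduced to a dictionary update.

```python
class Central:
    """Central node holding the authoritative ino -> owner mapping."""

    def __init__(self, owners):
        self.owners = dict(owners)

    def on_migration_notice(self, ino, new_owner):
        # Step 205: update (and, in a real system, persist) the attribution table.
        self.owners[ino] = new_owner
        return "update-ok"  # step 206


class MDS:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # ino -> {"auth": owner, "meta": metadata}

    def on_migration_request(self, ino, meta, owner):
        # Step 202: build the subtree metadata in the local cache.
        self.cache[ino] = {"auth": owner, "meta": dict(meta)}
        return "prepared"  # step 203

    def on_migration_success(self, ino):
        # Step 208: the target now records itself as the owner.
        self.cache[ino]["auth"] = self.name
        return "migrated"  # step 209


def migrate(src, dst, central, ino):
    """Run steps 201-210 in order and return the step numbers executed."""
    steps = []
    entry = src.cache[ino]
    reply = dst.on_migration_request(ino, entry["meta"], src.name)  # 201-203
    assert reply == "prepared"
    steps += [201, 202, 203]
    reply = central.on_migration_notice(ino, dst.name)  # 204-206
    assert reply == "update-ok"
    steps += [204, 205, 206]
    reply = dst.on_migration_success(ino)  # 207-209
    assert reply == "migrated"
    steps += [207, 208, 209]
    entry["auth"] = dst.name  # 210: the source updates its own cache last
    steps.append(210)
    return steps


mds1, mds2 = MDS("MDS1"), MDS("MDS2")
mds2.cache["ino7"] = {"auth": "MDS2", "meta": {"name": "vim72"}}
central = Central({"ino7": "MDS2"})
steps = migrate(mds2, mds1, central, "ino7")
```

Note that in this ordering the central table is changed before either MDS changes its cached ownership, so the central node's view is the first to reflect the new owner.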
Further, referring to FIG. 8, FIG. 8 is a schematic interaction diagram of another data storage method according to an embodiment of the present invention. This embodiment may be applied to a subtree takeover scenario upon data node failure, in which a third data node fails and its subtrees need to be migrated to a fourth data node. In this embodiment, it is assumed that MDS2 (the third data node) is detected to have failed and that the subtrees of MDS2 need to be migrated to MDS1 (the fourth data node). As shown in FIG. 8, the method may include the following steps:
301. When MDS2 fails, MDS1 sends a takeover request to the central node.
Specifically, when MDS2 fails, the MDS that will take over the subtrees of MDS2 may be determined. Optionally, the failure of MDS2 may be detected by any MDS other than MDS2 in the distributed file system in which MDS2 is located, or by the central node. The MDS that takes over may be determined by any MDS in that distributed file system or by the central node; for example, based on the heat of each MDS, MDS1, which has the lowest heat, is determined as the MDS that takes over. If the MDS that takes over is MDS1, MDS1 may send a takeover request to the central node (if the data node determined to take over is the central node itself, the takeover request need not be sent). The takeover request may carry the identification information of MDS2, to inform the central node that MDS2 has failed and that MDS1 needs to take over the subtrees of MDS2.
302. The central node updates the subtree path attribution table.
303. The central node returns the subtree information of MDS2 to MDS1.
Specifically, after receiving the takeover request, the central node may search the subtree path attribution table, change the attribution information of all subtrees belonging to MDS2 to MDS1, obtain a new subtree path attribution table, and persist the new subtree path attribution table to disk. The new subtree path attribution table may be as shown in Table 3 below:
Table 3
ino     dirino    subtree flag    auth
ino1    -         1               MDS1
ino2    ino1      0               MDS1
ino3    ino1      0               MDS1
ino4    ino2      1               MDS1
ino5    ino3      1               MDS3
ino6    ino5      0               MDS3
ino7    ino6      1               MDS1
Further, the subtree path structure in the cache may also be updated, that is, the new subtree path attribution table is cached in memory in the form of a tree. FIG. 9 is a schematic diagram of the updated subtree path structure.
Further, after updating the subtree path attribution table, the central node may also return all subtree information that belonged to MDS2 to MDS1.
304. MDS1 reconstructs the subtree information of MDS2 in its cache.
Specifically, after receiving the subtree information of MDS2, MDS1 can reconstruct in its cache the subtrees that originally belonged to MDS2. Because the subtree path attribution table records only reduced subtree information (only the ino), the metadata cache reconstructed by MDS1 is also incomplete and is referred to as the empty type. Metadata of this type can be filled in gradually through disk reads triggered by service access. Thus, when an MDS node fails, the node that takes over can obtain the subtree information of the failed MDS directly from the central node without sending requests to all MDSs. This reduces the volume of messages transmitted in the system and the subtree takeover time when an MDS fails, and reduces the consistency requirement on the subtree information of the MDSs.
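The idea of empty-type entries that are filled in on first access can be sketched as follows. The CachedInode and TakeoverCache classes, the FAKE_DISK contents, and the disk_reads counter are all hypothetical; the disk read is modelled as a plain dictionary lookup.

```python
class CachedInode:
    """Cache entry rebuilt from the attribution table; attrs stay None until read."""

    def __init__(self, ino, dirino):
        self.ino = ino
        self.dirino = dirino
        self.attrs = None  # None marks an "empty" entry

    @property
    def is_empty(self):
        return self.attrs is None


class TakeoverCache:
    def __init__(self, read_from_disk):
        self._read = read_from_disk  # callable: ino -> attribute dict
        self.entries = {}
        self.disk_reads = 0

    def rebuild(self, rows):
        """Create empty entries from (ino, dirino) pairs sent by the central node."""
        for ino, dirino in rows:
            self.entries[ino] = CachedInode(ino, dirino)

    def getattr(self, ino):
        """Service access: fill an empty entry from disk once, then serve from cache."""
        entry = self.entries[ino]
        if entry.is_empty:
            entry.attrs = self._read(ino)
            self.disk_reads += 1
        return entry.attrs


# Stand-in for on-disk metadata (made-up contents).
FAKE_DISK = {"ino4": {"name": "log"}, "ino7": {"name": "vim72"}}

cache = TakeoverCache(FAKE_DISK.__getitem__)
cache.rebuild([("ino4", "ino2"), ("ino7", "ino6")])
empty_before = cache.entries["ino4"].is_empty
first = cache.getattr("ino4")
second = cache.getattr("ino4")
```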
Further optionally, if the failed MDS is the central node, a new central node may also be determined, for example MDS1. MDS1 can then read the subtree path attribution table from the underlying storage (because the subtree path attribution table is stored on disk) and reconstruct the subtree information of the entire system in its cache. Optionally, the failure of the central node may be detected by any MDS in the system. The MDS that takes over may also be determined by any MDS in the system; details are not repeated here. Further, MDS1 may also check whether the failed central node has subtrees that need to be taken over. If so, MDS1 may search the subtree path attribution table for the subtrees belonging to the central node, change their ownership to MDS1 to obtain a new subtree path attribution table, and persist the new subtree path attribution table to disk, replacing the previous one.
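Recovery by a new central node can be sketched as reading the persisted table, reassigning the failed node's rows, and writing the table back. The JSON file format, the file name, the persist and recover_as_central functions, and the write-to-temporary-then-rename step are assumptions made for this illustration; the text only says the table is stored on disk and replaced after the update.

```python
import json
import os
import tempfile


def persist(table, path):
    """Write the table to a temporary file, then replace the old copy in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(table, f)
    os.replace(tmp, path)


def recover_as_central(path, failed_central, me):
    """Load the persisted table; take over any subtrees the failed central node owned."""
    with open(path) as f:
        table = json.load(f)
    changed = False
    for row in table.values():
        if row["auth"] == failed_central:
            row["auth"] = me
            changed = True
    if changed:
        persist(table, path)  # replaces the previous attribution table
    return table


# Demo: suppose MDS3 was the central node and has failed; MDS1 becomes central.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "subtree_table.json")
    persist(
        {
            "ino1": {"dirino": None, "auth": "MDS1"},
            "ino3": {"dirino": "ino1", "auth": "MDS1"},
            "ino5": {"dirino": "ino3", "auth": "MDS3"},
            "ino6": {"dirino": "ino5", "auth": "MDS3"},
        },
        path,
    )
    recovered = recover_as_central(path, failed_central="MDS3", me="MDS1")
    with open(path) as f:
        on_disk = json.load(f)
```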
In this embodiment of the present invention, a central node is introduced to centrally manage the ownership relationship between subtrees and MDSs together with the subtree path information, and this information is persisted to disk. When a data node in the system (an MDS or the central node) fails, the subtree path attribution table can be reconstructed by reading the persisted information, without sending requests to all MDSs to obtain subtree information for subtree reconstruction. This reduces the message burst on the node that takes over and the subtree takeover time when an MDS fails, so that the takeover time is not affected by cluster size; the switchover time can generally be kept within 5 s. In addition, because the subtree path attribution table on the central node is authoritative during subtree takeover, the complexity of reconstructing subtree information is reduced, as is the consistency requirement on the subtree information on each MDS.
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a data storage apparatus according to an embodiment of the present invention. Specifically, as shown in FIG. 10, the data storage apparatus may include a path determining module 11, a first acquiring module 12, and a storage module 13, where:
The path determining module 11 is configured to determine a subtree path in the distributed file system.
The subtree path may include the root node of each subtree on each data node in the distributed file system, and the subtree nodes on the path from each such root node to the root node of the system.
The first acquiring module 12 is configured to acquire node attribution information on the subtree path.
The node attribution information indicates connection information between subtrees on the subtree path and the home data node of each subtree.
The storage module 13 is configured to store the node attribution information on the subtree path.
Optionally, in this embodiment of the present invention, the storage module 13 may be specifically configured to:
generate a subtree path attribution table that includes the node attribution information on the subtree path, and store the subtree path attribution table.
The connection information between subtrees on the subtree path includes the inode number of each subtree node on the subtree path and the inode number of the previous subtree node of each subtree node. Optionally, the subtree path attribution table may further include subtree root node information on the subtree path, and so on.
Further optionally, the apparatus may further include:
a second acquiring module, configured to acquire the inode number of a target subtree when the target subtree, which belongs to a first data node, needs to be migrated to a second data node;
a searching module, configured to search the subtree path attribution table for an inode number identical to the inode number of the target subtree; and
a first updating module, configured to update the home data node corresponding to the found inode number in the subtree path attribution table to the second data node.
The first data node is the data node whose subtree needs to be migrated, and the second data node is the data node determined as the migration destination. Optionally, the second data node may be selected according to a preset rule, for example the data node with the lowest heat, or may be selected at random; this is not limited in the embodiments of the present invention. Subtree migration can thus be implemented based on the stored node attribution information.
Further optionally, the apparatus may further include:
a receiving module, configured to receive a subtree takeover request indicating that a third data node has failed;
a second updating module, configured to, in response to the subtree takeover request, update the home data node of the subtrees belonging to the third data node in the subtree path attribution table to a fourth data node; and
a sending module, configured to return to the fourth data node the subtree information belonging to the third data node, so that the fourth data node performs subtree reconstruction in the data cache of the fourth data node.
Any two of the central node, the first data node, the second data node, the third data node, and the fourth data node may be the same data node or different nodes; this is not limited in the embodiments of the present invention. Subtree migration and subtree takeover can thus be performed based on the stored node attribution information.
In this embodiment of the present invention, the subtree paths in the distributed file system are detected, node attribution information comprising the connection information between subtrees on the subtree paths and the home data node of each subtree is acquired, and the node attribution information is stored. Subtree migration and takeover upon failure can then be performed based on the stored node attribution information, which solves the problem of a long subtree takeover time when a data node fails and reduces the consistency requirement on the subtree information of the data nodes.
请参见图11,图11是本发明实施例提供的一种数据存储系统的结构示意图。具体的,如图11所示,本发明实施例的所述数据存储系统可包括第一数据节点2以及中心节点1;其中, Referring to FIG. 11, FIG. 11 is a schematic structural diagram of a data storage system according to an embodiment of the present invention. Specifically, as shown in FIG. 11, the data storage system of the embodiment of the present invention may include a first data node 2 and a central node 1;
所述第一数据节点2,用于为所述第一数据节点1的子树提供数据服务;The first data node 2 is configured to provide a data service for a subtree of the first data node 1;
所述中心节点1,用于确定所述第一数据节点2所在的分布式文件系统中的子树路径;获取所述子树路径上的节点归属信息,并存储所述子树路径上的节点归属信息。The central node 1 is configured to determine a subtree path in the distributed file system where the first data node 2 is located; acquire node attribution information on the subtree path, and store nodes on the subtree path Ownership information.
其中,该节点归属信息指示了该子树路径上的子树间的连接信息以及每一个子树的归属数据节点。The node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree.
Optionally, in this embodiment of the present invention,
the central node 1 is further configured to generate a subtree path attribution table that includes the node attribution information on the subtree path, and to store the subtree path attribution table;
wherein the connection information between the subtrees on the subtree path includes the inode number of each subtree node on the subtree path and the inode number of the previous subtree node of each subtree node. Optionally, the subtree path attribution table may further include subtree root node information on the subtree path.
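The table described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the embodiment does not prescribe a storage layout, so the class `SubtreeEntry`, the function `build_attribution_table`, the field names, and the node labels `MDS-A` and `MDS-B` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SubtreeEntry:
    """One row of a hypothetical subtree path attribution table."""
    inode: int                   # inode number of this subtree node
    parent_inode: Optional[int]  # inode number of the previous subtree node; None for the root
    home_node: str               # data node that currently serves this subtree


def build_attribution_table(path):
    """Build the table from an ordered subtree path of (inode, home_node) pairs."""
    table: Dict[int, SubtreeEntry] = {}
    previous = None
    for inode, home_node in path:
        # Each row links back to the subtree node that precedes it on the path.
        table[inode] = SubtreeEntry(inode, previous, home_node)
        previous = inode
    return table


# Assumed example path: root subtree (inode 1) -> inode 7 -> inode 23.
table = build_attribution_table([(1, "MDS-A"), (7, "MDS-A"), (23, "MDS-B")])
```

Keying the rows by inode number keeps the lookup used later for migration a single dictionary access; a real central node would also need to persist the table, which this sketch omits.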
Further optionally, the system further includes a second data node 3, wherein:
the first data node 2 is configured to send a migration notification message to the central node 1, where the migration notification message indicates the inode number of the target subtree to be migrated and the second data node 3 to which the target subtree is to be migrated; and
the central node 1 is configured to search the subtree path attribution table for an inode number that is the same as the inode number of the target subtree, and to update the home data node corresponding to the found inode number in the subtree path attribution table to the second data node 3.
The first data node 2 is configured to provide a data service for the subtree of the first data node, and the second data node 3 is configured to provide a data service for the subtree of the second data node.
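The lookup-and-update step performed by the central node during migration can be sketched as follows. This is a minimal illustration, assuming the subtree path attribution table is held as an in-memory dictionary keyed by inode number; the function name `migrate_subtree`, the field names, and the node labels are hypothetical and do not come from the embodiment.

```python
def migrate_subtree(table, target_inode, new_home):
    """Re-home one subtree in the attribution table and return its previous home.

    `table` maps inode number -> {"parent": inode or None, "home": node label}.
    """
    entry = table.get(target_inode)  # find the row with the same inode number
    if entry is None:
        raise KeyError(f"inode {target_inode} is not on any recorded subtree path")
    old_home = entry["home"]
    entry["home"] = new_home         # update the home data node of that row
    return old_home


# Assumed table: inode 23 is currently served by the first data node.
table = {
    1: {"parent": None, "home": "data-node-1"},
    23: {"parent": 1, "home": "data-node-1"},
}
previous_home = migrate_subtree(table, 23, "data-node-2")
```

Only the row of the target subtree changes; the connection information (the parent inode number) is left untouched, so the subtree path itself is unaffected by the migration.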
Further optionally, in this embodiment of the present invention,
the second data node 3 is further configured to construct a data cache and to update the home data node of the target subtree in the cache to the second data node 3; and
the first data node 2 is further configured to update the home data node of the target subtree in the data cache of the first data node 2 to the second data node 3.
That is, the second data node 3 may also construct a data cache. After the subtree path attribution table is updated, the second data node 3 changes the home data node of the target subtree in its cache to itself, and the first data node 2 changes the home data node of the target subtree in its cache to the second data node 3, so that the subtree information can be retrieved quickly.
Further optionally, the system further includes a third data node 4 and a fourth data node 5, where the third data node 4 is a failed data node, wherein:
the fourth data node 5 is configured to send, to the central node 1, a subtree takeover request indicating that the third data node 4 has failed;
the central node 1 is configured to receive the subtree takeover request, update the home data node of each subtree that belongs to the third data node 4 in the subtree path attribution table to the fourth data node 5, and return to the fourth data node 5 the subtree information that belonged to the third data node 4 in the subtree path attribution table before the update; and
the fourth data node 5 is further configured to perform subtree reconstruction in the data cache of the fourth data node based on the subtree information of the third data node 4.
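The central node's handling of a takeover request can be sketched in the same style. This is a hedged illustration, assuming the same dictionary layout as an in-memory stand-in for the subtree path attribution table; `take_over` and the node labels are hypothetical names, and failure detection and messaging are left out.

```python
def take_over(table, failed_node, successor):
    """Re-home every subtree of `failed_node` and return the pre-update records."""
    taken = []
    for inode, entry in table.items():
        if entry["home"] == failed_node:
            # Copy the row as it stood before the update, for the successor.
            taken.append({"inode": inode, "parent": entry["parent"], "home": entry["home"]})
            entry["home"] = successor
    return taken


# Assumed table: inodes 7 and 23 belong to the failed third data node.
table = {
    1: {"parent": None, "home": "data-node-1"},
    7: {"parent": 1, "home": "data-node-3"},
    23: {"parent": 7, "home": "data-node-3"},
}
taken = take_over(table, "data-node-3", "data-node-4")
```

Returning the records as they were before the update mirrors the text above: the successor receives the failed node's subtree information from the central node instead of having to collect it from the failed node.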
In this embodiment of the present invention, the data nodes in the distributed file system can communicate with one another. Any two of the central node, the first data node, the second data node, the third data node, and the fourth data node may be the same data node or different nodes, which is not limited in this embodiment. For details of the central node and the first to fourth data nodes, refer to the related descriptions in the embodiments corresponding to FIG. 1 to FIG. 9; details are not repeated here.
Referring to FIG. 12, FIG. 12 is a schematic structural diagram of a data server according to an embodiment of the present invention. Specifically, as shown in FIG. 12, the data server of this embodiment includes a communication interface 300, a memory 200, and a processor 100, and the processor 100 is connected to both the communication interface 300 and the memory 200. The memory 200 may be a high-speed RAM or a non-volatile memory, for example, at least one disk memory. The communication interface 300, the memory 200, and the processor 100 may be connected through a bus or in another manner; this embodiment is described using a bus connection as an example. Specifically, the data server in this embodiment may correspond to the central node in the embodiments corresponding to FIG. 1 to FIG. 11, and may specifically be a data node in the distributed file system, such as an MDS or a DS; for details, refer to the related descriptions in the embodiments corresponding to FIG. 1 to FIG. 11. Wherein:
the memory 200 is configured to store driver software; and
the processor 100 reads the driver software from the memory and, under control of the driver software, performs the following steps:
determining a subtree path in the distributed file system;
acquiring node attribution information on the subtree path, where the node attribution information indicates the connection information between the subtrees on the subtree path and the home data node of each subtree; and
storing the node attribution information on the subtree path.
Optionally, when storing the node attribution information on the subtree path under control of the driver software, the processor 100 specifically performs the following step:
generating a subtree path attribution table that includes the node attribution information on the subtree path, and storing the subtree path attribution table;
wherein the connection information between the subtrees on the subtree path includes the inode number of each subtree node on the subtree path and the inode number of the previous subtree node of each subtree node.
Optionally, the subtree path attribution table may further store subtree root node information on the subtree path, and the like.
Optionally, under control of the driver software, the processor 100 is further configured to perform the following steps:
when a target subtree belonging to a first data node needs to be migrated to a second data node, acquiring the inode number of the target subtree;
searching the subtree path attribution table for an inode number that is the same as the inode number of the target subtree; and
updating the home data node corresponding to the found inode number in the subtree path attribution table to the second data node.
Optionally, under control of the driver software, the processor 100 is further configured to perform the following steps:
receiving, through the communication interface 300, a subtree takeover request indicating that a third data node has failed;
in response to the subtree takeover request, updating the home data node of each subtree that belongs to the third data node in the subtree path attribution table to a fourth data node; and
returning to the fourth data node, through the communication interface 300, the subtree information that belonged to the third data node in the subtree path attribution table before the update, so that the fourth data node performs subtree reconstruction in the data cache of the fourth data node.
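How the fourth data node might use the returned subtree information can be sketched as follows. The embodiment does not specify how the reconstruction is carried out, so this is purely an assumption for illustration: the returned records are taken to carry an inode number and the inode number of the previous subtree node, and `rebuild_cache` is a hypothetical helper that turns them into a parent-to-children index.

```python
from collections import defaultdict


def rebuild_cache(records):
    """Rebuild a parent -> children index from subtree records returned by the central node."""
    known = {r["inode"] for r in records}
    children = defaultdict(list)
    roots = []
    for r in records:
        if r["parent"] in known:
            children[r["parent"]].append(r["inode"])
        else:
            # The previous subtree node is served elsewhere (or this is the root),
            # so this subtree is a top-level entry in the local cache.
            roots.append(r["inode"])
    return roots, dict(children)


# Assumed records for the failed node: inode 7 with two subtrees beneath it.
records = [
    {"inode": 7, "parent": 1},
    {"inode": 23, "parent": 7},
    {"inode": 42, "parent": 7},
]
roots, children = rebuild_cache(records)
```

The connection information alone is enough to recover the shape of the failed node's subtrees in this sketch, which is the reason the central node stores the previous subtree node's inode number alongside each entry.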
In this embodiment of the present invention, a subtree path in the distributed file system is detected, and node attribution information is acquired that includes the connection information between the subtrees on the subtree path and the home data node of each subtree. The node attribution information is then stored, so that subtree migration and the takeover procedure upon a subtree failure can be performed based on the stored node attribution information. This solves the problem of long subtree takeover time when a data node such as an MDS fails and relaxes the consistency requirement on the subtree information held by the MDSs.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
An integrated module implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (14)

  1. A data storage method, comprising:
    determining a subtree path in a distributed file system;
    acquiring node attribution information on the subtree path, wherein the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree; and
    storing the node attribution information on the subtree path.
  2. The method according to claim 1, wherein the storing the node attribution information on the subtree path comprises:
    generating a subtree path attribution table that includes the node attribution information on the subtree path, and storing the subtree path attribution table;
    wherein the connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path and an inode number of a previous subtree node of each subtree node.
  3. The method according to claim 2, wherein the method further comprises:
    when a target subtree belonging to a first data node needs to be migrated to a second data node, acquiring an inode number of the target subtree;
    searching the subtree path attribution table for an inode number that is the same as the inode number of the target subtree; and
    updating the home data node corresponding to the found inode number in the subtree path attribution table to the second data node.
  4. The method according to claim 2, wherein the method further comprises:
    receiving a subtree takeover request indicating that a third data node has failed;
    in response to the subtree takeover request, updating the home data node of each subtree that belongs to the third data node in the subtree path attribution table to a fourth data node; and
    returning to the fourth data node the subtree information that belonged to the third data node in the subtree path attribution table before the update, so that the fourth data node performs subtree reconstruction in a data cache of the fourth data node.
  5. A data storage system, comprising a first data node and a central node, wherein:
    the first data node is configured to provide a data service for a subtree of the first data node; and
    the central node is configured to determine a subtree path in a distributed file system in which the first data node is located, acquire node attribution information on the subtree path, and store the node attribution information on the subtree path;
    wherein the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree.
  6. The system according to claim 5, wherein:
    the central node is further configured to generate a subtree path attribution table that includes the node attribution information on the subtree path, and to store the subtree path attribution table;
    wherein the connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path and an inode number of a previous subtree node of each subtree node.
  7. The system according to claim 6, wherein the system further comprises a second data node, wherein:
    the first data node is further configured to send a migration notification message to the central node, wherein the migration notification message indicates an inode number of a target subtree to be migrated and the second data node to which the target subtree is to be migrated; and
    the central node is further configured to search the subtree path attribution table for an inode number that is the same as the inode number of the target subtree, and to update the home data node corresponding to the found inode number in the subtree path attribution table to the second data node.
  8. The system according to claim 7, wherein:
    the second data node is further configured to construct a data cache and to update the home data node of the target subtree in the cache to the second data node; and
    the first data node is further configured to update the home data node of the target subtree in a data cache of the first data node to the second data node.
  9. The system according to claim 6, wherein the system further comprises a third data node and a fourth data node, the third data node being a failed data node and the fourth data node being a data node used to take over the subtree of the third data node, wherein:
    the fourth data node is configured to send, to the central node, a subtree takeover request indicating that the third data node has failed;
    the central node is further configured to receive the subtree takeover request, update the home data node of each subtree that belongs to the third data node in the subtree path attribution table to the fourth data node, and return to the fourth data node the subtree information that belonged to the third data node in the subtree path attribution table before the update; and
    the fourth data node is further configured to perform subtree reconstruction in a data cache of the fourth data node based on the subtree information of the third data node.
  10. A data storage apparatus, comprising:
    a path determining module, configured to determine a subtree path in a distributed file system;
    a first acquiring module, configured to acquire node attribution information on the subtree path, wherein the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree; and
    a storage module, configured to store the node attribution information on the subtree path.
  11. The apparatus according to claim 10, wherein the storage module is specifically configured to:
    generate a subtree path attribution table that includes the node attribution information on the subtree path, and store the subtree path attribution table;
    wherein the connection information between the subtrees on the subtree path includes an inode number of each subtree node on the subtree path and an inode number of a previous subtree node of each subtree node.
  12. The apparatus according to claim 11, wherein the apparatus further comprises:
    a second acquiring module, configured to acquire an inode number of a target subtree when the target subtree belonging to a first data node needs to be migrated to a second data node;
    a searching module, configured to search the subtree path attribution table for an inode number that is the same as the inode number of the target subtree; and
    a first updating module, configured to update the home data node corresponding to the found inode number in the subtree path attribution table to the second data node.
  13. The apparatus according to claim 11, wherein the apparatus further comprises:
    a receiving module, configured to receive a subtree takeover request indicating that a third data node has failed;
    a second updating module, configured to update, in response to the subtree takeover request, the home data node of each subtree that belongs to the third data node in the subtree path attribution table to a fourth data node; and
    a sending module, configured to return to the fourth data node the subtree information that belonged to the third data node in the subtree path attribution table before the update, so that the fourth data node performs subtree reconstruction in a data cache of the fourth data node.
  14. A data server, comprising a memory and a processor, the memory being connected to the processor, wherein:
    the memory is configured to store driver software; and
    the processor is configured to read the driver software from the memory and, under control of the driver software, perform the following steps:
    determining a subtree path in a distributed file system;
    acquiring node attribution information on the subtree path, wherein the node attribution information indicates connection information between subtrees on the subtree path and a home data node of each subtree; and
    storing the node attribution information on the subtree path.
PCT/CN2017/082141 2016-09-30 2017-04-27 Data storage method, device and system WO2018058949A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610866797.2A CN106446197B (en) 2016-09-30 2016-09-30 A kind of date storage method, apparatus and system
CN201610866797.2 2016-09-30

Publications (1)

Publication Number Publication Date
WO2018058949A1 true WO2018058949A1 (en) 2018-04-05

Family

ID=58171408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/082141 WO2018058949A1 (en) 2016-09-30 2017-04-27 Data storage method, device and system

Country Status (2)

Country Link
CN (1) CN106446197B (en)
WO (1) WO2018058949A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125024A (en) * 2019-11-29 2020-05-08 浪潮电子信息产业股份有限公司 Method, device, equipment and storage medium for deleting distributed system files
CN111176916A (en) * 2019-12-20 2020-05-19 国久大数据有限公司 Data storage fault diagnosis method and system
CN112131223A (en) * 2020-09-24 2020-12-25 曙光网络科技有限公司 Traffic classification statistical method, device, computer equipment and storage medium
EP3995972A4 (en) * 2019-07-05 2022-10-19 ZTE Corporation Metadata processing method and apparatus, and computer-readable storage medium

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446197B (en) * 2016-09-30 2019-11-19 华为数字技术(成都)有限公司 A kind of date storage method, apparatus and system
WO2019000386A1 (en) 2017-06-30 2019-01-03 Microsoft Technology Licensing, Llc Online schema change of range-partitioned index in distributed storage system
WO2019000388A1 (en) * 2017-06-30 2019-01-03 Microsoft Technology Licensing, Llc Staging anchor trees for improved concurrency and performance in page range index management
CN107798104A (en) * 2017-10-31 2018-03-13 郑州云海信息技术有限公司 A kind of catalog management method, device, equipment and computer-readable recording medium
CN108153842B (en) * 2017-12-18 2021-11-26 青岛科技大学 Abstract fault tree-oriented structure synthesis method
CN109962797B (en) * 2017-12-23 2022-01-11 华为技术有限公司 Storage system and method for pushing business view
WO2021017655A1 (en) * 2019-07-30 2021-02-04 华为技术有限公司 Method, apparatus, and computing device for obtaining inode number, and storage medium
CN110659249A (en) * 2019-09-25 2020-01-07 浪潮电子信息产业股份有限公司 Metadata subtree migration method, device and equipment and readable storage medium
CN113342780A (en) * 2021-06-28 2021-09-03 深圳壹账通智能科技有限公司 DSU data migration method and device and computer equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101344882A (en) * 2007-07-10 2009-01-14 中国移动通信集团公司 Data query method, insertion method and deletion method
CN102238202A (en) * 2010-04-23 2011-11-09 华为技术有限公司 Method and device for storing and searching index information
CN103020315A (en) * 2013-01-10 2013-04-03 中国人民解放军国防科学技术大学 Method for storing mass of small files on basis of master-slave distributed file system
CN104239528A (en) * 2014-09-19 2014-12-24 深圳市心讯网络科技有限公司 File storage system and file storage path recording method
WO2015122905A1 (en) * 2014-02-14 2015-08-20 Hewlett-Packard Development Company, L.P. Assign placement policy to segment set
CN105635310A (en) * 2016-01-20 2016-06-01 杭州宏杉科技有限公司 Access method and device for storage resource
CN106446197A (en) * 2016-09-30 2017-02-22 华为数字技术(成都)有限公司 Data storage method, device and system

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100595761C (en) * 2007-12-29 2010-03-24 中国科学院计算技术研究所 Metadata management method for splitting name space
CN101577735B (en) * 2009-06-24 2012-04-25 成都市华为赛门铁克科技有限公司 Method, device and system for taking over fault metadata server
CN101692239B (en) * 2009-10-19 2012-10-03 浙江大学 Method for distributing metadata of distributed type file system
CN101697168B (en) * 2009-10-22 2011-10-19 中国科学技术大学 Method and system for dynamically managing metadata of distributed file system
CN104272707B (en) * 2012-04-27 2018-04-06 交互数字专利控股公司 The method and apparatus for supporting neighbouring discovery procedure
CN103688257B (en) * 2012-11-27 2017-04-26 华为技术有限公司 Method and device for managing metadata
CN103150394B (en) * 2013-03-25 2014-07-23 中国人民解放军国防科学技术大学 Distributed file system metadata management method facing to high-performance calculation
CN103218175B (en) * 2013-04-01 2015-10-28 无锡成电科大科技发展有限公司 The cloud storage platform access control system of many tenants
CN103279568A (en) * 2013-06-18 2013-09-04 无锡紫光存储系统有限公司 System and method for metadata management
CN103685453B (en) * 2013-09-11 2016-08-03 华中科技大学 The acquisition methods of metadata in a kind of cloud storage system
CN103544322A (en) * 2013-11-08 2014-01-29 北京邮电大学 Hotspot metadata management method based on server cluster
CN103793534B (en) * 2014-02-28 2017-09-08 苏州博纳讯动软件有限公司 Distributed file system and balanced metadata storage and the implementation method for accessing load
CN105550371A (en) * 2016-01-27 2016-05-04 华中科技大学 Big data environment oriented metadata organization method and system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3995972A4 (en) * 2019-07-05 2022-10-19 ZTE Corporation Metadata processing method and apparatus, and computer-readable storage medium
CN111125024A (en) * 2019-11-29 2020-05-08 浪潮电子信息产业股份有限公司 Method, device, equipment and storage medium for deleting distributed system files
CN111125024B (en) * 2019-11-29 2022-05-24 浪潮电子信息产业股份有限公司 Method, device, equipment and storage medium for deleting distributed system files
CN111176916A (en) * 2019-12-20 2020-05-19 国久大数据有限公司 Data storage fault diagnosis method and system
CN111176916B (en) * 2019-12-20 2023-04-07 国久大数据有限公司 Data storage fault diagnosis method and system
CN112131223A (en) * 2020-09-24 2020-12-25 曙光网络科技有限公司 Traffic classification statistical method, device, computer equipment and storage medium
CN112131223B (en) * 2020-09-24 2024-02-02 曙光网络科技有限公司 Traffic classification statistical method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN106446197B (en) 2019-11-19
CN106446197A (en) 2017-02-22

Similar Documents

Publication Publication Date Title
WO2018058949A1 (en) Data storage method, device and system
CN108170768B (en) Database synchronization method, device and readable medium
US20200341867A1 (en) Method and apparatus for restoring data from snapshots
US10331641B2 (en) Hash database configuration method and apparatus
CN106547859B (en) Data file storage method and device under multi-tenant data storage system
JP6404907B2 (en) Efficient read replica
US20180300385A1 (en) Systems and methods for database zone sharding and api integration
EP3352433B1 (en) Node connection method and distributed computing system
US8458299B2 (en) Metadata management method for NAS global namespace design
JP2020038623A (en) Method, device, and system for storing data
US9697226B1 (en) Network system to distribute chunks across multiple physical nodes
CN102708165B (en) Document handling method in distributed file system and device
JP6264666B2 (en) Data storage method, data storage device, and storage device
CN109582213B (en) Data reconstruction method and device and data storage system
US20150254320A1 (en) Using colocation hints to facilitate accessing a distributed data storage system
CN107368369B (en) Distributed container management method and system
TWI734744B (en) Method, device and system for synchronizing routing table
CN111049928B (en) Data synchronization method, system, electronic device and computer readable storage medium
JP6968876B2 (en) Expired backup processing method and backup server
CN103501319A (en) Low-delay distributed storage system for small files
CN113360456B (en) Data archiving method, device, equipment and storage medium
CN109407975B (en) Data writing method, computing node and distributed storage system
US20150169623A1 (en) Distributed File System, File Access Method and Client Device
CN115774703A (en) Information processing method and device
CN109254958B (en) Distributed data reading and writing method, device and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17854407

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17854407

Country of ref document: EP

Kind code of ref document: A1