WO2020010503A1 - Distributed data storage method and system based on multi-layer consistent hashing (基于多层一致性哈希的分布式数据存储方法与系统) - Google Patents

Distributed data storage method and system based on multi-layer consistent hashing (基于多层一致性哈希的分布式数据存储方法与系统)

Info

Publication number
WO2020010503A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
storage
transaction
nodes
list
Prior art date
Application number
PCT/CN2018/095083
Other languages
English (en)
French (fr)
Inventor
郝斌
Original Assignee
深圳花儿数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳花儿数据技术有限公司
Priority to PCT/CN2018/095083 priority Critical patent/WO2020010503A1/zh
Priority to CN201880005526.5A priority patent/CN110169040B/zh
Priority to US17/059,468 priority patent/US11461203B2/en
Publication of WO2020010503A1 publication Critical patent/WO2020010503A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2094Redundant storage or storage space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • G06F16/137Hash-based
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0617Improving the reliability of storage systems in relation to availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647Migration mechanisms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers

Definitions

  • the invention relates to a distributed storage system, in particular to a distributed data storage system with fault tolerance and automatic load balancing.
  • a distributed storage system stores data on multiple independent devices in a distributed manner.
  • Traditional network storage systems mostly use centralized storage servers to store all data.
  • Centralized storage servers have become a bottleneck for system performance and a single point of concern for reliability and security, and they cannot meet the needs of large-scale storage applications.
  • the distributed network storage system adopts a scalable system structure, which uses multiple storage servers to share the storage load, and uses a location server to locate the stored information. It improves the reliability, availability, and access efficiency of the system to a certain extent.
  • A distributed storage system uses multiple servers to store data together. As the number of servers grows, the probability of server failure keeps increasing; for large-scale storage systems in particular, failures are inevitable, so the system must remain available when a server fails.
  • The general practice is to divide one piece of data into multiple shares and store them on different servers. However, because of faults and parallel storage, multiple copies of the same data may become inconsistent.
  • A distributed storage system requires many servers to work simultaneously, and when the number of servers increases, the failure of some of them is inevitable and affects the whole system. After some nodes fail, the system must still serve the client's read/write requests, that is, the availability of the system must be guaranteed.
  • the invention provides a distributed data storage method and system based on multi-layer consistent hashing, which solves the problems of poor fault tolerance and poor load balancing of the distributed data storage system.
  • A distributed data storage system based on multi-layer consistent hashing includes: multiple storage nodes that provide data storage and redundancy protection; multiple management nodes that maintain storage node attributes and virtual group-storage node mapping information; multiple monitoring nodes that maintain the state of storage nodes and handle storage node addition, deletion, and failure; and one or more clients that provide applications or users with access points to the storage system.
  • The storage node attributes include: node ID (ID), parent node ID (ParentID), layer type (LayerType), storage capacity weight (Weight), node virtual IDs (VIDs), the host ID (ServerID), rack ID (RackID), and cluster ID (ClusterID) to which the node belongs, IP, Port, and node status (Status).
  • Storage nodes form a storage architecture tree based on the attributes of each storage node.
  • the tree has multiple layers, and each layer contains different kinds of nodes.
  • The root layer represents the entire storage cluster; the device layer (the storage nodes) sits at the bottom of the tree as leaf nodes and is the destination for data storage.
  • Each node of the tree (except the leaf nodes, which have no children) is the parent of its direct children, and the weight of the parent equals the sum of the weights of all its direct children.
  • the management node maintains the virtual group-storage node mapping information based on the consistent hash, including: mapping information from the virtual group to the qualified storage node list; mapping information from the virtual group and the failed storage node to the replacement node corresponding to the failed storage node.
  • a Virtual Group corresponds to one partition of a hash space, that is, a hash subspace.
  • Storage nodes exchange their states through heartbeat messages in two ways: during data transmission, the forwarding request from the primary node to a secondary node also serves as a heartbeat packet; when idle, a message from the primary node to a secondary node that contains no object data is used as a heartbeat packet, and the secondary node sends a reply to the primary node to declare that it is online.
  • the distributed data storage method has a fault detection process.
  • Faults are detected through the heartbeat mechanism, which improves the system's fault tolerance, ensures the reliability and availability of the storage process, and avoids the data inconsistencies caused by storage node failures.
  • To distribute data evenly across storage nodes, a multi-layer consistent hashing algorithm is used, which avoids load imbalance when accessing the system.
  • FIG. 1 is a schematic diagram of components included in an exemplary embodiment of the present invention, including a storage node, a management node, a monitoring node, and a client;
  • FIG. 2 is an exemplary hierarchical tree structure diagram of a storage node based on multi-layer consistent hashing
  • FIG. 3 is a schematic diagram of a master node selection process in an exemplary embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an auxiliary node selection process in an exemplary embodiment of the present invention.
  • FIG. 5 is a virtual group-storage node mapping table after an initial data placing process in an exemplary embodiment of the present invention
  • FIG. 6 is a flowchart of a fault detection algorithm in an exemplary embodiment of the present invention.
  • FIG. 7 is a node mapping table when a storage node fails in an exemplary embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a temporary fault repairing algorithm of a repairing part of a distributed hash table in an exemplary embodiment of the present invention
  • FIG. 9 is a schematic diagram of a permanent failure repair method of a distributed hash table repair part in an exemplary embodiment of the present invention.
  • FIG. 10 is a schematic diagram of an algorithm for adding a node in an exemplary embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a node deletion algorithm in an exemplary embodiment of the present invention.
  • A distributed object storage method and system based on multi-layer consistent hashing include: multiple storage nodes providing data storage and redundancy protection; multiple management nodes maintaining storage node attributes and virtual group-storage node mapping information; multiple monitoring nodes maintaining the status of storage nodes and handling state changes such as storage node additions, deletions, and failures; and one or more clients that provide applications or users with access points to the storage system.
  • FIG. 1 is a schematic diagram of components included in an exemplary embodiment of the present invention, including a storage node (SN), a management node (MN), a monitoring node (MoN), and a client.
  • the first step to storing an object is to retrieve the storage node responsible for holding the object.
  • the present invention uses two-stage mapping to map an object to its target storage node.
  • the first-level mapping is to map an object identifier to a virtual group (VG), which is called the first mapping.
  • VG is a hash subspace owned by a virtual node.
  • the virtual nodes are evenly distributed throughout the hash space, and the hash space is uniformly divided into multiple subspaces. We consider VG and virtual nodes to be equivalent.
  • Each (physical) storage node stores multiple VGs, and each VG stores multiple data objects.
  • mapping an object to a VG, or hash subspace includes two steps:
  • Hash calculation: using Key to represent the object ID, the client calculates the hash value of the object ID: ObjectHash = HASH(Key).
  • With VG_NUMBER denoting the total number of VGs, the client maps the object to a VG through a modulo-VG_NUMBER operation:
  • VG_ID = ObjectHash % VG_NUMBER
  • VG is a basic unit for data copy or code storage and storage node selection.
  • VG_NUMBER should be large enough, for example, 100 times the number of storage nodes. Multiple VGs may be located in the same storage node. In fact, VG_NUMBER is only related to the load balancing granularity in the present invention, and is not directly related to the mapping of VG to storage nodes.
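  • For illustration, a minimal sketch of this first mapping is given below. The concrete hash function (MD5 here) and the VG_NUMBER value are assumptions; the description above only requires a deterministic hash and a VG_NUMBER that is large enough.

```python
import hashlib

VG_NUMBER = 4096  # illustrative value; the text only requires it to be "large enough"

def first_mapping(object_id: str) -> int:
    """Map an object ID (Key) to a virtual group ID: VG_ID = ObjectHash % VG_NUMBER."""
    # The excerpt does not name a specific hash function; MD5 is used purely as a stand-in.
    object_hash = int(hashlib.md5(object_id.encode("utf-8")).hexdigest(), 16)
    return object_hash % VG_NUMBER

# The same object ID always lands in the same VG.
print(first_mapping("FILE_42_0"))
```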
  • Second mapping: map the VG to the final storage nodes of the object.
  • the present invention designs a storage load balancing mechanism based on multi-layer consistent hashing to implement the mapping.
  • A weight of 1.0 means the node has 1 TB of storage capacity, and a weight of 2.0 means it has 2 TB.
  • Weights represent the relative storage capacity of a node compared to other nodes.
  • The weight of a node in one layer is equal to the sum of the weights of all its child nodes in the layer below.
  • the second mapping has two sub-processes: primary node selection and auxiliary node selection.
  • the primary node receives requests from clients and distributes the requests to secondary nodes. Both primary node selection and secondary node selection are performed in a hierarchical tree structure. There are nodes on each level of the hierarchical tree, and each node has some predefined attributes.
  • The storage node attribute table in the exemplary embodiment of the present invention includes: node ID (ID), parent node ID (ParentID), layer type (LayerType), storage capacity weight (Weight), node virtual IDs (VIDs),
  • the host ID (ServerID) to which the node belongs, the rack ID (RackID) to which the node belongs, the cluster ID (ClusterID) to which the node belongs, IP, Port, and node status (Status), etc.
  • The hierarchical relationship between nodes can be determined by <ID, ParentID>.
  • The device node is at the lowest storage level and can be a hard disk or a partition; it is the ultimate storage destination for data objects. The other storage levels (such as Server, Rack, etc.) describe the network architecture above the device nodes and are used in the object-to-device-node mapping.
  • FIG. 2 is an exemplary hierarchical tree structure diagram of a storage node based on multi-layer consistent hashing.
  • the selection of nodes should also conform to storage rules.
  • The storage rule includes: a fault domain level, covering disk, server or host (Server), rack, cluster, and the like;
  • a storage policy, including the maximum number of copies of a single object stored on one storage device, the maximum number of copies of a single object stored in one server, the maximum number of copies of a single object stored in one rack, the maximum number of copies of a single object stored in one cluster, etc.;
  • and load balancing parameters, including the deviation threshold of storage usage (for example, the allowed deviation of the actual storage amount between nodes does not exceed 5% or 10%), the maximum CPU utilization, and the maximum network utilization.
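  • As a hypothetical illustration, such a storage rule could be represented by a small record like the following; the field names and values are assumptions, not taken from the patent text.

```python
# Hypothetical storage-rule record; field names are illustrative only.
storage_rule = {
    "fault_domain_level": "server",      # disk / server / rack / cluster
    "max_replicas_per_disk": 1,          # max copies of a single object on one device
    "max_replicas_per_server": 1,
    "max_replicas_per_rack": 2,
    "max_replicas_per_cluster": 3,
    "storage_usage_deviation": 0.05,     # e.g. 5% allowed deviation between nodes
    "max_cpu_utilization": 0.80,
    "max_network_utilization": 0.70,
}
```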
  • FIG. 3 shows an exemplary VID allocation diagram in the node layer, which is used to explain the master node selection process.
  • Figure 2 shows a cluster architecture diagram. The entire cluster is represented by the top-level root node (root). The root node has two racks, each rack has two hosts / servers, and each host has two disks.
  • A disk, called a device or storage node, is a special type of node (it has no sub-layer nodes) and is the target device for storing data.
  • Device nodes are always at the bottom of the entire cluster.
  • the master node selection uses the Depth First Search (DFS) method, using storage rules as a pruning strategy, and selecting nodes or devices from the root node to the leaf nodes in the cluster hierarchy.
  • the master node selection process is as follows. Before the storage cluster can be used to store data, that is, when the cluster is initialized, the master node selection can be performed by the system administrator.
  • VID_NUM = 5 is set according to the weight, as shown in FIG. 3,
  • each node will be assigned multiple VIDs.
  • the number of VIDs (VID_NUM) allocated to each node in each layer is determined according to the node weight, and the weight of each node is equal to the sum of the weights of all the child nodes in its immediate lower layer. For example, assuming unit weight represents 1T storage space, we can allocate 5 VIDs for 1T storage. If a node has 5T storage, it will be assigned 25 VIDs. All immediate children of the parent node will share the same hash ring / space. In this way, all storage devices or nodes in the server (ie, the lower device layer of the server) will share the same hash ring / space.
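  • A minimal sketch of this weight-proportional VID allocation is shown below, assuming 5 VIDs per unit of weight as in the example above. The hash used to place VIDs on the shared ring and the VID labels are assumptions; only the proportionality of VID_NUM to weight follows the description.

```python
import hashlib

VIDS_PER_UNIT_WEIGHT = 5  # e.g. 5 VIDs per 1 TB of capacity, as in the example above

def hash_ring_position(label: str) -> int:
    # Stand-in hash; the text does not fix a particular function for placing VIDs.
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % (2 ** 32)

def allocate_vids(node_id: str, weight: float) -> dict:
    """Assign VID_NUM = weight * VIDS_PER_UNIT_WEIGHT virtual IDs to a node and
    place them on the hash ring shared by all children of the same parent."""
    vid_num = int(round(weight * VIDS_PER_UNIT_WEIGHT))
    return {f"{node_id}#VID{i}": hash_ring_position(f"{node_id}#VID{i}")
            for i in range(vid_num)}

# A 1 TB disk gets 5 VIDs, a 5 TB disk gets 25 VIDs on the same server-level ring.
ring = {}
ring.update(allocate_vids("server1/disk0", 1.0))
ring.update(allocate_vids("server1/disk1", 5.0))
print(len(ring))  # 30 positions on server1's hash ring
```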
  • Figure 4 illustrates the process of auxiliary node selection.
  • the master node is used to store the original object (for the multi-copy strategy) and the first block (for erasure coding, network coding, etc.) generated by the encoding algorithm.
  • the auxiliary node is used to store a copy of the original object (multiple copy strategy) or to store other data / parity / encoded blocks (for erasure codes, network coding, etc.) after the object is encoded.
  • Two strategies can be used for secondary node selection:
  • under one of them, each node can be selected multiple times, that is, each primary or secondary node can hold multiple copies or blocks of an object or VG.
  • Strategy (2) will be apparent to those skilled in the art. As with primary node selection, the secondary nodes are still selected from the root node down to the leaf nodes of the cluster hierarchy in a depth-first search (DFS) manner, using the storage rule as a pruning strategy.
  • Storage rules set limits when selecting secondary nodes. Whether a node can become a candidate for the auxiliary node is determined by whether the candidate conflicts with other selected nodes (if selected) and whether it meets the constraints defined in the storage rules.
  • FIG. 4 is a hierarchical diagram of an exemplary embodiment of selecting a secondary storage node based on the location of the primary node selected in the aforementioned primary node selection process.
  • A VID represents a subspace of the hash ring in each layer. Assume that disk 1 in server 1 was selected as the VG's primary node during primary node selection, and that the VG lies in the hash subspace of VID11 belonging to server 1.
  • Starting from HASH(VG_ID), the first VID (hash subspace) encountered in the clockwise direction is chosen. Assume that one of the VIDs of disk 2 is selected, that is, the hash value of that disk VID immediately follows HASH(VG_ID). As shown in FIG. 4, disk 2 in server 2 is selected as the first secondary node.
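  • A minimal sketch of this clockwise successor lookup on a hash ring is shown below; the ring positions and the hash function are illustrative assumptions.

```python
import bisect
import hashlib

def clockwise_successor(ring: dict, vg_id: int):
    """Return the (vid_label, owner) whose ring position is the first one found
    clockwise from HASH(VG_ID), wrapping around the ring if necessary."""
    start = int(hashlib.md5(str(vg_id).encode()).hexdigest(), 16) % (2 ** 32)
    # ring maps vid_label -> (position, owner_node)
    points = sorted((pos, label, owner) for label, (pos, owner) in ring.items())
    positions = [p for p, _, _ in points]
    idx = bisect.bisect_left(positions, start) % len(points)
    _, label, owner = points[idx]
    return label, owner

# Example ring with two disk VIDs under the same layer (positions are illustrative).
ring = {"disk1#VID11": (100, "disk1"), "disk2#VID21": (900, "disk2")}
print(clockwise_successor(ring, vg_id=7))
```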
  • FIG. 5 shows the VG-storage node mapping table after the initial data placement process in an exemplary embodiment of the present invention; the table is built from the hierarchical topology of the cluster nodes.
  • the storage location can be calculated by taking the cluster hierarchy structure, the weight of each node in the cluster, the list of assigned VIDs of each node, and the current storage rules as inputs.
  • This process consists of two sub-processes: primary node selection and secondary node selection. The processing process has been described above and will not be explained here.
  • the VG-storage node mapping can be saved in a distributed memory database to obtain high performance.
  • the VG-storage node mapping table is maintained by the management node.
  • each record or entry of the VG-storage node table contains a VG_ID and a list of corresponding qualified nodes, the latter including a primary node and a secondary node.
  • For each object, the client first obtains the VG by calculating the first mapping, and then uses the VG_ID as the primary key to query the VG-storage node table. The client sends the object to the nodes listed in the storage node list, or reads the object from a node in that list.
  • The VG-storage node mapping starts from the initial data layout mapping, before the cluster starts running and before any storage node fails.
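  • For illustration, a minimal client-side routing sketch combining the two mappings is given below; the table contents, the key value, and the hash function are assumptions.

```python
import hashlib

VG_NUMBER = 4096  # must match the value used for the first mapping

# Illustrative VG -> qualified node list ("second mapping"), maintained by management nodes.
# The first entry of each list is the primary node; the rest are secondary nodes.
# The key 2613 is chosen arbitrarily for illustration.
vg_storage_node_table = {
    2613: ["server1/disk1", "server2/disk2", "server2/disk0"],
}

def first_mapping(object_id: str) -> int:
    return int(hashlib.md5(object_id.encode()).hexdigest(), 16) % VG_NUMBER

def route_object(object_id: str):
    """Client-side routing: object -> VG -> primary/secondary storage nodes."""
    vg_id = first_mapping(object_id)
    nodes = vg_storage_node_table.get(vg_id)
    if nodes is None:
        raise KeyError(f"no qualified node list recorded for VG {vg_id}")
    return nodes[0], nodes[1:]   # send writes to the primary; it forwards to the secondaries
```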
  • A master monitoring node is online and maintains the status of all storage nodes. If a monitoring node goes offline due to a failure, another monitoring node can be started by the administrator or restarted automatically.
  • a node ID is used as a primary key, and the states of all storage nodes are stored in a distributed memory database to achieve high performance.
  • Monitoring nodes and clients can retrieve the node status of the VG eligible storage node list. The status maintenance of each VG qualified storage node list is coordinated by the monitoring nodes with the help of the primary and secondary nodes. Fault monitoring can be achieved through a heartbeat mechanism.
  • When in the idle state, that is, when no request message is received from the client, the primary node sends a heartbeat message to each secondary node to declare its primary role.
  • When processing a request from a client, the primary node sends a copy of the object or its chunk to the secondary nodes, which is also treated as a heartbeat message; in this case, the real heartbeat packet is delayed.
  • the primary node of each VG sends heartbeats to all secondary nodes of the VG, and then each secondary node sends a confirmation message to the primary node to declare its alive status. If any secondary node does not reply to the heartbeat confirmation message within a preset time interval, the primary node reports a secondary node failure message to the monitoring node.
  • If any secondary node does not receive a heartbeat message within the current time interval, the secondary node reports a failure message about the primary node to the monitoring node.
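  • A minimal sketch of this heartbeat-based failure detection is given below. The intervals and class names are assumptions; both forwarded requests and idle heartbeats count as liveness signals, as described above.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds; how often the primary sends idle heartbeats (assumed)
HEARTBEAT_TIMEOUT = 5.0    # silence longer than this is treated as a failure (assumed)

class SecondaryWatchdog:
    """Primary-side bookkeeping: forwarded requests and idle heartbeats both refresh a
    secondary's timestamp; a secondary silent past the timeout is reported as failed."""

    def __init__(self, secondaries):
        now = time.monotonic()
        self.last_ack = {node: now for node in secondaries}

    def on_ack(self, node):
        # Called when the secondary confirms a heartbeat or a forwarded request.
        self.last_ack[node] = time.monotonic()

    def failed_secondaries(self):
        now = time.monotonic()
        return [n for n, t in self.last_ack.items() if now - t > HEARTBEAT_TIMEOUT]

# The primary would report failed_secondaries() to a monitoring node; symmetrically, a
# secondary that hears nothing from the primary for HEARTBEAT_TIMEOUT reports the primary.
```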
  • A VG version (VG_Version) indicates state changes of the primary storage node in the VG eligible node list.
  • When a primary node fails, a temporary primary node is subsequently selected, and VG_Version is increased by one.
  • the real heartbeat message contains only VG_Version and no other data.
  • The monitoring node invokes the temporary primary node (TPN) selection algorithm to select a temporary primary node, according to the temporary primary node selection principle, for the failed primary node.
  • A primary replacement node is also selected for the failed primary node.
  • the monitoring node selects a secondary replacement node for the failed secondary node.
  • the failed primary node or failed secondary node replaced by the replacement node is called the host node.
  • Temporary primary node selection principle: the primary storage node held the longest committed transaction list (CTL) and submitted transaction list (STL) before the failure.
  • the new temporary main storage node should maintain the CTL and STL invariance (That is, the selected temporary primary storage node has the longest CTL and STL among all candidate auxiliary nodes).
  • Selection principle Always select the first secondary storage node without any replacement node as the temporary primary node (TPN) with the largest reported transaction ID, the largest committed transaction ID, and the latest VG version. If the old master node returns to the cluster within the time TEMP_FAILURE_INTERVAL, it will be assigned again as the master node.
  • the so-called invariance of the master node means that the master node always maintains or restores the identity of the master node as much as possible.
  • Temporary master node selection The monitoring node selects a new master node from the current online auxiliary storage nodes according to the principle of selecting a temporary master node. If any secondary storage node does not receive a request message from the primary node within a preset time interval, the secondary storage node will report the failure information of the primary node to the monitoring node, and the temporary primary node selection process will start. If any secondary node does not receive any heartbeat or request message from the primary node, the primary-secondary node relationship (PSR) will be considered as invalid by the secondary node.
  • The secondary node whose PSR has failed reports the change message to the monitoring node.
  • The fault report message consists of the latest committed transaction ID (LCMI), the latest submitted transaction ID (LSTI), and the current VG version (Current Version, CVV).
  • Upon receiving the report messages from the secondary nodes acting as primary candidates, the monitoring node first checks the CVV to exclude candidates with lower VG versions, then checks the LCMI of the remaining candidates to exclude nodes with lower LCMI, and finally checks the LSTI to exclude nodes with lower LSTI.
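  • A minimal sketch of this elimination order (VG version first, then committed transaction ID, then submitted transaction ID) is given below; the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: str
    cvv: int    # current VG version reported by the candidate
    lcmi: int   # latest committed transaction ID
    lsti: int   # latest submitted transaction ID

def select_temporary_primary(candidates):
    """Keep the candidate with the highest VG version, breaking ties by the highest
    committed transaction ID and then the highest submitted transaction ID."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: (c.cvv, c.lcmi, c.lsti))
    return best.node_id

print(select_temporary_primary([
    Candidate("disk2", cvv=3, lcmi=120, lsti=125),
    Candidate("disk0", cvv=3, lcmi=118, lsti=130),
]))  # -> "disk2": same VG version, larger committed transaction ID
```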
  • FIG. 6 is a flowchart of a fault detection algorithm in an exemplary embodiment of the present invention.
  • Any node, whether primary or secondary, may fail, and the monitoring node will choose a replacement node to handle the requests that would have been assigned to the failed node.
  • the selection of replacement nodes will be performed according to the principle of local affinity. In other words, the new selected candidate is first selected in the same layer, for example, first in the same server layer, and then in the same rack layer.
  • the principle of local affinity means that the failed node and the surrogate node may be in the same fault domain. If the candidate in the same fault domain is already in the qualified node list, that is, conflicts with another storage node in this layer, the selection is continued in the upper layer in DFS mode until a replacement node that meets the storage rules is selected. Similar to secondary node selection, the DFS search process uses storage rules as a pruning strategy. If a failed node rejoins the cluster and recovers data quickly, selection using local affinity principles ensures that data is quickly transferred from the replacement node to the rejoined node.
  • FIG. 7 is a VG-replacement node mapping table when a storage node fails in an exemplary embodiment of the present invention.
  • the VG-Replacement Node (Replacement Node) mapping can be implemented through a Distributed Memory Database (DMDB) table.
  • the VG-replacement node table is stored as metadata in the management node cluster. Any monitoring node can query the table. If any monitoring node fails, it can be restarted on the current or other server. Due to the consistency of the database, the VG-replacement node table is stable (consistent).
  • The VG-replacement node table consists of entries in the format of <VG_ID, Host Node ID, Replacement Node ID List>, with <VG_ID, Host Node ID> as the primary key.
  • the host node (Host ID) is the primary or secondary node that fails. For each VG, there may be multiple replacement nodes for each host node, because the replacement node may also fail. When the replacement node fails, another replacement node should be selected for this failed host node. However, there is only one online replacement node at any time. The status of the replacement node and all storage nodes is maintained by the management node.
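  • For illustration, a minimal in-memory stand-in for this table is sketched below: a Python dict keyed by <VG_ID, Host Node ID>. The function names and the online-set parameter are assumptions.

```python
# Stand-in for the DMDB table keyed by <VG_ID, Host Node ID>; the value is the history of
# replacement node IDs for that failed host, with at most one of them online at a time.
vg_replacement_table = {}

def add_replacement(vg_id: int, host_node_id: str, replacement_node_id: str) -> None:
    vg_replacement_table.setdefault((vg_id, host_node_id), []).append(replacement_node_id)

def current_replacement(vg_id: int, host_node_id: str, online: set):
    """Return the single online replacement node for this failed host, if any."""
    for node in vg_replacement_table.get((vg_id, host_node_id), []):
        if node in online:
            return node
    return None

add_replacement(17, "server1/disk1", "server1/disk0")
print(current_replacement(17, "server1/disk1", online={"server1/disk0"}))
```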
  • Temporary fault repair and permanent fault repair include two stages: Distributed Hash Table (DHT) repair and data repair. DHT fixes align the mapping from VG to the list of eligible storage nodes with the changes in the storage node member status in the cluster. Data repair moves data to its destination based on the mapping from VG to qualified storage nodes. The mapping from VGs to qualified storage nodes is maintained by the management node, and updates to different locations of different VG entries can be performed simultaneously by different management nodes. This is what "distributed" means in a distributed hash table.
  • Temporary failure fix for distributed hash table DHT.
  • When a storage node receives a cluster state change indicating that some nodes have been marked as TEMPORARY_FAILURE, it traverses all the virtual groups (VGs) it stores, and each affected VG is marked TEMPORARY_RECOVERING. If the failed node is the primary node, its responsibility is taken over by the temporary primary node (TPN) selected using the above-mentioned TPN selection algorithm. The VIDs (usually multiple) of the faulty node are temporarily masked, and the TPN first selects a replacement node in the same fault domain as the faulty node according to the principle of local affinity, such as the same server, the same rack, and so on.
  • Update operations during the repair will be redirected to the replacement node. If the failed node is recovered within the time TEMPORARY_FAILURE_INTERVAL, the update operation (data) will be moved from the temporary node to the newly recovered node. This is the function of temporary failure recovery of the distributed hash table.
  • FIG. 8 is a schematic diagram of a temporary failure recovery algorithm in an exemplary embodiment of the present invention.
  • As shown in FIG. 8, the first secondary node, disk 2 of server 2, becomes the TPN.
  • the faulty VG is in the hash space of VID11 in disk 1 of server 1.
  • the monitoring node selects another VID that does not belong to disk 1 and is located in the same device layer. Select the VID that is not in the failed disk 1 and is closest to VID11 in the device layer hash space, so select VID00 located in disk 0.
  • All updates will be redirected to the replacement node in disk 0.
  • After the failed primary node in disk 1 returns, it is selected as the primary node again, and all of its VIDs (generally multiple) are returned to the hash ring.
  • the monitoring node For the temporary failure recovery of the distributed hash table, the monitoring node will update the storage node information table (including status), the VG-storage node mapping table, and the VG-replacement node mapping table.
  • the monitoring node first modifies the storage node table and marks the status of the failed node as OFFLINE; for the VG-qualified storage node mapping table update, the status of the corresponding failed node in the list is also marked as OFFLINE.
  • Until the failed node rejoins the cluster and its status changes back to ONLINE, it will not be selected as a primary or replacement node, and all VIDs assigned to it are masked in the hash space.
  • The monitoring node adds an entry <VG_ID, failed node ID, replacement node ID> to the VG-replacement node mapping table. There may be multiple VGs stored on the failed node.
  • The monitoring node traverses the VG-storage node mapping table to find every VG with a failed node in its qualified storage node list. For all such VGs, <VG_ID, failed node ID, replacement node ID> is inserted into the VG-replacement node mapping table.
  • When a monitoring node receives a cluster state change indicating that some nodes have been marked PERMANENT_FAILURE, it traverses all VGs; if any node in a VG's qualified node list is in the PERMANENT_FAILURE state, it marks the status of that VG as PERMANENT_FAILURE_RECOVERING. The monitoring node then proceeds as follows:
  • the upper nodes of the failed node are re-weighted. If necessary, the upper node may block one or more VIDs.
  • For example, if one VID of a server corresponds to the capacity of two disks, and the two disks in the server that correspond to that VID fail, the server masks this VID. If the next VID after this masked VID in the hash space is on a different server, data migration from this server to that other server will occur. If no VID masking happens at the upper layer, then according to the consistent hashing principle the data migration stays between the nodes of the same layer, and the migration is faster than migrating to other layers. This is the advantage of the local affinity principle.
  • FIG. 9 is a schematic diagram of a permanent failure recovery algorithm of a distributed hash table in an exemplary embodiment of the present invention.
  • Temporary and permanent failure repair of VG data.
  • the data repair process is coordinated by the VG's master node (PN) or temporary master node (TPN) or independent repair nodes. These nodes can be monitoring nodes or newly added nodes.
  • VG is the basic repair unit.
  • the master node handles the repair process.
  • some permanent variables are stored for data storage and repair.
  • the initial value of the VG version is 0; when the list of VG eligible nodes changes, it increases monotonically by 1.
  • After a node rejoins the cluster it connects to the storage management node and checks for any replacement nodes. If no replacement node exists, there is no need to repair the data of the reconnected node. If there is a replacement node and one is online, the reporting and commit transaction table is copied from the corresponding replacement node to the rejoining node.
  • the reporting transaction table (STL) and commit transaction table (CTL) can be stored hierarchically, that is, the STL is stored in a high-performance medium (such as an SSD), and the CTL is stored in a large-capacity but low-performance Media.
  • On the replacement node, neither the STL nor the CTL is applied (flushed) to a persistent storage medium (such as an HDD); they will be applied at their final storage destination, that is, the rejoining node.
  • the replacement node will play the role of the failed host node, which means that STL and CTL will be applied to the replacement node.
  • the repair will be delayed up to TEMPORARY_FAILURE_INTERVAL. If the replacement node recovers within the TEMPORARY_FAILURE_INTERVAL time, the rejoining node copies the STL and CTL from the replacement node to the local storage medium. If no replacement node is returned to the cluster within this time period, rejoining the node requires repairing the data previously redirected to the replacement node.
  • The rejoining node sends its last committed transaction ID to the primary node (PN) or temporary primary node (TPN), and the PN or TPN replies with its first committed transaction ID (FCTI), last committed transaction ID (LCTI), first submitted transaction ID (FSTI), and last submitted transaction ID (LSTI). Object data whose transaction IDs fall in the range [Rejoining LCTI, Primary LCTI] must be repaired/regenerated through mirror copy, erasure coding, or network coding.
  • Object data whose transaction IDs fall in the range [Primary FSTI, Primary LSTI] (when no online replacement node is available),
  • or in the range [Rejoining FSTI, Replacement FCTI] (when there is an online replacement node), must be repaired/regenerated to restore the STL.
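  • A minimal sketch of these repair-range rules is given below; the parameter names are assumptions, and the ranges are inclusive, following the bracketed notation above.

```python
def repair_ranges(rejoining_lcti, rejoining_fsti, primary_lcti,
                  primary_fsti, primary_lsti, replacement_fcti=None):
    """Transaction-ID ranges a rejoining node must repair, per the rules quoted above.
    Ranges are inclusive [lo, hi]; a range with lo > hi means nothing to repair."""
    ranges = []
    # Committed data missed while the node was away: [Rejoining LCTI, Primary LCTI].
    ranges.append(("committed", rejoining_lcti, primary_lcti))
    if replacement_fcti is None:
        # No online replacement node: restore the STL from [Primary FSTI, Primary LSTI].
        ranges.append(("submitted", primary_fsti, primary_lsti))
    else:
        # Online replacement node: restore [Rejoining FSTI, Replacement FCTI].
        ranges.append(("submitted", rejoining_fsti, replacement_fcti))
    return ranges

print(repair_ranges(rejoining_lcti=100, rejoining_fsti=101,
                    primary_lcti=140, primary_fsti=141, primary_lsti=160))
```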
  • The management node will reset this newly rejoined node as the primary node, and the new primary node continues to process requests from clients. If the storage node cannot rejoin the cluster within the time TEMPORARY_FAILURE_INTERVAL, it is treated as a permanently failed node (PFN). The replacement node of the permanently failed node is promoted into the list of qualified nodes, and the data lost on the PFN must be repaired onto this new qualified node.
  • The primary or temporary primary node traverses all the objects stored in the VG directory and sends the object name list and the auxiliary repair node ID list to the new qualified node. If multiple copies are used, the auxiliary repair node is the primary or temporary primary node itself; if an (N, K) erasure code is used, the auxiliary repair nodes are the first K online nodes in the VG qualified node list. The new qualified node repairs the faulty objects one by one.
  • The name components in the object name list include but are not limited to: object ID, object version, last operation type code, and latest transaction ID. Note that an object with a lower version cannot overwrite an object with the same object ID but a larger or equal version.
  • The auxiliary repair nodes are managed by the management node. The principle for selecting an online auxiliary repair node is: in the VG eligible node list, first prefer nodes without a replacement node, then nodes with a larger committed transaction ID, and finally nodes with a larger submitted transaction ID.
  • Node addition process: as nodes are added, the cluster hierarchy changes. The upper nodes of the added node are re-weighted up to the root layer. If the added storage space is large enough, nodes on the re-weighted path are assigned new VIDs. Data is migrated to the new device domain from other device domains in the same layer. For example, with two racks in the same tier, if one rack adds many disks, data from the other rack is migrated to it for storage load balancing.
  • FIG. 10 is a flowchart of a node adding algorithm in an exemplary embodiment of the present invention. Suppose a new disk 2 is added to server 1 at the device level.
  • the monitoring node randomly assigns VID to disk 2 according to its weight, such as new disk 2 VID21, new disk 2 VID22, and so on.
  • the weight of server 1 should be increased to remain equal to the sum of the weights of all its child nodes, as shown in disks 0, 1, and 2. If enough disks are added to reach the weight threshold of a VID in the upper layer, server 1 needs to randomly allocate an additional VID. Assume that server 1 does not add a new VID, and data migration occurs only between the old and new disks. Because the new disk 2 has a new VID corresponding to the subspace of the hash ring under the server 1, the data will be migrated from disk 0 and / or disk 1 to the new disk 2 to achieve storage load balancing. If enough disks are added to server 1, it may be assigned additional VIDs, which will cause the hash partition in the server layer to change, which will cause data migration within the server layer.
  • the monitoring node is responsible for adding node processing, and the process is as follows:
  • The qualified node list corresponding to the VG is updated in the VG-storage node table; that is, after the VG data is moved to the predecessor node, the predecessor node replaces the successor node in the qualified node list.
  • Node removal/decommissioning process: this process is similar to permanent failure recovery, except that failure handling is passive and unpredictable, whereas removal is active and can be scheduled by the cluster administrator. It does not require data repair, only data migration at this layer or above. When multiple disks are removed and enough VIDs are masked, the upper-level device at the removal point may require VID adjustments. Owing to the characteristics of consistent hashing, data migration only occurs between the node being removed and other storage nodes; there is no migration between storage nodes that are not removed.
  • FIG. 11 is a schematic diagram of a remove/discard node processing algorithm in an exemplary embodiment of the present invention. Assume that disk 2 at the device level is removed from server 1 at the server level. The monitoring node should mask all VIDs assigned to disk 2. Disk 2's weight is set to 0, which means that disk 2 cannot receive any data from this point on. To keep the weight of server 1 equal to the sum of the weights of its remaining child nodes (disk 0 and disk 1 in FIG. 11), the weight of server 1 should be updated, that is, the weight of disk 2 is subtracted. If enough disks are removed, that is, if the weight drop reaches the threshold corresponding to one VID in the server layer, server 1 may need to mask a VID.
  • For each VID assigned to the node to be removed (referred to as the predecessor VID), look in clockwise order for a successor VID belonging to another node (the successor node) in the same layer.
  • The predecessor VID is the predecessor of the successor VID in the same-layer hash space, and the node to be removed is on the same layer as the successor node.
  • For each VG found in step 2 that should be moved to the successor node, update the list of eligible nodes corresponding to the VG in the VG-storage node mapping table; that is, after moving the data in the VG to the successor node, replace the predecessor node with the successor node in the list of eligible nodes.
  • the system can provide high availability and high reliability.
  • the following describes a data storage service provided in an exemplary embodiment of the present invention.
  • the storage service can be regarded as an abstract storage layer of a distributed storage system, and provides an access port for clients to access the storage system (read or write).
  • We achieve strong consistency by ensuring that the same sequence of operations on a particular object is performed on all nodes in the same order.
  • The same operation on an object can be requested multiple times, but because object updates are versioned, the operation is performed only once. Data consistency is maintained even in the event of a failure. From the perspective of a single client, the distributed storage system behaves as if a single storage node were running.
  • Object read, write, update, and delete processes We call file fragments that exist in the client or the master node (using a multi-copy storage strategy), or encoded blocks of file fragments (using erasure coding or network coding storage strategies) as objects.
  • Each encoded block from a file fragment has the same object ID as its local file fragment, and the block ID is obtained based on the storage node position in the VG eligible node list.
  • Each file segment or coded block is stored as a single file in the storage node, and its file name includes but is not limited to: object ID, block ID, version number, and so on.
  • Object writing process: assume the client wants to store a file in the storage system. The client first divides the file into file segments with a preset size of SEGMENT_SIZE. If the file size is less than SEGMENT_SIZE, zeros are appended at the end so that the file occupies an entire file segment.
  • the file segment ID (SEGMENT_ID) may be calculated based on a file ID (FILE_ID) assigned to a file by, for example, a management server.
  • the management node cluster maintains and manages metadata of all files.
  • The metadata store may be implemented in a distributed memory database. The file segment index increases monotonically from 0 within the file, and the segment ID is derived from it.
  • The first segment of the file FILE_ID has the file segment ID FILE_ID_0,
  • the second segment has the ID FILE_ID_1, and so on.
  • each file segment from a file is copied or encoded into multiple blocks, and then stored in a storage node using two mappings.
  • With the multi-copy strategy, the data of each block is the same as the file segment; if a (K, M) encoding scheme is used, for erasure coding or network coding, the file segment is first divided into K original data blocks and then encoded to generate M coded/parity blocks.
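  • For illustration, a minimal sketch of the segmentation and block-generation step is given below. SEGMENT_SIZE and the replica count are assumptions, and the (K, M) encoding itself is only described in a comment, since the text allows any erasure or network code.

```python
SEGMENT_SIZE = 4 * 1024 * 1024  # illustrative; the text leaves SEGMENT_SIZE as a preset

def split_into_segments(file_id: str, data: bytes):
    """Split a file into fixed-size segments, zero-padding the last one, and name
    them FILE_ID_0, FILE_ID_1, ... as described above."""
    segments = []
    for i in range(0, max(len(data), 1), SEGMENT_SIZE):
        chunk = data[i:i + SEGMENT_SIZE].ljust(SEGMENT_SIZE, b"\x00")
        segments.append((f"{file_id}_{i // SEGMENT_SIZE}", chunk))
    return segments

def make_blocks(segment: bytes, replicas: int = 3):
    """Multi-copy strategy: each block is an identical copy of the segment.
    With a (K, M) erasure or network code, the segment would instead be split into
    K data blocks and encoded into M parity blocks (encoding omitted here)."""
    return [(block_id, segment) for block_id in range(replicas)]

segs = split_into_segments("FILE_42", b"hello world")
print(segs[0][0], len(segs[0][1]))  # FILE_42_0 4194304
```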
  • All blocks generated by the file segment are retrieved by SEGMENT_ID in the storage system layer, and retrieved by the combination of SEGMENT_ID and BLOCK_ID in the local file system of the storage node.
  • Each block is treated as an object. For each file segment,
  • the client uses a hash function to calculate the VG ID (VG_ID), that is, the first mapping from the segment/object to the VG:
  • VG_ID = Hash(SEGMENT_ID) % VG_NUMBER.
  • the client retrieves the ID of the primary storage node of the VG from one of the management nodes, and the management node will maintain the mapping from the VG to its qualified storage node list.
  • the selection of the management node can be load balanced by a hash function:
  • the client sends the file segment / object to the master node, and the initial version of the object is 0.
  • After receiving the file segment, the primary node searches the VG_ID directory and checks whether an object with the same ID as the new object/file segment exists. If so, the primary node rejects the request and responds with the current version of the object. If the object does not exist in the VG_ID directory, the primary node increases the current VG transaction ID by 1, combines <TransactionID, VGVersion, ObjectOperationItem> to form a transaction, appends the new transaction to the submitted transaction list (STL), and increases the STL length by 1.
  • ObjectOperationCode includes but is not limited to WRITE, READ, UPDATE, DELETE, etc. For write operations, ObjectOperationCode is WRITE.
  • STL can be stored in high-performance media such as SSD log files. In a preferred embodiment of the invention, STL can be implemented using a log grading mechanism (see below).
  • the master node modifies the BlockID in the transaction, and uses the message mechanism to forward the modified transaction to the corresponding auxiliary node.
  • the BlockID in the transaction indicates the auxiliary node.
  • Each object request is called a transaction.
  • the transaction composition generally includes: transaction ID, virtual group ID, virtual group version, object ID, object version, object data, and operation type.
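  • A minimal sketch of such a transaction record is given below. The listed fields (transaction ID, VG ID and version, object ID and version, object data, operation type), the block ID rewritten during forwarding, and the committed mark follow the description; the concrete types are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class OpType(Enum):
    WRITE = 1
    READ = 2
    UPDATE = 3
    DELETE = 4

@dataclass
class Transaction:
    """Fields named in the description; Python types are assumptions."""
    transaction_id: int
    vg_id: int
    vg_version: int
    object_id: str
    object_version: int
    object_data: bytes
    op_type: OpType
    block_id: int = 0          # rewritten by the primary when forwarding to a secondary
    committed: bool = False    # set when the COMMIT step persists the data

tx = Transaction(101, 17, 3, "FILE_42_0", 0, b"...", OpType.WRITE)
```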
  • After receiving the transaction/request message, each secondary node checks the VG version (VGVersion) carried in the transaction against its current VG version (CurrentVGVersion). If VGVersion < CurrentVGVersion, the request is rejected; otherwise, the transaction is appended to the local submitted transaction list (STL) of this VG and a success confirmation message is sent to the primary node.
  • After receiving a (write) success confirmation from all the secondary nodes, the primary node sends a confirmation message to the client to confirm that the requested file segment has been correctly stored in the storage system, and then asynchronously saves the file segment or its encoded block to the local disk as a single file; we call this step commit (COMMIT). At the same time, the primary node sends a COMMIT request to all secondary nodes so that they persistently store the corresponding blocks.
  • the transaction submission mechanism includes the processing of the main node, the processing of the auxiliary node, and the processing of the replacement node.
  • If the object corresponding to the transaction is a data block (the whole file under the multi-copy policy, or an unencoded original data block under erasure or network coding),
  • the data in the transaction overwrites the data of the file in the <OBJECT_OPERATION_OFFSET, OBJECT_OPERATION_LENGTH> interval;
  • otherwise (for a coded/parity block), the data in the transaction needs to be merged (exclusive-OR operation) with the data of the current file in the <offset ObjectOperationOffset, length ObjectOperationLength> interval.
  • The file name corresponding to this transaction is changed to ObjectID.BlockID.ObjectVersion.
  • the master node marks the transaction as committed.
  • the replacement node selection algorithm of the present invention ensures that the primary node obtains a confirmation message from all auxiliary nodes that the transaction is correctly submitted with a high probability.
  • the master node If the master node does not receive all confirmations within a preset interval (COMMIT_TIMEOUT), it marks this transaction as DESUBMITTED and regards it as an invalid transaction, and sends a DESUBMITTED message to notify all secondary nodes to mark this transaction as DESUBMITTED. If the DESUBMITTED transaction is the last entry in the log file, it will be deleted.
  • The primary node sends a query transaction (SEARCH_TRANSACTION) message of the format <VGTransactionID, VGVersion, ObjectID, BlockID, ObjectVersion> to all surviving secondary nodes. If all secondary nodes contain this transaction, the primary node can safely commit it; the process is similar to step 2. If some secondary nodes do not have this transaction, the primary node replicates the transaction again to the corresponding missing secondary nodes, which should have been completed by the previous primary node.
  • When the redundancy strategy is erasure coding, the primary node attempts to regenerate (decode and repair) the original file segment and re-encode it to regenerate the faulty or undistributed blocks; when the redundancy strategy is network coding, the primary node collects enough surviving blocks from the surviving secondary nodes to regenerate the failed/undistributed blocks (without recovering the entire file segment). After the faulty block repair is completed, the primary node regenerates the transaction and forwards it again to the corresponding missing secondary nodes; this transaction should also have been completed by the previous primary node.
  • Secondary node processing: after receiving a transaction commit request (CommitTransaction) in the form <TransactionID, VGVersion>, each secondary node searches for the transaction in its submitted transaction list (STL). If it finds a pending transaction whose TransactionID and VGVersion both match, the secondary node commits the transaction (persistently stores the object data in this transaction to the local file system).
  • Replacement node processing: when a transaction commit request in the format <TransactionID, VGVersion> is received, the replacement node of the failed host node searches for the transaction in its STL. If a pending transaction matching both the TransactionID and the VGVersion is found, the replacement node does not need to commit the transaction; it only marks the transaction as COMMITTED. Once the replacement node becomes a primary or secondary node, the transactions in the STL are committed during the repair of the failed host node. If an UPDATE transaction is involved, the host node's transaction object may need to be merged with the recovered object on the replacement node.
  • the primary node will update the VG eligible node list from the management node. If the update is successful, each failed secondary node will be temporarily replaced by the replacement node. The master node will resend the transaction to the new replacement node. After all nodes including the secondary node and its replacement node return a confirmation message that the transaction was successfully executed, the primary node responds with a message that the client transaction was successful. If the replacement node of the failed node cannot be retrieved from the management node, the master node will repeat the query process indefinitely, and all requests for the object by the client will be rejected.
  • Log grading mechanism In order to achieve low latency, especially for update operations, in a preferred embodiment of the present invention, all transactions are sequentially appended to the end of the submitted log file in a high-performance storage medium such as an SSD, that is, the report transaction list is stored in high-performance SSD.
  • SSDs are expensive, and the capacity at the same price is lower than HDDs, while HDD hard disks are cheap and large.
  • the transaction information in the commit log file can be considered as the metadata of the target file stored in the same storage node.
  • Commit log files can be used to speed up the object repair process in the event of a permanent failure. Because the commit log file records all the transactions committed in the VG, the replacement node of a permanently failed storage node can obtain the commit log file from the current primary node or the temporary primary node and obtain all the object IDs that need to be repaired. Otherwise, the primary node or the temporary primary node would need to scan all files in the VG directory, which is very time consuming.
  • the CTL contains all transaction information describing historical operations of objects stored in this VG. In fact, we only need the latest operation information of each object in the VG. Therefore, we can traverse the CTL, delete duplicate transaction entries for the same object, and keep only the latest transaction that can be determined by the transaction ID and VG version (that is, the transaction with the largest VG version and largest transaction ID). Deduplication of the same object transaction reduces the size of the commit log file, so more object transactions can be recorded with less storage space.
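  • A minimal sketch of this deduplication is given below. Entries are represented as dicts with assumed field names, and the latest transaction per object is chosen by the largest VG version and then the largest transaction ID, as described above.

```python
def compact_ctl(ctl):
    """Keep only the latest transaction per object: the entry with the largest
    (vg_version, transaction_id) wins, shrinking the committed transaction list."""
    latest = {}
    for tx in ctl:
        key = tx["object_id"]
        best = latest.get(key)
        if best is None or (tx["vg_version"], tx["transaction_id"]) > (best["vg_version"], best["transaction_id"]):
            latest[key] = tx
    # Preserve commit order of the surviving entries.
    return sorted(latest.values(), key=lambda tx: tx["transaction_id"])

ctl = [
    {"object_id": "A", "vg_version": 1, "transaction_id": 5},
    {"object_id": "A", "vg_version": 1, "transaction_id": 9},
    {"object_id": "B", "vg_version": 2, "transaction_id": 7},
]
print(compact_ctl(ctl))  # A keeps transaction 9, B keeps transaction 7
```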
  • Object read process: When the primary storage node is fault-free, the client always reads file segments from the primary storage node (an illustrative sketch of the read path is given after the degraded-read description below). For each file segment,
  • the client calculates the VG ID, using the first mapping to map the object/file-segment ID to a VG:
  • VG_ID = HASH(SEGMENT_ID) % VG_NUMBER.
  • the client obtains the primary node ID from a management node, where the management nodes maintain the second mapping from each VG to its list of eligible storage nodes.
  • the client sends a read request to the primary node corresponding to this object.
  • for an erasure-coding or network-coding protection scheme, the primary node then collects K blocks of the file segment from its local storage and from K-1 surviving secondary nodes and reconstructs the file segment; for a multi-copy scheme, it retrieves its local replica.
  • the primary storage node sends the file segment to the client.
  • If the primary node fails, the handling of the read request is transferred to the temporary primary node (TPN).
  • For a multi-copy data protection strategy, the TPN retrieves a replica of the object from local storage and immediately replies to the client.
  • For erasure coding or network coding, the TPN collects K blocks from the first K surviving secondary nodes in the VG eligible node list, reconstructs the original object, and sends the segment/object to the client. The read process during a failure is called a degraded read.
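The sketch below illustrates, in Python, the two hash mappings and the fault-free versus degraded read paths described above. The hash function (MD5), the configuration constants and the RPC-style helpers on the client object are assumptions chosen for exposition; the disclosure does not prescribe a particular hash function or client API.

```python
import hashlib

VG_NUMBER = 1024                    # assumed configuration value
MANAGEMENT_NODE_NUMBER = 3          # assumed configuration value

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def vg_of(segment_id: str) -> int:
    # First mapping: object / file-segment ID -> virtual group.
    return _hash(segment_id) % VG_NUMBER

def management_node_of(vg_id: int) -> int:
    # Load-balanced choice of the management node holding the second mapping.
    return _hash(str(vg_id)) % MANAGEMENT_NODE_NUMBER

def read_segment(client, segment_id: str) -> bytes:
    """Illustrative read path; `client` is assumed to expose RPC-style helpers."""
    vg_id = vg_of(segment_id)
    mgmt = management_node_of(vg_id)
    primary = client.get_primary(mgmt, vg_id)   # second mapping: VG -> eligible node list
    if client.is_online(primary):
        # Fault-free case: the primary gathers K blocks (local + K-1 secondaries,
        # or a local replica for multi-copy) and returns the reconstructed segment.
        return client.request_read(primary, segment_id)
    # Degraded read: the temporary primary node serves the request instead,
    # rebuilding the object from the first K surviving secondaries if needed.
    tpn = client.get_temporary_primary(mgmt, vg_id)
    return client.request_read(tpn, segment_id)

# The hash-based mappings can be exercised on their own:
sid = "FILE_123_0"                  # hypothetical segment ID (file 123, segment 0)
print(vg_of(sid), management_node_of(vg_of(sid)))
```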
  • Object update process: Suppose the client wants to update an object at position OFFSET with length LENGTH. The client adds 1 to the current version to obtain the next version, then maps the object/segment ID to a VG: VG_ID = HASH(SEGMENT_ID) % VG_NUMBER.
  • the client obtains the primary node ID from a management node, which maintains the second mapping from the VG to its eligible node list.
  • the client sends the updated data portion to the primary node corresponding to this object.
  • The primary node checks the object's current version (rejecting the update if the client's new version is not greater than it), assembles the update into a transaction, appends it to its local STL and forwards it to the secondary nodes according to their positions in the eligible node list. After receiving the transaction, each secondary node appends it using the same rules as in the object write process and, if successful, replies to the primary node with a confirmation message.
  • For erasure coding or network coding, the primary node obtains the old data at position and length <OFFSET, LENGTH> from local storage or from the other secondary nodes that store the updated part of this object, and computes the data increment, i.e. the XOR of the new and old data:
  • Data_Delta = New_Data_at_OFFSET_LENGTH ⊕ Old_Data_at_OFFSET_LENGTH.
  • The primary node treats each data increment Data_Delta as a single file segment/object and, according to the scheme defined by the erasure-coding or network-coding algorithm, computes the corresponding parity block increments (Parity_Delta).
  • The primary node assembles Data_Delta into a transaction in the same way as an ordinary update transaction, appends it to its local STL or sends it to the secondary node responsible for this updated part; it then assembles the Parity_Delta transactions and forwards them to the corresponding secondary nodes (a sketch of this delta computation follows this list).
  • Each secondary node processes these delta transactions exactly as it processes an ordinary appended transaction.
  • After receiving confirmation from all secondary nodes that the transaction was executed successfully, the primary node commits the transaction and sends a commit request to all involved secondary nodes, including every secondary node that stores a Data_Delta or a Parity_Delta; the object update is then persistently stored to the local file system.
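The following Python sketch illustrates the delta computation for an update. It deliberately assumes a pure-XOR parity, so the parity increment equals the data increment; a real (N, K) erasure or network code would additionally scale the data delta by the coding coefficient of each parity block. The function names are illustrative only.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

def compute_data_delta(old_block: bytes, new_part: bytes, offset: int, length: int) -> bytes:
    # Data_Delta = New_Data_at_OFFSET_LENGTH xor Old_Data_at_OFFSET_LENGTH
    return xor_bytes(new_part, old_block[offset:offset + length])

def apply_delta(block: bytes, delta: bytes, offset: int) -> bytes:
    # A data or parity node folds the delta into its stored block at the same
    # offset; with a pure-XOR parity the parity delta equals the data delta.
    head = block[:offset]
    body = xor_bytes(block[offset:offset + len(delta)], delta)
    tail = block[offset + len(delta):]
    return head + body + tail

# Tiny worked example: update bytes 2..5 of an 8-byte data block.
old_data = bytes(range(8))
new_part = b"\xff\xff\xff\xff"
delta = compute_data_delta(old_data, new_part, offset=2, length=4)
assert apply_delta(old_data, delta, offset=2)[2:6] == new_part
```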
  • Object delete process: The client adds 1 to the object's current version to obtain the next version and maps the object/segment ID to a VG: VG_ID = HASH(SEGMENT_ID) % VG_NUMBER.
  • the client obtains the primary node ID from a management node, which maintains the second mapping from the VG to its eligible node list.
  • the client sends a DELETE request containing <ObjectID, DELETE> to the primary node.
  • After receiving the delete transaction, each secondary node appends the transaction using the same rules as in the object write process and, if successful, returns a confirmation message to the primary node.
  • After receiving success confirmations from all secondary nodes, the primary node commits the DELETE transaction and sends a commit request to all secondary nodes.
  • The primary and secondary nodes usually do not delete the object from the local file system immediately; they only mark the object as DELETING, for example by adding a DELETING flag to the file name corresponding to the object.
  • The actual delete operation is performed asynchronously by a background process according to a preset policy; for example, the policy typically specifies that the object is permanently deleted after a certain period of time (as sketched below).
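Purely as an illustration, the Python sketch below marks an object file as DELETING by renaming it and lets a background task purge marked files once a retention period has elapsed; the suffix convention, retention period and use of file modification time are assumptions, not requirements of this disclosure.

```python
import os
import time

DELETING_SUFFIX = ".DELETING"        # assumed naming convention for the mark
RETENTION_SECONDS = 7 * 24 * 3600    # assumed policy: purge one week after marking

def mark_deleted(object_path: str) -> str:
    # The primary/secondary node does not remove the file immediately; it only
    # renames it so that the DELETING mark is visible in the file name.
    marked_path = object_path + DELETING_SUFFIX
    os.rename(object_path, marked_path)
    return marked_path

def purge_expired(vg_dir: str, now: float = None) -> None:
    # Run periodically by a background process: permanently delete objects whose
    # DELETING mark is older than the retention period defined by the policy.
    now = time.time() if now is None else now
    for name in os.listdir(vg_dir):
        if not name.endswith(DELETING_SUFFIX):
            continue
        path = os.path.join(vg_dir, name)
        if now - os.path.getmtime(path) >= RETENTION_SECONDS:
            os.remove(path)
```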
  • Based on these basic object operations, namely object write, read, update, and delete, a distributed file storage system can be constructed.
  • Such a system can provide file operations including, but not limited to, writing, reading, updating, and deleting files.
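As one possible illustration of how a file layer could sit on top of these object operations, the Python sketch below splits files into fixed-size segments, pads the last segment with zeros, and names segments FILE_ID_0, FILE_ID_1, and so on, as described earlier; the object_store interface and the 4 MiB segment size are assumptions for exposition.

```python
SEGMENT_SIZE = 4 * 1024 * 1024        # assumed segment size (4 MiB)

class FileStoreFacade:
    """Illustrative file layer over the object write/read/delete operations.
    `object_store` is assumed to expose write(id, data), read(id) and delete(id)."""

    def __init__(self, object_store):
        self.store = object_store

    def write_file(self, file_id: str, data: bytes) -> None:
        # Pad the tail segment with zeros so every object is SEGMENT_SIZE long.
        for i in range(0, max(len(data), 1), SEGMENT_SIZE):
            segment = data[i:i + SEGMENT_SIZE].ljust(SEGMENT_SIZE, b"\0")
            self.store.write(f"{file_id}_{i // SEGMENT_SIZE}", segment)

    def read_file(self, file_id: str, size: int) -> bytes:
        segment_count = (size + SEGMENT_SIZE - 1) // SEGMENT_SIZE
        segments = [self.store.read(f"{file_id}_{n}") for n in range(segment_count)]
        return b"".join(segments)[:size]

    def delete_file(self, file_id: str, size: int) -> None:
        for n in range((size + SEGMENT_SIZE - 1) // SEGMENT_SIZE):
            self.store.delete(f"{file_id}_{n}")
```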

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed data storage method and system based on multi-layer consistent hashing comprises: a plurality of storage nodes that provide data storage and redundancy protection; a plurality of management nodes that maintain storage node attributes and virtual group-to-storage node mapping information; a plurality of monitoring nodes that maintain the status of the storage nodes and handle state changes such as storage node addition, removal, and failure; and one or more clients that provide applications or users with access points to the storage system. The storage nodes are organized in a tree architecture, in which each storage node at each layer of the tree is assigned multiple identifiers and maintains a consistent hash space. Multiple hash spaces, each kept consistent within its own layer, exist in the storage architecture tree, rather than a single hash space shared among all storage nodes. The system guarantees the reliability and availability of the storage process and avoids load imbalance when the system is accessed.

Description

基于多层一致性哈希的分布式数据存储方法与系统 技术领域
本发明涉及分布式存储系统,特别涉及一种具有容错和自动负载均衡的分布式数据存储系统。
背景技术
分布式存储系统,是将数据分散存储在多台独立的设备上。传统的网络存储系统大多采用集中的存储服务器存放所有数据,存储服务器成为系统性能的瓶颈,也是可靠性和安全性的焦点,不能满足大规模存储应用的需要。分布式网络存储系统采用可扩展的系统结构,利用多台存储服务器分担存储负荷,利用位置服务器定位存储信息,它在一定程度上提高了系统的可靠性、可用性和存取效率,但还存在以下问题。
分布式存储系统需要使用多台服务器共同存储数据,然而随着服务器数量的增加,服务器出现故障的概率也在不断增加,特别是对于大型数据存储系统,发生故障更是不可避免的。为了保证在有服务器出现故障的情况下系统仍然可用。一般做法是把一个数据分成多份存储在不同的服务器中。但是由于故障和并行存储等情况的存在,同一个数据的多个副本之间可能存在不一致的情况。
分布式存储系统需要多台服务器同时工作。当服务器数量增多时,其中一些服务器出现故障是在所难免的。这样的情况会对整个系统造成影响。在系统中的一部分节点出现故障之后,要确保系统的整体不影响客户端的读/写请求,即系统的可用性要有保证。
其次,除了故障以外,另一个重要的问题是如何将数据均匀地分配给存储节点,避免访问系统时出现热点的负载均衡或数据路由等问题。
发明内容
本发明提供一种基于多层一致性哈希的分布式数据存储方法与系统,解决了分布式数据存储系统容错性差,负载均衡差的问题。
一种基于多层一致性哈希的分布式数据存储系统,包括:多个提供数据存储和冗余保护的存储节点;多个维护存储节点属性及虚拟组-存储节点映射信息的管理节点;多个维护存储节点的状态,处理存储节点的添加、删除和故障的监控节点;一个或多个提供应用程序或用户访问存储系统的 接入点的客户端。
存储节点的属性包括:节点标识(ID),父节点标识(ParentID),层级类型(LayerType),存储节点容量权重值(Weight)、节点虚拟标识(VIDs)、节点所属主机标识(ServerID)、节点所属机架标识(RackID)、节点所属集群标识(ClusterID)、IP、Port和节点状态(Status)。
存储节点基于每个存储节点的属性形成存储架构树,该树有多层,每层包含不同种类的节点,例如,根层表示整个存储集群;设备层(存储节点)位于树底部(叶节点),是数据存储的目的地。树的每一层(叶节点除外,其无子节点)都是其直接子节点的父层,父节点的权重等于其所有直接子节点权重总和。
管理节点维护基于一致性哈希的虚拟组-存储节点映射信息,包括:从虚拟组到合格存储节点列表的映射信息;从虚拟组和故障存储节点到故障存储节点对应替换节点的映射信息。本发明中虚拟组(Virtual Group,VG)对应哈希空间的一个分区,即哈希子空间。
存储节点以两种方式通过心跳消息交换其状态:在进行数据传输时,从主节点到辅助节点的转发请求同时被视为心跳包;空闲时,不包含任何对象数据的主节点到辅助节点的消息将用作心跳包,辅助节点向主节点发送回复以声明处于在线状态。
该分布式数据存储方法设有故障检测过程,通过心跳机制对故障进行检测,提高了系统的容错性,使存储过程的可靠性和可用性得到了保证,避免了由于存储节点故障而导致的存储数据不一致问题的出现。为了将数据均匀地分配给存储节点采用了多层一致性哈希算法,避免了访问系统时负载不均衡问题的出现。
附图说明
图1为本发明的示例性实施例中包含的组件示意图,包括存储节点,管理节点,监控节点和客户端;
图2为基于多层一致性哈希的存储节点的示例性分层树结构图;
图3为本发明的示例性实施例中的主节点选择过程示意图;
图4为本发明的示例性实施例中的辅助节点选择过程示意图;
图5为本发明的示例性实施例中的初始数据放置过程之后的虚拟组-存储节点映射表;
图6为本发明的示例性实施例中的故障检测算法流程图;
图7为本发明的示例性实施例中的存储节点发生故障时,替换节点映 射表;
图8为本发明的示例性实施例中分布式哈希表修复部分的临时故障修复算法示意图;
图9为本发明的示例性实施例中分布式哈希表修复部分的永久故障修复法示意图;
图10为本发明的示例性实施例中的添加节点算法示意图;
图11为本发明的示例性实施例中的删除节点算法示意图。
具体实施方式
下面通过具体实施方式结合附图对本发明作进一步详细说明。在以下的实施方式中,很多细节描述是为了使得本申请能被更好的理解。然而,本领域技术人员可以毫不费力的认识到,其中部分特征在不同情况下是可以省略的,或者可以由其他元件、材料、方法所替代。在某些情况下,本申请相关的一些操作并没有在说明书中显示或者描述,这是为了避免本申请的核心部分被过多的描述所淹没,而对于本领域技术人员而言,详细描述这些相关操作并不是必要的,根据说明书中的描述以及本领域的一般技术知识即可完整了解相关操作。
另外,说明书中所描述的特点、操作或者特征可以以任意适当的方式结合形成各种实施方式。同时,方法描述中的各步骤或者动作也可以按照本领域技术人员所能显而易见的方式进行顺序调换或调整。因此,说明书和附图中的各种顺序只是为了清楚描述某一个实施例,并不意味着是必须的顺序,除非另有说明其中某个顺序是必须遵循的。
一种基于多层一致性哈希的分布式对象存储方法与系统包括:多个提供数据存储和冗余保护的存储节点;多个维护存储节点属性及虚拟组-存储节点映射信息的管理节点;多个维护存储节点状态并处理存储节点添加、删除和故障等状态变化的监控节点;以及一个或多个为应用或用户提供访问存储系统接入点的客户端。图1为本发明的示例性实施例中包含的组件示意图,包括存储节点(SN),管理节点(MN),监控节点(MoN)和客户端。
存储对象的第一步是检索负责保存对象的存储节点。本发明采用两阶段映射将对象映射到其目标存储节点。第一层映射是将对象标识符映射到虚拟组(VG),称为第一映射。VG为虚拟节点拥有的哈希子空间,虚拟节点均衡分布于整个哈希空间,将哈希空间统一划分为多个子空间。我们认 为VG和虚拟节点是等同的。每个(物理)存储节点中存储多个VG,每个VG中保存多个数据对象。
第一映射:将对象映射到VG即哈希子空间,包括两个步骤:
(1)哈希计算:用Key表示对象标识,客户端计算对象标识哈希值:
ObjectHash=Hash(Key)
(2)VG映射:设VG_NUMBER为VG的总数,客户端通过模VG_NUMBER运算将对象映射到VG:
VG_ID=ObjectHash%VG_NUMBER
在本发明中,VG是数据复制或编码存储以及存储节点选择的基本单元。VG_NUMBER应足够大,例如为存储节点数量100倍。多个VG可能位于相同的存储节点中。实际上,VG_NUMBER只与本发明中的负载均衡粒度有关,与VG到存储节点的映射没有直接关系。
第二映射:将VG映射到对象的最终存储节点。本发明设计一种基于多层一致性哈希的存储负载均衡机制实现该映射。当一个节点加入集群时,将根据其存储容量为其分配一个权重值。例如,权重=1.0表示节点具有1T存储容量,权重=2.0表示具有2T存储容量等。权重表示一个节点与其他节点相比的相对存储能力。一层节点的权重等于下层所有节点的权重总和。第二映射有两个子流程:主节点选择,辅助节点选择。主节点接收来自客户端的请求并将请求分发给辅助节点。主节点选择和辅助节点选择都在分层树结构中执行,在该分层树的每层上都有节点,并且每个节点有一些预定义的属性。
本发明的示例性实施例中的存储节点属性表包括:节点标识(ID),父节点标识(ParentID),层级类型(LayerType),存储节点容量权重值(Weight)、节点虚拟标识(VIDs)、节点所属主机标识(ServerID)、节点所属机架标识(RackID)、节点所属集群标识(ClusterID)、IP、Port和节点状态(Status)等。节点之间的层次关系可以通过<ID,ParentID>确定。设备节点处于最低存储层级,可为一块硬盘或一个分区,为数据对象的最终存储目的地,其它存储层级(如Server,Rack等)标识设备节点的所处网络架构,用于对象到设备节点的映射。
图2为基于多层一致性哈希的存储节点的示例性分层树结构图。除了依据节点的分层树结构,节点的选择也应该符合存储规则。在本发明的示例性实施例中,存储规则包括:故障域级别,包括磁盘(Disk),服务器或主机(Sever),机架(rack),集群(Cluster)等;存储策略,包括:一个对 象存储设备中存储的对应单个对象的最大副本数量、一台服务器中存储的对应单个对象的最大副本数、一个机架中存储的对应单个对象的最大副本数量、一个集群中存储的对应单个对象的最大副本数等;负载均衡参数,包括:存储使用的偏差阈值(如允许节点间实际存储量偏差不超过5%或10%等);最大CPU使用率;最大网络使用率等。
主节点选择。选择VG存储合格节点列表的第一个节点作为主节点。主节点接收来自客户端的请求,对请求中的对象进行编码、解码或复制,将冗余对象分配给辅助节点(写过程),或从辅助节点收集对象块(读过程),并向客户端发送对请求的确认。图3给出了节点层中的示例性VID分配图,其用于解释主节点选择过程。假设图2所示为集群架构图。整个集群由顶层根节点(root)表示。根节点有两个机架,每个机架都有两台主机/服务器,每台主机都有两个磁盘。在本发明中,磁盘被称为设备或存储节点,是一种特殊类型的节点(无子层节点),并且是存储数据的目标设备。设备节点始终位于整个群集的底层。主节点选择采用深度优先搜索(Depth First Search,DFS)方式,使用存储规则作为修剪策略,从根节点到集群层次结构中的叶节点选择节点或设备。在图2中,根下面的第一层是机架层,机架层下面层次依次为服务器层,以及最下层的设备层。根据图3中所示的示例性实施例,主节点选择过程如下。在存储集群可用于存储数据之前,即集群初始化时,主节点选择可以由系统管理员执行。
(1)随机分配多个VID到每个机架,例如根据权重设置VID_NUM=5,则如图3所示,
机架1:{VID11,VID12,VID13,VID14,VID15}
机架2:{VID21,VID22,VID23,VID24,VID25}。
(2)将机架0和机架1的VID映射到哈希空间,即计算每个VID的哈希值,并根据哈希值将每个VID放入哈希环中。
(3)在相同的哈希空间中计算该VG的ID的哈希值,假设HASH(VG_ID)被映射到图3中的点“A”。
(4)从“A”开始沿顺时针方向查找到第一个VID是属于机架1的VID13,所以在机架层选择机架1。
(5)从机架层到设备层继续相同的处理,我们可以为此VG选择主节点,通常是磁盘或分区。
在主节点选择的示例中,每个节点将被分配多个VID。在本发明的实施例中,根据节点权重确定分配给每层中每个节点的VID的数量 (VID_NUM),每个节点权重等于其直接下层中所有子节点的权重总和。例如,假设单位权重表示1T存储空间,我们可以为1T存储分配5个VID。如果一个节点有5T存储,它将被分配25个VID。父节点的所有直接子节点将共享同一哈希环/空间。这样,服务器中的所有存储设备或节点(即该服务器的下层设备层)将共享相同的哈希环/空间。请注意,存储节点、设备和磁盘或其分区在本发明中是可互换的。如果一个服务器有6个磁盘或分区,每个磁盘有5个VID,这个服务器的哈希空间将被分成30个子空间,其中子空间最终将由虚拟组(VG)选择,也就是磁盘或分区将成为这个VG的存储目的地。
图4表明了辅助节点选择的过程。为了保证数据的高可用性和高可靠性,一个对象存储在除主节点以外的多个辅助节点中。主节点用于存储原始对象(对于多副本策略)以及存储由编码算法产生的第一分块(对于纠删码、网络编码等)。辅助节点用于存储原始对象的副本(多副本策略)或者存储对象编码后的其他数据/奇偶校验/编码块(对于纠删码、网络编码等)。在本发明的示例性实施例中可以使用两种用于辅助节点选择的策略:
(1)主节点到最后一个辅助节点依次选择策略:每个设备只选择一次,即每个存储节点在合格存储节点列表中出现一次。该策略适用于具有足够多节点的集群,例如集群的节点数多于存储策略中设置的对象副本的数量。
(2)主节点回环选择策略:如果集群的节点较少,则每个节点可以被选择多次。也就是说,每个主节点或辅助节点能保存一个对象或VG的多个副本或分块。
在本发明中我们只解释策略(1)。策略(2)对于本发明领域的技术人员来说是显而易见的。与主节点选择类似,我们仍然依据存储规则,并使用存储规则作为修剪策略以深度优先搜索(DFS)的方式从集群层次结构中的根节点到叶节点选择辅助节点。存储规则在选择辅助节点时设置限制。一个节点能否成为辅助节点候选者,由该候选者是否与其它已选择节点冲突(如已被选择)以及是否满足存储规则中定义的约束来确定。
假设我们想在同一个机架但非同一服务器中选择一个辅助节点,即在服务器层选择另一个属于不同服务器的子空间。图4为基于在前述主节点选择过程中选择的主节点位置来选择辅助存储节点的示例性实施例的层次图。我们使用VID来表示每层哈希环中的子空间。假设服务器1中的磁盘1在选择主服务器的过程中被选为VG的主服务器。VG位于属于服务器1的VID11的哈希子空间中。根据存储规则(假设一个服务器中只能保存一 个对象或VG副本),我们需要在服务器层选择VID11后面的另一个不属于服务器1的VID,即属于服务器2的VID20。接下来,在服务器2的设备层中选择子空间。VID应该选择从HASH(VG_ID)开始沿顺时针方向遇到的第一个哈希子空间节点。假设磁盘2的其中之一VID被选择,即该磁盘VID的哈希值紧跟在HASH(VG_ID)之后。如图4,服务器2中的磁盘2被选为第一个辅助节点。然后根据存储规则按照相同的方式从同一机架另一台服务器(或从父层机架层选择另一机架中的服务器)中选择第二个辅助节点,并按照相同方法依次选择其他的辅助节点,其中存储规则作为修剪策略加速选择过程。
图5为在本发明的示例性实施例中的初始数据放置过程之后的VG-存储节点映射表,表格设置基于集群节点的分层拓扑结构。对于每个VG,可以将集群层级架构、集群中每个节点的权重、每个节点的被分配VID列表及当前存储规则等作为输入来计算其存储位置。该过程由两个子过程组成:主节点选择和辅助节点选择,其处理过程已在上面描述过,此处不再阐述。对于本发明的优选实施例,可以在分布式内存数据库中保存VG-存储节点映射以获得高性能。VG-存储节点映射表由管理节点维护,管理节点调整VG-存储节点映射以使映射表反映存储节点在添加、删除或故障的情况下的成员状态的变更,继而将对象重新平衡到正确的目的地。在图5中VG-存储节点表的每个记录或条目都包含一个VG_ID及其对应合格节点列表,后者包括主节点和辅助节点。对于每个对象,客户端首先通过计算第一映射获取VG,然后使用VG_ID作为主键查询VG-存储节点表,客户端将对象发送到的存储节点列表所列节点,或从存储节点列表所列节点中读取对象。VG-存储节点映射以集群开始运行之前或存储节点未发生故障之前的初始数据布局映射为起始状态。
对于分布式存储系统,特别是大型分布式存储系统,故障发生是常态。除了故障,在存储节点添加/加入和移除(永久故障)/废弃(老化)的情况下,对系统进行维护也是必不可少的。因此,初始数据布局图应根据存储节点的状态变更而变更。为使初始数据布局映射与存储节点状态变更同步更新,首先要进行故障监测,因为故障是被动事件,而添加/加入和删除/废弃是可以由管理员主动处理的主动事件。
故障监测。假设监控节点的状态由PAXOS算法维护,则在任何时候都有一个主监控节点在线并维护所有存储节点的状态。如果一个监控节点由于故障而处于脱机状态,则另一个监控节点可以由管理员重启或自动重启。 在本发明的示例性实施例中,使用节点ID作为主键,并将所有存储节点的状态存储在分布式内存数据库中以实现高性能。监控节点和客户端可以检索VG合格存储节点列表的节点状态。每个VG合格存储节点列表的状态维护由监控节点在主节点和辅助节点的帮助下进行协调。故障监测可以通过心跳机制来实现。当处于空闲状态,即没有收到来自客户端的请求消息时,主节点向每个辅助节点发送心跳消息以声明其主节点位置。在处理来自客户端的请求时,主节点将对象副本或其分块发送到辅助节点,这也可以被视为心跳消息。在这种情况下,真正的心跳包会延迟发送。每个VG的主节点向VG的所有辅助节点发送心跳,然后每个辅助节点向主节点发送确认消息以声明其存活状态。如果任何辅助节点在某个预设时间间隔内没有回复心跳确认信息,则主节点向监控节点报告辅助节点的故障消息。如果任何辅助节点在某个当前时间间隔内没有收到心跳消息,辅助节点会将主节点的故障消息报告给监控节点。我们使用VG_Version来表示VG合格节点列表中主存储节点的状态变更。一个主节点故障,随后选择一个临时主节点,并且VG_Version增加1。真正的心跳消息仅包含VG_Version而没有任何其它数据。当任何VG的VG合格节点列表中的主节点发生故障时,监控节点根据临时主节点选择原则调用临时主节点(Temporary Primary Node,TPN)选择算法来选择临时主节点(TPN),从而为发生故障的主节点选择主替代节点。当任何辅助节点发生故障时,监控节点会为故障的辅助节点选择辅助替换节点。被替换节点替换的故障主节点或故障辅助节点称为宿主节点。
临时主节点选择原则。主存储节点在故障之前持有最长的提交事务列表(Committed Transaction List,CTL)和报送事务列表(Submitted Transaction List,STL),新的临时主存储节点应最大可能保持CTL和STL不变性(即被选临时主存储节点在所有的候选辅助节点中具有最长的CTL和STL)。选择原则:始终选择无任何替换节点且具有最大报送事务ID、最大提交事务ID和最新VG版本的第一个辅助存储节点作为临时主节点(TPN)。如果旧主节点在时间TEMP_FAILURE_INTERVAL内返回到集群,它将再次被分配为主节点。所谓主节点不变性就是主节点始终尽最大可能保持或恢复主节点身份。
临时主节点选择。监控节点依据临时主节点选择原则从当前在线辅助存储节点中选择新主节点。如果任何辅助存储节点没有在预设时间间隔内收到来自主节点的请求消息,辅助存储节点将把主节点的故障信息报告给 监控节点,临时主节点选择过程将会启动。如果任何辅助节点没有收到来自主节点的任何心跳或请求消息,则主-辅节点关系(Primary-Secondary Relationship,PSR)将被该辅助节点视为失效。一旦合格节点列表中的节点被选为任何VG的主节点或临时主节点,PSR就会建立。PSR由VG_Version标识,一旦选择了新的主节点或旧主节点返回,例如原故障主节点在故障后重新加入集群,VG_Version都会增加1。已失效PSR的辅助节点将向监控节点报告该变更消息。故障报告消息由最新提交事务ID(Latest Committed Transaction ID,LCMI),最新报送事务ID(Latest Submitted Transaction ID,LSTI)和当前VG版本(Current VG Version,CVV)组成。一旦接收到来自作为主要候选节点的辅助节点的报告消息,监控节点首先检查CVV,排除具有较低VG版本的候选者;然后检查当前候选者的LCMI,排除LCMI较低的节点;最后检查LSTI,排除LSTI较低的节点。经过所有这些检查后,如果只剩下一个候选者,则检查结束,该节点被选为临时主节点。如果在此过程中没有选择任何节点,则VG状态将被置为不可用。映射到此VG的任何对象请求将会失败,因为没有节点可以接受这些请求。如果临时主节点选择成功,它将代替发生故障的主节点处理来自客户端的后续请求。当发生故障的主节点在临时故障间隔TEMP_FAILURE_INTERVAL内重新加入集群,在一致性哈希表更新和数据恢复之后,新加入主节点再次被设置为主节点。图6为本发明的示例性实施例中的故障检测算法流程图。
基于本地亲和性原则的替换节点选择。如上所述,包括主节点和辅助节点在内的任何节点都会出现故障,监视器会选择一个替换节点来处理应该分配给发生故障节点的请求。对替换节点的选择将依据本地亲和性原则执行。也就是说,新的被选择候选者首先在同层中选择,例如首先在相同服务器层,然后是相同机架层等。本地亲和性原则意味着故障节点和替代节点可能处于相同故障域中。如果同一故障域中的候选者已经在合格节点列表中,即与该层另一存储节点冲突则以DFS方式在上层继续选择,直到选出满足存储规则的替换节点。与辅助节点选择类似,DFS搜索过程使用存储规则作为修剪策略。如果故障节点重新加入集群并快速恢复数据,使用本地亲和性原则进行选择可确保快速地将数据从替换节点转移到重新加入的节点。
图7为本发明的示例性实施例中存储节点发生故障时,VG-替换节点映射表。在优选实施例中,VG-替换节点(Replacement Node)映射可以通过 分布式内存数据库(Distributed Memory Database,DMDB)表格来实现。VG-替换节点表作为元数据存储在管理节点集群中。任何监控节点都可查询该表。如果任意监控节点故障,可以在当前或其他服务器上重新启动,由于数据库的一致性,VG-替换节点表是稳定(一致)的。VG-替换节点表由以<VG_ID,Host Node ID,Replacement Node ID List>为格式的条目组成,以<VG_ID,Host Node ID>作为主键。宿主节点(Host Node ID)是发生故障的主节点或辅助节点。对于每个VG,每个宿主节点可能有多个替换节点,因为替换节点也可能故障,当替换节点发生故障时,应该为此故障宿主节点选择另一个替换节点。但是,任何时候都只有一个在线的替换节点。替换节点和所有存储节点的状态由管理节点维护。
当发生故障时,对故障节点的请求将被重新定向到与其相应替换节点。在本发明中,我们定义两种故障:临时故障和永久故障。如果故障节点在预设临时故障间隔(TEMPORARY_FAILURE_INTERVAL)内未重新加入集群,则该节点将被标记为临时故障(TEMPORARY_FAILURE)状态。如果发生故障节点在TEMPORARY_FAILURE_INTERVAL内重新加入并报告至监视器,则该节点随后会标记为临时故障修复中(TEMPORARY_FAILURE_RECOVERING)状态,监视器将启动临时故障恢复过程。如果在预设永久故障间隔(PERMANENT_FAILURE_INTERVAL)时间之后,监视器仍然不能接收到来自该节点的重新加入的消息,则它将被标记为永久故障状态(PERMANENT_FAILURE)。监视器为此故障节点和存储在此故障节点中的VG启动永久故障修复。临时故障修复和永久故障修复包括两个阶段:分布式哈希表(Distributed Hash Table,DHT)修复和数据修复。DHT修复使从VG到合格存储节点列表的映射与集群中的存储节点成员状态更改保持一致。数据修复根据从VG到合格存储节点的映射将数据移动到其目的地。从VG到合格存储节点的映射由管理节点维护,不同VG条目的不同位置的更新可以由不同的管理节点同时执行。这就是分布式哈希表中“分布式”的含义。
分布式哈希表(DHT)的临时故障修复。当存储节点接收到表示某些节点被标记为TEMPORARY_FAILURE的集群状态变更时,它将遍历其存储的所有虚拟组(VG),如果VG合格节点列表中的有一个节点处于状态TEMPORARY_FAILURE,则将该VG的状态标记为TEMPORARY_RECOVERING。如果发生故障的节点是主节点,其职责将 被使用上述TPN选择算法选出的临时主节点(TPN)接管。故障节点的VID(一般为多个)将被临时屏蔽,TPN将根据本地亲和性原则首先在故障节点的同一故障域中选择替换节点,例如同一服务器、同一机架等。修复期间的更新操作将重新定向到替换节点。若发生故障的节点在时间TEMPORARY_FAILURE_INTERVAL内恢复,更新操作(的数据)将从临时节点移回到新恢复节点。这就是分布式哈希表的临时故障恢复的功能。
对于每个具有VG_ID的VG,假设属于VG合格节点列表的节点故障并且该节点是主节点。图8为本发明的示例性实施例中的临时故障恢复算法的示意图。假设第一个辅助节点如图8中所示服务器2的磁盘2VID11成为TPN。假设故障VG处于服务器1磁盘1中VID11的哈希空间,监控节点选择不属于磁盘1的位于同一设备层的另一个VID,即沿顺时针方向位于HASH(VID11)后的VID,在图8中,选择设备层哈希空间中不属于故障磁盘1且距离VID11最近的VID,故选择位于磁盘0中的VID00。所有更新将被重定向到磁盘0中的替换节点。磁盘1中的故障主节点返回后,将再次被选为主节点,其所有VID(一般为多个)都将返回到哈希环中。对于分布式哈希表的临时故障恢复,监控节点将更新存储节点信息表(含状态)、VG-存储节点映射表和VG-替换节点映射表。
监控节点首先修改存储节点表,将故障节点的状态标记为OFFLINE;对于VG-合格存储节点映射表更新,也将列表中对应故障节点的状态标记为OFFLINE。故障节点在重新加入集群并且其状态变更为ONLINE之前不会被选为主节点或替换节点且所有分配给它的VID在哈希空间中被屏蔽。对于VG-替换节点映射表更新,监控节点向VG-替换节点映射表增加输入条目<VG_ID,故障节点ID,替换节点ID>。可能有多个VG存储于故障节点。监控节点遍历VG-存储节点映射表以查找其存储合格节点列表中具有故障节点的VG。对于所有的VG,<VG_ID,故障节点,替换节点ID>将插入到VG-替换节点映射表中。
分布式哈希表的永久故障恢复。当监控节点收到表示某些节点的状态被标记为PERMANENT_FAILURE的集群状态变化时,它会遍历所有VG,如果其合格节点列表中的有节点处于PERMANENT_FAILURE状态,则将此VG的状态标记为PERMANENT_FAILURE_RECOVERING。对于监控节点:
(1)屏蔽所有属于故障节点的VID。
(2)对故障节点的上层节点(从直接父层直至根层)重新加权。如有 必要,上层节点可能会屏蔽一个或多个VID。
例如,如果服务器中的一个VID表示2个磁盘,并且服务器中有2个磁盘出现故障且这两个磁盘对应一个VID,则服务器将屏蔽此VID。如果此被屏蔽VID在哈希空间中的下一VID位于不同服务器,则会发生从该服务器到另一台服务器的数据迁移。如果VID的屏蔽没有发生在上层,则根据一致性哈希原则,数据迁移将位于同层的所有节点之间,且迁移速度将比迁移到其他层快,这就是本地亲和性原则的优势。
图9为本发明的示例性实施例中分布式哈希表的永久故障恢复算法的示意图。
假设重新加权后,服务器1中的VID保持不变,数据迁移只发生在位于同一设备层(服务器1下层)的磁盘1和磁盘0之间。假设磁盘1故障,其VIDs={DVID11,DVID12...}被屏蔽。磁盘1中的VG需要在VIDs={DVID01,DVID02...}的磁盘0中恢复。在恢复数据之前,VG-存储节点表中的VG-合格节点列表条目首先被修改。如果重新加权后,服务器1的VID11应该被屏蔽,服务器1中的数据将被迁移到另一台服务器(如图9服务器2)。在本发明的示例性实施例中,监控节点执行的分布式哈希表永久故障恢复算法过程如下:
(1)遍历VG-存储节点表,找到所有故障节点存储的所有VG及其包含故障节点的合格节点列表。
(2)对于每个VG,遍历VG-替换节点表,找到此故障节点的唯一在线替换节点。
(3)对于每个VG,用VG合格节点列表中的在线替换节点替换故障节点。
(4)数据修复完成后,从VG-替换节点表中删除条目<VG_ID,故障节点ID,替换节点ID>,其中数据恢复过程在下文介绍。
数据的临时性和永久故障修复。数据修复过程由VG的主节点(PN)或临时主节点(TPN)或独立修复节点进行协调,这些节点可以是监控节点或新加入的节点,VG是基本的修复单位。
假设用主节点处理修复过程。在VG存储节点列表的每个节点中,都保存一些用于数据存储和修复的永久变量。对于每个VG存储节点中的持久变量,VG首次创建时,VG版本的初始值为0;当VG合格节点列表改变时单调增加1。节点重新加入集群后,会连接到存储管理节点,检查是否有任何替换节点。如果不存在替换节点,则不需要修复重连节点的数据。 如果有替换节点并且有一个在线,则将报送和提交事务表从相应的替换节点复制到重新加入的节点。注意,在替换节点中,报送事务表(STL)和提交事务表(CTL)可以分层存储,即将STL存储在高性能介质(如SSD)中,将CTL存储在大容量但性能较低的介质中。在替换节点中,STL和CTL都不应用(落盘)于持久存储介质(例如HDD),它们将应用到其最终存储目的地,也就是重新加入的节点。但是,如果替换节点的宿主节点处于永久故障状态,则替换节点将扮演故障宿主节点的角色,这意味着STL和CTL将应用于替换节点。如果所有替换节点都处于脱机状态,则修复最多会延迟TEMPORARY_FAILURE_INTERVAL。如果替换节点在TEMPORARY_FAILURE_INTERVAL时间内恢复,则重新加入的节点将STL和CTL从替换节点复制到本地存储介质。如果在该时间段内没有替换节点返回到集群,则重新加入节点需要修复之前重定向到替换节点的数据。重新加入的节点向主节点(PN)或临时主节点(TPN)发送最后提交事务ID、PN或TPN回复其第一提交事务ID(First Committed Transaction Id,FCTI),最后提交事务ID(Last Committed Transaction Id,LCTI),第一报送事务ID(First Submitted Transaction Id,FSTI)和最后报送事务ID(Last Submitted Transaction Id,LSTI)。需要通过镜像复制、纠删码或网络编码来修复/再生事务ID范围在(Rejoining LCTI,Primary LCTI]内相关对象数据。事务ID在范围[Primary FSTI,Primary LSTI](无在线替换节点时)或范围[Rejoining FSTI,Replacement FCTI](有在线替换节点时)内的对象数据需要被修复/再生成以复原替换节点STL。
如果重新加入的节点是以前的主节点,则在获取事务ID区间(Rejoining LCTI,Primary LCTI]和[Rejoining FSTI,Replacement FCTI]之后,管理节点依据主节点不变性原则将重新设置此新加入节点为主节点,新主节点继续处理来自客户端的请求。如果存储节点在时间TEMPORARY_FAILURE_INTERVAL内无法重新加入集群,它将被视为永久故障节点(Permanent Failed Node,PFN)。该永久故障节点的替换节点将被提升到合格节点列表中,PFN中丢失的数据需要在这个新的合格节点中修复。主节点或临时主节点遍历VG目录中所有存储对象,并将对象名称列表和辅助修复节点ID列表发送到新的合格节点。若采用多副本,则辅助修复节点为此主节点或临时主节点,若采用(N,K)纠删码,则为VG合格节点列表中的前K个在线节点。新的合格节点逐一修复故障对象。对象名称列表中的名称组成包括但不限于<对象标识object id,对象版本object  version,最近操作类型last operation code,最近事务ID transaction id>。注意,版本较低对象不能覆盖具有相同对象ID但版本较大或相同的对象。辅助修复节点由管理节点确定,选择在线辅助修复节点的原则是:在VG合格节点列表中,首先选择没有替换节点的节点,然后选择具有较大提交事务ID的节点,最后选择具有较大报送事务ID的节点。
添加节点过程。添加节点时,集群组织架构图将改变。对添加节点的上层节点进行重新加权,直至根层。如果增加的存储空间足够大,重新加权路径上的节点将被分配新的VID。数据将从同一层的其它设备域迁移到该设备域。例如:同一层的两个机架,如果一个机架添加了很多磁盘,则另一个机架中的数据将迁移到该机架以实现存储负载均衡。
图10为本发明的示例性实施例中的添加节点算法的流程图。假设在设备层添加一个新磁盘2到服务器1。
监控节点根据其权重随机分配VID到磁盘2,如新磁盘2 VID21,新磁盘2 VID22等。服务器1的权重应该增加,以保持等于其所有子节点权重总和,如图中磁盘0、1和2。如果添加了足够多的磁盘,达到与上层一个VID的权重阈值相当,服务器1需要额外随机分配一个VID。假设服务器1未添加新VID,仅在旧磁盘和新磁盘之间发生数据迁移。由于新磁盘2具有与服务器1下层哈希环的子空间相对应的新VID,因此数据将从磁盘0和/或磁盘1迁移到新磁盘2以实现存储负载均衡。如果将足够多磁盘添加到服务器1中,则其可能会被分配额外的VID,这会导致服务器层中的哈希分区发生变更,从而引起服务器层内数据迁移。
在本发明的一个示例性实施例中,监控节点负责添加节点处理,过程如下:
(1)对于每个分配给新增节点的VID(前序VID),查找同一层中属于另一节点(后继节点)的沿顺时针顺序的后续VID(简称:后继VID)。显然,前序VID是同层哈希空间中后续VID的前序节点,新增节点与后续节点处于同一层。
(2)遍历VG-存储节点表中的所有VG,找到当前存储在后继节点(VG合格节点列表)VID子空间中,其哈希值(HASH(VG_ID))更接近前序VID的VG,即这些VG将被移动到新添加节点(前序节点)。
(3)对于每个应移动到前序节点的VG,在VG-存储节点表中更新该VG对应的合格节点列表,即在VG数据移动到前序节点之后,用前序节点替换合格节点列表中的后继节点。
移除/废弃节点过程。该过程与永久故障恢复过程相似,不同之处在于:故障处理是被动不可预知的,该过程是主动的,并且可由集群管理员处理。它无需数据修复,只需在本层或其上层进行数据迁移。当移除多个磁盘,屏蔽足够多VID时,移除点上层设备可能需要VID调整。由于一致性哈希的特性,数据迁移只发生在待移除/废弃节点和其他存储节点之间,没有移除的存储节点之间没有迁移。
如果足够多磁盘被移除,数据将迁移到上层设备中。图11为本发明的示例性实施例中的移除/废弃节点处理算法的示意图。假设设备层的磁盘2从服务器层的服务器1中移出。监控节点应屏蔽分配给磁盘2的所有VID。磁盘2的权重设置为0,这意味着磁盘2此时起不能接收任何数据。为了保持与图11中所有子节点磁盘0和磁盘1权重之和相等,服务器1的权重应该更新,即减去磁盘2的权重值。如果有足够多磁盘被移除,即达到该服务器层一个VID所对应的权重阈值,服务器1可能需要屏蔽VID。假设服务器1无需屏蔽VID,数据迁移将限于服务器1中的所有现有磁盘和待删除磁盘之间,即数据将从磁盘2迁移到磁盘0和/或磁盘1以进行存储负载均衡。如果许多磁盘从服务器1中删除,则可能需要屏蔽其某些VID,因为服务器层中的哈希分区发生更改,这会导致数据从服务器1迁移到同一层中的其他服务器,例如图中的服务器2。在本发明的示例性实施例中,监控节点进行移除/解除节点的处理如下:
(1)对于每个分配给待移除节点的VID(简称:前序VID),查找同层中属于另一节点(后继节点)的沿顺时针顺序的后续VID(后继VID)。显然,前序VID是同层哈希空间中后续VID的前序节点,待删除节点与后续节点处于同一层。
(2)遍历VG-存储节点映射表中的所有VG,找到存储在待删除节点中的VG,其哈希值位于该层的哈希空间中逆时针最接近前序节点VID的位置。这些VG应该移到后继节点。
(3)对于应移动到后继节点的在步骤2中找到的每个VG,在VG-存储节点映射表中更新与该VG对应的合格节点列表,即在将VG中的数据移动到后继节点后,在合格节点列表中用后继节点替换前序节点。
基于上述数据分布、存储负载均衡和数据修复的方法,系统可以提供高可用性和高可靠性。下面介绍在本发明的示例性实施例中提供的数据存储服务。存储服务可视为分布式存储系统的抽象存储层,为客户端访问存储系统(读取或写入)提供接入口。在本发明的示例性实施例中,我们实 现强一致性,确保特定对象上的所有相同操作序列以相同的顺序在所有节点上执行。对对象的相同操作可以请求多次,但由于存在对象更新版本控制,操作只会执行一次。数据一致性即使在出现故障的情况下也能得到保持。从单个客户端视角,分布式存储系统可认为单个只有一个存储节点在运行。本发明所属领域的技术人员可以根据本发明采用的原理对本发明进行修改以使其实现最终一致性。我们介绍对象读、写、更新、删除的执行过程,这些过程只是对象的基本操作,并且可以由本领域的技术人员在实际部署中扩展。
对象读、写、更新、删除过程。我们称存在于客户端或主节点中的文件片段(采用多副本存储策略),或文件片段编码后的编码块(采用纠删码或网络编码存储策略)为对象。每个来自文件片段的编码块与它的本地文件片段具有相同的对象ID,并且根据VG合格节点列表中的存储节点位置获得块ID。每个文件片段或编码块在存储节点中以单个文件存储,其文件名包含但不限于:对象ID、块ID、版本号等。
对象写入过程:假设客户端想要将文件存储到存储系统,客户端首先将文件划分为具有预设大小SEGMENT_SIZE的文件段,如果文件大小小于SEGMENT_SIZE,则需要在其末尾追加零以使文件占据整个文件段。文件段ID(SEGMENT_ID)可以基于由诸如管理服务器分配给文件的文件ID(FILE_ID)来计算。在本发明的示例性实施例中,管理节点集群维护和管理所有文件的元数据。在本发明的优选实施例中,元数据存储器可以在分布式内存数据库中实施。文件段ID根据其在文件中的偏移量从0开始逐个单调增加计算得到,例如,文件FILE_ID的第一段具有文件段IDFILE_ID_0,第二段文件段ID为FILE_ID_1,依此类推。根据采用的数据冗余策略,例如:多副本、纠删码或网络编码,来自文件的每个文件段被复制或编码成多个块,然后使用两次映射将其存储在存储节点中。对于多副本,每个块的数据与文件段相同;如果使用(K,M)编码方案,对于纠删码或网络编码,该文件段首先划分为K个原始数据块,然后编码生成M个编码/校验块。由文件段生成的所有块在存储系统层通过SEGMENT_ID进行检索,在存储节点的本地文件系统中由SEGMENT_ID和BLOCK_ID组合进行检索。在本发明中,我们称每个块为对象。对于每个文件段,
(1)客户端使用哈希函数计算VG的ID(VG_ID),即:从段/对象到VG的第一映射:
VG_ID=Hash(SEGMENT_ID)%VG_NUMBER。
(2)客户端从其中一个管理节点检索该VG的主存储节点的ID,管理节点将保持从VG到其合格存储节点列表的映射。对管理节点的选择可以通过哈希函数进行负载均衡:
Management_Node_Id=Hash(VG_ID)%
MANAGEMENT_NODE_NUMBER。
(3)客户端将文件段/对象发送到主节点,对象的初始版本是0。
(4)收到该文件段后,主节点查找VG_ID目录,检查是否具有与新对象文件段相同ID的对象存在。如果是,则主节点拒绝该请求并回复该对象当前版本。如果对象不存在于VG_ID目录中,则主节点将当前VG事务ID增加1,组合<TransactionID,VGVersion,ObjectOperationItem>以形成事务,将新事务附加到报送事务表中(STL),并将STL的长度增加1。OBJECT_OPERATION_ITEM包括但不限于ObjectID、BlockID、ObjectVersion、VGVersion、ObjectOperationCode=WIRTE、ObjectOperationOffset、ObjectOperationLength、ObjectOperationDataBuffer。ObjectOperationCode包括但不限于WRITE、READ、UPDATE、DELETE等。对于写入操作,ObjectOperationCode是WRITE。为加快响应速度,STL可以存储在高性能介质如SSD的日志文件中。在本发明的优选实施例中,STL可以使用日志分级机制(见下文)来实现。
(5)对于每个辅助节点,根据VG合格节点列表中该辅助节点的位置,主节点修改事务中的BlockID,并使用消息机制转发修改的事务到相应的辅助节点,事务中BlockID会表明辅助节点在VG合格节点中的位置。每一个对象请求称为一个事务,事务组成一般包括:事务ID、虚拟组ID、虚拟组版本、对象ID、对象版本、对象数据和操作类型。
(6)在接收到事务/请求消息后,每个辅助节点检查事务中VG版本VGVersion及其当前VG版本CurrentVGVersion,如果VGVersion<CurrentVGVersion则拒绝请求,否则,将事务附加到此VG的本地提交事务表(STL),并向主节点发送成功确认消息。
(7)从所有辅助节点接收到(写入)成功的确认消息后,主节点向客户端发送确认消息以证实请求的文件段已经正确的存储在存储系统中,然后异步地将文件段或其编码块作为单个文件保存到本地磁盘中,我们称为提交COMMIT。同时,主节点向所有辅助节点发送COMMIT请求,以使它们持久化保存与之对应的块。
(8)在上步之后,事务安全地存储在存储系统中,但每个事务中包含的 对象可能尚未保存到其最终目的地:本地文件系统。但对于客户端来说,该对象已成功存储在系统中。
接下来,我们描述可以在对象读取、更新、删除过程中调用的事务提交机制(COMMIT)和日志分级机制。
事务提交机制包含主节点处理过程、辅助节点处理过程、替换节点处理过程。
主节点处理过程:
(1)获取主节点的一个VG报送事务表(STL)的第一个未提交的事务条目。
(2)读取此事务的VGVersion,如果事务VGVersion等于VG当前版本CurrentVGVersion,在接收到所有来自辅助节点的确认信息后,如果事务中的对象不存在(对象写操作ObjectOperationCode==WRITE),主节点将此事务的对象数据(ObjectOperationDataBuffer)存储到本地文件系统名为ObjectID.BlockID.ObjectVersion的文件中。如果对应于该对象的文件存在,事务操作码为更新(ObjectOperationCode==UPDATE),且对象版本等于对象当前版本加1(ObjectVersion==CurrentObjectVersion+1),此事务为合法事务,否则返回错误给客户端,拒绝执行此事务。如果事务对应对象是数据块(多副本策略文件或纠删码/网络编码原生数据未编码部分),该事务中的数据将在<OBJECT_OPERATION_OFFSET,OBJECT_OPERATIN_LENGTH>区间覆盖该文件中的数据。如果事务对应对象是编码块,需合并(异或操作)事务中的数据与当前文件<偏移量ObjectOperationOffset,长度ObjectOperationLength>区间内的数据。对应此事务的文件名被修改为ObjectID.BlockID.ObjectVersion。主节点标记该事务为已提交(COMMITED)。本发明替换节点选择算法确保主节点以较高概率从所有辅助节点获得事务正确提交的确认消息。如果主节点在预设间隔(COMMIT_TIMEOUT)内没有得到所有确认,则将此事务标记为DESUBMITTED并视其为无效事务,发送DESUBMITTED消息通知所有辅助节点,以将此事务标记为DESUBMITTED。如果DESUBMITTED事务是日志文件中最后一个条目,它将被删除。
(3)由主节点选择过程可知,事务VGVersion不可能大于主节点中VG当前版本CurrentVGVersion。如果VGVersion<CurrentVGVersion,则该事务由前一个主节点发出。如果之前主节点未将该事务复制到所有辅助节点,则当前主节点可能需要修复辅助节点中的某些块。根据修复结果,主节点 决定是否继续事务或将事务标记为DESUBMITTED。
(4)对于多副本冗余策略,主节点将格式为<VGTransactionID,VGVersion,ObjectID,BlockID,ObjectVersion>的查询事务SEARCH_TRANSACTION消息发送到所有尚存的辅助节点,如果所有辅助节点都包含此事务,则主节点可以安全地提交此事务,其过程与步骤2类似。如果有任何辅助节点都没有此事务,则主节点将再次复制该事务到相应缺失辅助节点,其本应由先前主节点完成。
(5)对于纠删码码或网络编码冗余策略,主节点将格式为<VGTransactionID,VGVersion,ObjectID,BlockID,ObjectVersion>的查询事务SEARCH_TRANSACTION消息发送到所有尚存的辅助节点,如果所有辅助节点都包含此事务,主节点可安全地提交此事务,其过程与步骤2相似。如果有任何辅助节点没有该事务且当前冗余策略为纠删码时,主节点将尝试重新生成(解码修复)原始文件段并再次对该文件段进行编码以重新生成故障或未分发块;当冗余策略为网络编码时,主节点从幸存的辅助节点收集足够的幸存块以重新生成故障/未分发的块(无需恢复整个文件段);故障块修复完成后主节点重新生成事务并再次转发事务到相应缺失辅助节点,此事务也应由先前主节点完成的。
辅助节点处理过程:在收到格式为<TransactionID,VGVersion>的形式的事务提交请求CommitTransaction后,每个辅助节点都在报送事务列表STL中搜索事务CommitTransaction。如果找到TransactionID和VGVersion都符合的待提交事务,则辅助节点提交CommitTransaction(持久化存储此事务中对象数据到本地文件系统)。
替换节点处理过程:在接收到事务格式为<TransactionID,VGVersion>的事务提交请求时,故障宿主节点的替换节点在STL中搜索该事务。如果找到TransactionID和VGVersion都符合的待提交事务,则替换节点无需提交此事务,只需将事务标记为已提交COMMITTED。一旦替换节点成为主节点或辅助节点,STL中的事务将在故障宿主节点修复过程中提交。如果涉及更新(UPDATE操作)事务,宿主节点事务对象可能需要与替换节点中的恢复对象进行合并。
如果辅助节点发生任何故障,主节点将从管理节点更新VG合格节点列表。如果更新成功,每个故障辅助节点将被替换节点临时替换。主节点将重发事务到新的替换节点。在包括辅助节点及其替换节点的所有节点都返回成功执行事务的确认消息后,主节点回复客户端事务成功的消息。如 果管理节点中无法检索到故障节点的替换节点,主节点将无限期地重复查询过程,随之由客户端对该对象的请求将全部被拒绝。
日志分级机制。为实现低延迟,特别对更新操作,在本发明的优选实施例中,在高性能存储介质如SSD中将所有事务顺序地附加到提交的日志文件的末尾,即报送事务列表存储于高性能SSD中。SSD价格昂贵,同等价位容量也低于HDD,而HDD的硬盘价格便宜且容量大。考虑到这些差异,我们将处于COMMITTED或DESUBMITTED状态的事务移至在HDD中持久化存储的提交日志文件中。在报送日志文件中事务条目从SSD移动到HDD之前,我们删除对象数据并仅移动事务中的其它描述信息到提交日志文件中。提交日志文件中的事务信息可以被认为是存储在同一存储节点中的目标文件的元数据。提交日志文件可用于在发生永久故障时加速对象修复过程。因为提交日志文件记录了VG中提交的所有事务,所以永久性故障存储节点的替换节点可以从当前主节点或临时主节点获取提交日志文件,并获取需要修复的所有对象ID,否则,主节点或临时主节点需要扫描VG目录中的所有文件,这是非常耗时的。
提交事务列表(CTL)压缩。如上所述,CTL的主要目的是(永久)故障恢复。CTL包含描述存储在该VG中的对象的历史操作的所有事务信息。实际上,我们只需VG中每个对象的最新操作信息。因此,我们可以遍历CTL,删除同一对象的重复事务条目,并且仅保留可由事务ID和VG版本确定的最新事务(即具有最大VG版本和最大事务ID的事务)。同一对象事务的重复数据删除减少了提交日志文件大小,因此可以较少存储空间记录更多的对象事务。
对象读取过程。在主存储节点无故障情况下,客户端总是从主存储节点读取文件段。对于每个文件段,
(1)客户端计算VG_ID,利用第一映射将对象/文件段ID映射到VG:
VG_ID=HASH(SEGMENT_ID)%VG_NUMBER。
(2)客户端通过对VG_ID进行哈希运算选择一个管理节点:
Management_Node_Id=Hash(VG_ID)%
MANAGEMENT_NODE_NUMBER
(3)客户端从管理节点获取主节点ID,其中管理节点维护VG到VG合格存储节点列表的第二映射。
(4)客户端将读取请求发送到与此对象相对应的主节点。
(5)对于纠删码或网络编码数据保护方案,主节点随后从本地存储器和 K-1个幸存的辅助节点收集文件段的K个块,并重构该文件段。
(6)主存储节点将该文件段发送到客户端。
如果主节点发生故障,读取请求的处理信息将转移到临时主节点(TPN)。对于多副本数据保护策略,TPN将在本地存储中检索该对象的副本,并立即回复客户端。对于纠删码或网络编码,TPN将从VG合格节点列表中的前K幸存辅助节点收集K个块,重建原始对象并将该段/对象发送给客户端。故障期间的读取过程称为降级读取。
对象更新过程。
(1)假设客户端想要更新处于位置OFFSET处的对象,更新长度为LENGTH。客户端将当前版本加1以获取下一版本,即:
NewVersionClient=CurrentVersionClient+1。
(2)客户端计算从对象/段ID到VG的映射以获取VG_ID:
VG_ID=HASH(SEGMENT_ID)%VG_NUMBER。
(3)客户端通过哈希VG_ID来选择一个管理节点:
Management_Node_Id=Hash(VG_ID)%
MANAGEMENT_NODE_NUMBER。
(4)客户端从管理节点获得主节点ID,管理节点维护VG到VG合格节点表的第二映射。
(5)客户端将更新的数据部分发送到与此对象相对应的主节点。
(6)主节点从包含ObjectVersion的文件名获取此更新对象的当前版本CurrentVersionPrimary。如果NewVersionClient<=CurrentVersionPrimary,则拒绝来自客户端的更新请求。
(7)对于多副本数据保护,增加此VG的事务ID:VG_Transaction+1,设置更新事务ID TransactionID:TransactionID=VG_Transaction,组装事务<TransactionID,VGVersion,ObjectOperationItem>,将此事务追加到报送事务列表(STL),并将STL的长度加1。ObjectOperationItem包括但不限于<ObjectID,BlockID,ObjectVersion=NewVersionPrimary,ObjectOperationCode=UPDATE,ObjectOperationOffset,ObjectOperationLength,ObjectOperationDataBuffer>,并根据辅助节点在合格节点列表中的位置将事务转发到所有具有BlockID设置的辅助节点。
(8)在接收到事务后,每个辅助节点使用与对象写入过程相同的规则追加该事务。如果成功,则向主节点回复确认消息。
(9)对于纠删码或网络编码,主节点从本地存储或存储此对象更新部分 的其它辅助节点获取位置和长度<OFFSET,LENGTH>对应的旧数据,计算数据增量,即新旧数据异或:
Data_Delta=New_Data_at_OFFSET_LENGTH⊕Old_Data_at_OFFSET_LENGTH。
(10)主节点根据纠删码或网络编码算法中定义的方案,将每个更新的数量增量Data_Delta视为单个文件段/对象来计算校验块增量(Partity_Delta)。主节点按照步骤(7)将Data_Delta组装成事务,将事务附加到其本地STL或发送到负责此更新部分的辅助节点。然后,主节点组装Partity_Delta的事务并转发到与其对应的辅助节点,此过程与步骤(7)相同。辅助节点的处理与步骤(8)相同。
(11)在从所有辅助节点接收到成功执行事务的确认消息之后,主节点提交事务并向所有响应的辅助节点发送事务提交请求,包括存储Data_Delta和Parity_Deltas的所有辅助节点将执行此事务提交请求,将对象更新持久化存储到本地文件系统。
注意,提交更新数据过程在数据冗余方案为多副本或纠删码/网络编码保护方案时存在差异。当采用多副本时,主节点和辅助节点都会用新的更新数据部分覆盖旧数据部分。当采用纠删码或网络编码时,新的更新数据部分覆盖旧的数据部分,而每个新的校验增量块部分异或(XOR)对象中对应的旧校验部分。
对象删除过程。
(1)客户端将对象当前版本加1获得下一版本,即NewVersionClient=CurrentVersionClient+1。
(2)客户端计算从对象/段ID到VG的映射获取VG_ID:
VG_ID=HASH(SEGMENT_ID)%VG_NUMBER。
(3)客户端通过哈希VG_ID来选择一个管理节点:
Management_Node_Id=Hash(VG_ID)%
MANAGEMENT_NODE_NUMBER。
(4)客户端从管理节点获得主节点ID,管理节点维护VG到VG合格节点表的第二映射。
(5)客户端将包含<ObjectID,DELETE>的DELETE请求发送到其主节点。
(6)主节点获取此待删除对象CurrentVersionPrimary的当前版本。如果NewVersionClient!=CurrentVersionPrimary+1,则拒绝来自客户端的删除请 求。
(7)否则,主节点增加此VG的事务ID:VG_TransactionID++,获取此删除请求的事务ID:TransactionID=VG_TransactionID,将事务<TransactionID,VGVersion,ObjectID,BlockID,ObjectVersion=NewVersionClient,DELETE>追加到报送事务列表(STL)并根据辅助节点在合格节点列表中的位置将事务转发到所有具有对应BlockID设置的辅助节点。
(8)在接收到删除事务后,每个辅助节点使用与对象写入过程相同的规则追加该事务,如果成功,则向主节点回复确认消息。
(9)在从所有辅助节点接收到事务执行成功确认消息后,主节点提交DELETE事务并将事务提交请求发送到所有辅助节点。对于DELETE事务,主节点或辅助节点通常不会直接删除本地文件系统中的对象,而只会将该对象标记为DELETEING,例如,将DELETING标记添加到与该对象相对应的文件名中。真正的删除操作按照预设的策略由后台进程异步执行,例如,策略通常规定在一段时间后永久删除对象。
基于基本对象操作,包括对象写入、读取、更新、删除等过程,可以构建分布式文件存储系统,系统能提供包括但不限于写入、读取、更新、删除等文件操作。
以上应用具体实施例对本发明进行阐述,只是用于帮助理解本发明,并不用以限制本发明。对于本发明所属技术领域的技术人员,依据本发明的思想,还可以做出若干简单推演、变形或替换。

Claims (31)

  1. 一种基于多层一致性哈希的分布式数据存储系统,其特征在于,包括:
    多个提供数据存储和冗余保护的存储节点;
    多个维护所述存储节点属性及虚拟组-存储节点映射信息的管理节点;
    多个维护所述存储节点状态,处理所述存储节点的添加、删除和故障等状态变化的监控节点;和
    一个或多个提供应用程序或用户访问存储系统接入点的客户端。
  2. 根据权利要求1所述的系统,其特征在于,所述存储节点的属性包括:节点标识,父节点标识,层级类型,存储节点容量权重值、节点虚拟标识、节点所属主机标识、节点所属机架标识、节点所属集群标识、IP、Port和节点状态。
  3. 根据权利要求1所述的系统,其特征在于,所述存储节点基于每个所述存储节点的属性形成存储架构树,所述存储架构树有多层,每层包含不同种类的节点,例如,根层表示整个存储集群,设备层位于树底部,是数据存储的目的地。
  4. 根据权利要求3所述的系统,其特征在于,所述存储架构树的每一层都是其直接子节点的父层,父节点权重等于其所有直接子节点权重总和。
  5. 根据权利要求1所述的系统,其特征在于,所述管理节点维护基于哈希的映射信息,包括:
    从虚拟组到合格存储节点列表的映射信息;
    从虚拟组及故障存储节点到所述故障存储节点的替换节点的映射信息。
  6. 根据权利要求5所述的系统,其特征在于,所述合格存储节点列表由主节点和一个或多个辅助节点组成。
  7. 根据权利要求1所述的系统,其特征在于,所述监控节点作为协调者处理所述存储节点的状态变更。
  8. 根据权利要求1所述的系统,其特征在于,所述客户端发起对象访问 请求,并使用两阶段映射找到与此对象相对应的主节点。
  9. 根据权利要求8所述的系统,其特征在于,第一映射是将对象映射到作为对象容器的虚拟组。
  10. 根据权利要求8所述的系统,其特征在于,第二映射是将虚拟组映射到合格节点列表。
  11. 根据权利要求1所述的系统,其特征在于,所述客户端始终将包含对象操作的请求发送给主节点,所述主节点将请求重组并转发到同一合格节点列表中的辅助节点,所述客户端与所述主节点、所述主节点与所述辅助节点之间采用消息机制。
  12. 根据权利要求5所述的系统,其特征在于,每个虚拟组的合格存储节点列表由两种方法产生:
    所述主节点选择:以深度优先搜索的方式从集群层次结构中的根节点到叶节点选择节点;
    所述辅助节点选择:使用存储规则作为修剪策略,以深度优先搜索的方式从所述集群层次结构中的根节点到叶节点逐个选择所述辅助节点。
  13. 根据权利要求12所述的系统,其特征在于,所述存储规则包括:
    用于定义故障保护域的故障域级别;
    为虚拟组或对象定义每个层中存储的副本/块的最大数量的存储策略;
    定义用于选择节点的系统设置的负载均衡参数。
  14. 根据权利要求1所述的系统,其特征在于,所述存储节点以两种方式通过心跳消息交换其状态:
    在进行数据传输时,从主节点到辅助节点的转发请求同时被视为心跳包;
    空闲时,不包含任何对象数据的从所述主节点到所述辅助节点消息将用作心跳包,所述辅助节点向所述主节点发送回复以声明处于在线状态。
  15. 根据权利要求1所述的系统,其特征在于,所述存储节点与所述监控节点协作处理故障监测,对于每个虚拟组:
    虚拟组合格节点列表的主节点如果在预设的超时间隔内没有从辅助节点接收到确认消息,则向协调器监控节点报告所述辅助节点的故障信息;
    如果虚拟组合格节点列表的任何所述辅助节点在预设的超时间隔内没有从所述主节点接收到请求或心跳消息,则向所述协调器监控节点报告所述主节点的故障信息;
    虚拟组具有版本号,初始值为0,每次其合格节点列表中节点状态发生变化时单调加1;
    可以通过计算HASH(虚拟组标识)%(监控节点数)来选择所述协调器监控节点。
  16. 根据权利要求15所述的系统,其特征在于,所述协调器监控节点处理故障恢复,包括:
    临时主节点选择:使用提交事务列表或报送事务列表不变原则和主节点不变原则;
    替换节点选择:主节点或辅助节点的替换节点选择使用本地亲和性原则。
  17. 根据权利要求16所述的系统,其特征在于,所述提交事务列表或所述报送事务列表不变原则保证所选临时主节点在所有幸存的所述辅助节点中具有最长的所述提交事务列表和所述报送事务列表。
  18. 根据权利要求16所述的系统,其特征在于,所述主节点不变原则保证所述主节点始终以最大可能维持为主节点。
  19. 根据权利要求16所述的系统,其特征在于,所述本地亲和性原则保证新选择的候选者优先在同一层节点中选择。
  20. 根据权利要求16所述的系统,其特征在于,所述故障包括两种类型:
    临时故障:如果故障节点在预设时间TEMPORARY_FAILURE_INTERVAL内没有重新加入集群,则所述故障节点被标记为TEMPORARY_FAILURE的状态;
    永久故障:如果在PERMANENT_FAILURE_INTERVAL时间之后,所 述协调器监控节点仍然不能接收到来自所述故障节点重新连接的消息,则所述故障节点被标记为PERMANENT_FAILURE状态。
  21. 根据权利要求20所述的系统,其特征在于,
    所述临时故障由临时故障修复进程处理;
    所述永久故障由永久故障修复进程处理;
    所述临时故障修复进程和所述永久故障修复进程都由两阶段组成:
    分布式哈希表修复:使从虚拟组到合格存储节点列表的映射与集群中的所述存储节点成员状态变更现状保持一致;
    数据修复:根据修复的从所述虚拟组到所述合格存储节点的映射信息将数据移动到其应保存目标节点。
  22. 根据权利要求21所述的系统,其特征在于,分布式哈希表修复阶段包括:
    所述协调器监控节点遍历虚拟组-存储节点映射表以查找所述合格存储节点列表中具有所述故障节点的虚拟组;
    对于每一个虚拟组,将<虚拟组标识,故障节点标识,替换节点标识>插入到虚拟组-替换节点映射表中。
  23. 根据权利要求21所述的系统,其特征在于,分布式哈希表恢复阶段包括:
    遍历所述虚拟组-存储节点映射表,找到其合格节点列表中含有所述故障节点的所有虚拟组;
    对于每个虚拟组,遍历所述虚拟组-替换节点映射表,找到所述故障节点的唯一在线替换节点;
    对于每个虚拟组,用所选在线替换节点替换此虚拟组合格节点列表中的所述故障节点;
    待完成数据恢复后,从虚拟组-替换节点表中删除条目<虚拟组标识,故障节点标识,替换节点标识>。
  24. 根据权利要求21所述的系统,其特征在于,由每个虚拟组的主节点 或临时主节点负责协调数据修复,其修复过程包括:
    对于故障后新加入节点,如果没有替换节点存在,则不需要修复;
    对于故障后新加入节点,如果有所述替换节点并且有一个在线,则将报送事务列表和提交事务列表从相应的所述替换节点复制到重新加入的节点;
    对于处于永久故障状态的节点,其所述替换节点接管故障节点在虚拟组存储节点列表中的位置,并从主节点获取提交事务列表。
  25. 根据权利要求1所述的系统,其特征在于,所述存储节点的加入由添加节点过程处理,包括:
    对于分配给新增节点即前序节点的每个虚拟标识,查找同一层中沿顺时针方向属于另一个后继节点的虚拟标识;
    遍历虚拟组-存储节点映射表中的所有虚拟组,找到存储在所述后继节点中但其哈希值更接近所述前序节点虚拟标识的虚拟组,这些虚拟组应移至新添加节点,即前序节点;
    对于每个应移动到所述前序节点的虚拟组,在所述虚拟组-存储节点映射表中更新虚拟组对应的合格节点列表,即在将虚拟组中的数据移动到所述前序节点后,用所述前序节点替换所述合格节点列表中的所述后继节点。
  26. 根据权利要求1所述的系统,其特征在于,存储节点的移除或废弃由移除/废弃节点过程处理,包括:
    对于分配给待移除节点即前序节点的每个虚拟标识,找到同一层中沿顺时针方向属于另一个节点即后继节点的虚拟标识,虚拟标识所属节点即为待移除节点的后继节点;
    遍历虚拟组-存储节点映射表中的所有虚拟组,找到存储在所述前序节点中的虚拟组,这些虚拟组将被移到所述后继节点;
    对于移至所述后继节点的每个虚拟组,更新所述虚拟组-存储节点映射表中与所述虚拟组对应的所述合格节点列表,即在将所述虚拟组中的数据移动到所述后继节点后,用所述后继节点替换所述合格节点列表中的所述 前序节点。
  27. 根据权利要求1所述的系统,其特征在于,所述存储节点,所述管理节点和所述监控节点彼此协作以提供实现提供对象访问方法的存储抽象层,包括:
    对象写入过程:所述客户端将新对象上传到存储系统;
    对象读取过程:所述客户端从所述存储系统下载对象;
    对象更新过程:所述客户端修改所述存储系统中的已有对象;
    对象删除过程:所述客户端从所述存储系统中删除已有对象。
  28. 根据权利要求27所述的系统,其特征在于,所述对象写入过程包括:
    (1)所述客户端使用哈希函数和对象标识来计算虚拟组标识;
    (2)所述客户端从其中一个所述管理节点检索此所述虚拟组的主节点标识;
    (3)所述客户端将文件段或对象发送到所述主节点,对象的初始版本是0;
    (4)所述主节点收到对象后,检查对象存在性,如果对象已存在,则拒绝请求,否则,创建一个事务,并将新事务追加到报送事务列表;
    (5)对于每个辅助节点,所述主节点根据所述辅助节点在合格存储节点列表中位置修改事务中的快索引BlockID,并将修改后的事务转发到相应的所述辅助节点;
    (6)在收到事务或请求消息后,每个所述辅助节点检查虚拟组版本的一致性,如果事务中对象虚拟组版本小于事务对象映射到的当前虚拟组版本,则拒绝此事务请求,否则,将该事务附加到本地报送事务列表,并向所述主节点发送事务执行成功的确认消息;
    (7)从所有所述辅助节点接收到成功确认的消息后,所述主节点向所述客户端发送确认消息,以确认请求对象已正确存储在所述存储系统中,然后将该对象作为单个文件异步地保存到本地文件系统,并将事务追加到提交事务列表;同时,主节点向所有所述辅助节点发送提交事务请求,以使 所述辅助节点持久化对象到其本地文件系统并同样追加事务至其提交事务列表。
  29. 根据权利要求27所述的系统,其特征在于,所述对象读取过程包括:
    (1)所述客户端使用哈希函数和对象标识来计算虚拟组标识;
    (2)所述客户端通过对虚拟组标识计算哈希值选择一个所述管理节点;
    (3)所述客户端从所述管理节点获取主节点标识;
    (4)所述客户端将读取请求发送到与此对象相对应的所述主节点;
    (5)对于纠删码或网络编码数据保护方案,所述主节点随后从本地文件系统和K-1个幸存的所述辅助节点收集对象的K个数据块或编码块,并重构所述对象;对于多副本方案,所述主节点从本地存储获取所述对象副本;
    (6)主存储节点将所述对象发送给所述客户端。
  30. 根据权利要求27所述的系统,其特征在于,所述对象更新过程包括:
    (1)所述客户端计算对象的下一版本,即当前版本加1;
    (2)所述客户端使用哈希函数和对象标识来计算虚拟组标识;
    (3)所述客户端通过对虚拟组标识进行哈希运算选择一个所述管理节点;
    (4)所述客户端从所述管理节点获取所述主节点标识;
    (5)所述客户端将更新的数据部分发送到与此对象相对应的主节点;
    (6)所述主节点从文件名获取待更新对象的当前版本,如果新版本小于等于当前版本,则拒绝更新请求;
    (7)否则,对于多副本数据保护方案,主节点增加此虚拟组的事务索引:VG_Transaction,形成包含更新的新事务,将新事务附加到报送事务表,并将事务转发给所有辅助节点;
    (8)在接收到事务后,每个所述辅助节点使用与所述对象写入过程相同的规则追加该事务到报送事务列表,如果成功则向所述主节点回复事务执行成功的确认消息;
    (9)对于纠删码或网络编码,所述主节点从本地存储或一个其它所述辅 助节点获取旧数据,通过异或旧数据和新数据来计算更新增量Data_Delta,所述辅助节点为更新数据块对于存储节点;
    (10)所述主节点根据在纠删码或网络编码算法中定义的方案通过将每个更新数据增量Data_Delta视为单个文件段或对象来计算校验块增量,所述主节点使包含Data_Delta的事务附加到本地报送事务列表或将事务发送到所述辅助节点,然后所述主节点创建并转发包含校验块增量Parity_Delta的事务到其对应的所述辅助节点,所述辅助节点的处理与步骤(8)相同;
    (11)在从所有所述辅助节点接收到事务成功执行的确认消息后,主节点向所述客户端返回更新事务成功执行消息,随后,所述主节点异步提交事务或向更新增量对应的所述辅助节点发送事务提交请求,所述辅助节点包括存储Data_Delta的所述辅助节点和存储Parity_Delta的所有辅助节点。
  31. 根据权利要求27所述的系统,其特征在于,所述对象删除过程包括:
    (1)所述客户端计算所述对象的下一版本,即当前版本加1;
    (2)所述客户端使用哈希函数和所述对象标识计算虚拟组标识;
    (3)所述客户端通过对虚拟组标识进行哈希运算选择一个所述管理节点;
    (4)所述客户端从所述管理节点获取所述主节点标识;
    (5)所述客户端将包含<对象标识,DELETE>的DELETE请求发送到所述主节点;
    (6)所述主节点获取此待删除对象的当前版本,如果请求中对象新版本不等于当前版本加1,则拒绝该删除请求;
    (7)否则,所述主节点增加此虚拟组的事务索引,将DELETE事务附加到报送事务列表,并根据其在合格节点列表中的位置将事务转发到所有对应BlockID的辅助节点;
    (8)在收到事务后,每个所述辅助节点都会使用与所述对象写入过程相同的规则追加该事务,如果成功则向所述主节点回复确认消息;
    (9)在从所有所述辅助节点接收到所有事务执行成功的确认消息后,所 述主节点提交DELETE事务并将事务提交请求发送给所有所述辅助节点,对于DELETE事务提交,所述主节点或所述辅助节点通常不会直接删除本地文件系统中的所述对象,而只会将所述对象标记为DELETEING,真正的删除操作按照预先设定的策略执行,即通常在某个时间间隔后永久删除对象,且删除操作在后台进程中异步执行。
PCT/CN2018/095083 2018-07-10 2018-07-10 基于多层一致性哈希的分布式数据存储方法与系统 WO2020010503A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/095083 WO2020010503A1 (zh) 2018-07-10 2018-07-10 基于多层一致性哈希的分布式数据存储方法与系统
CN201880005526.5A CN110169040B (zh) 2018-07-10 2018-07-10 基于多层一致性哈希的分布式数据存储方法与系统
US17/059,468 US11461203B2 (en) 2018-07-10 2018-07-10 Systems and methods of handling node failure in a distributed data storage using multi-layer consistent hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/095083 WO2020010503A1 (zh) 2018-07-10 2018-07-10 基于多层一致性哈希的分布式数据存储方法与系统

Publications (1)

Publication Number Publication Date
WO2020010503A1 true WO2020010503A1 (zh) 2020-01-16

Family

ID=67644890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095083 WO2020010503A1 (zh) 2018-07-10 2018-07-10 基于多层一致性哈希的分布式数据存储方法与系统

Country Status (3)

Country Link
US (1) US11461203B2 (zh)
CN (1) CN110169040B (zh)
WO (1) WO2020010503A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113055495A (zh) * 2021-03-31 2021-06-29 杭州海康威视系统技术有限公司 一种数据处理方法、装置及分布式存储系统
CN113312139A (zh) * 2020-02-26 2021-08-27 株式会社日立制作所 信息处理系统和方法
CN115297131A (zh) * 2022-08-01 2022-11-04 东北大学 一种基于一致性哈希的敏感数据分布式储存方法

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110169040B (zh) * 2018-07-10 2021-09-28 深圳花儿数据技术有限公司 基于多层一致性哈希的分布式数据存储方法与系统
CN110531936B (zh) * 2019-08-29 2021-05-28 西安交通大学 基于多种存储介质的分布式纠删码混合存储的林型存储结构及方法
CN110737658B (zh) * 2019-09-06 2020-12-18 平安国际智慧城市科技股份有限公司 数据分片存储方法、装置、终端及可读存储介质
CN112579551B (zh) * 2019-09-30 2024-08-30 北京金山云网络技术有限公司 数据存储和读取方法、装置、客户端、管理服务器及系统
US20210103827A1 (en) * 2019-10-07 2021-04-08 International Business Machines Corporation Ontology-based data storage for distributed knowledge bases
CN112835885B (zh) * 2019-11-22 2023-09-01 北京金山云网络技术有限公司 一种分布式表格存储的处理方法、装置及系统
CN112887116A (zh) * 2019-11-29 2021-06-01 伊姆西Ip控股有限责任公司 管理分布式应用系统中的应用节点的方法、设备和产品
CN110737668B (zh) * 2019-12-17 2020-12-22 腾讯科技(深圳)有限公司 数据存储方法、数据读取方法、相关设备及介质
JP7191003B2 (ja) * 2019-12-17 2022-12-16 株式会社日立製作所 ストレージシステムおよびストレージ管理方法
CN111159193B (zh) * 2019-12-27 2023-08-29 掌迅亿通(北京)信息科技有限公司 多层一致性哈希环及其在创建分布式数据库中的应用
CN111209341B (zh) * 2020-01-07 2023-03-14 北京众享比特科技有限公司 区块链的数据存储方法、装置、设备及介质
CN111240899B (zh) * 2020-01-10 2023-07-25 北京百度网讯科技有限公司 状态机复制方法、装置、系统及存储介质
CN111309796B (zh) * 2020-02-07 2023-09-26 腾讯科技(深圳)有限公司 一种数据处理方法、装置以及计算机可读存储介质
EP4155946A4 (en) * 2020-05-18 2024-01-10 Cambricon (Xi'an) Semiconductor Co., Ltd. METHOD AND DEVICE FOR ALLOCATING STORAGE ADDRESS FOR STORED DATA
CN111858490B (zh) * 2020-07-22 2024-01-30 浪潮云信息技术股份公司 一种基于dbDedup的分布式数据库存储通信压缩方法
CN112000285B (zh) * 2020-08-12 2024-09-24 广州市百果园信息技术有限公司 强一致存储系统、数据强一致存储方法、服务器及介质
CN112311596B (zh) * 2020-10-22 2023-05-12 深圳前海微众银行股份有限公司 数据管理方法、装置、设备及计算机存储介质
CN112181732B (zh) * 2020-10-29 2024-09-10 第四范式(北京)技术有限公司 参数服务器节点的恢复方法和恢复系统
CN112307045A (zh) * 2020-11-11 2021-02-02 支付宝(杭州)信息技术有限公司 一种数据同步方法及系统
CN112910981B (zh) * 2021-01-27 2022-07-26 联想(北京)有限公司 一种控制方法及装置
CN112559257B (zh) * 2021-02-19 2021-07-13 深圳市中科鼎创科技股份有限公司 基于数据筛选的数据存储方法
CN115079935A (zh) * 2021-03-15 2022-09-20 伊姆西Ip控股有限责任公司 用于存储和查询数据的方法、电子设备和计算机程序产品
CN113157715B (zh) * 2021-05-12 2022-06-07 厦门大学 纠删码数据中心机架协同更新方法
CN113220236B (zh) * 2021-05-17 2024-01-30 北京青云科技股份有限公司 一种数据管理方法、系统及设备
EP4092963B1 (en) * 2021-05-20 2024-05-08 Ovh Method and system for datacenter network device maintenance
CN113360095B (zh) * 2021-06-04 2023-02-17 重庆紫光华山智安科技有限公司 硬盘数据管理方法、装置、设备及介质
CN113504881B (zh) * 2021-09-13 2021-12-24 飞狐信息技术(天津)有限公司 热点数据的处理方法、客户端、目标计算设备及装置
CN113779089A (zh) * 2021-09-14 2021-12-10 杭州沃趣科技股份有限公司 一种保持数据库热点数据方法、装置、设备及介质
CN113794558B (zh) * 2021-09-16 2024-02-27 烽火通信科技股份有限公司 一种XMSS算法中的L-tree计算方法、装置及系统
CN113868720A (zh) * 2021-09-27 2021-12-31 北京金山云网络技术有限公司 数据处理方法及装置
CN114327293B (zh) * 2021-12-31 2023-01-24 北京百度网讯科技有限公司 一种数据读方法、装置、设备以及存储介质
CN114297172B (zh) * 2022-01-04 2022-07-12 北京乐讯科技有限公司 一种基于云原生的分布式文件系统
CN114640690B (zh) * 2022-05-17 2022-08-23 浙江省公众信息产业有限公司无线运营分公司 一种文件存储方法、系统、介质和设备
CN115242819B (zh) * 2022-07-22 2024-07-16 济南浪潮数据技术有限公司 分布式存储的选路方法及相关组件
CN115145497B (zh) * 2022-09-06 2022-11-29 深圳市杉岩数据技术有限公司 一种基于分布式存储的卷数据在线迁移方法
CN116010430B (zh) * 2023-03-24 2023-06-20 杭州趣链科技有限公司 数据恢复方法、数据库系统、计算机设备和存储介质
CN117061324B (zh) * 2023-10-11 2023-12-15 佳瑛科技有限公司 一种业务数据处理方法以及分布式系统
CN117056431B (zh) * 2023-10-11 2024-02-09 中电数创(北京)科技有限公司 基于hbase亲和性计算的二阶段调度的分布式执行方法和系统
CN117614956B (zh) * 2024-01-24 2024-03-29 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) 一种分布式存储的网内缓存方法、系统以及储存介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102904948A (zh) * 2012-09-29 2013-01-30 南京云创存储科技有限公司 一种超大规模低成本存储系统
CN103929500A (zh) * 2014-05-06 2014-07-16 刘跃 一种分布式存储系统的数据分片方法
CN104378447A (zh) * 2014-12-03 2015-02-25 深圳市鼎元科技开发有限公司 一种基于哈希环的非迁移分布式存储方法及系统

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089293B2 (en) * 2000-11-02 2006-08-08 Sun Microsystems, Inc. Switching system method for discovering and accessing SCSI devices in response to query
US7716180B2 (en) * 2005-12-29 2010-05-11 Amazon Technologies, Inc. Distributed storage system with web services client interface
US9317536B2 (en) * 2010-04-27 2016-04-19 Cornell University System and methods for mapping and searching objects in multidimensional space
US8782339B2 (en) * 2010-10-11 2014-07-15 Open Invention Network, Llc Storage system having cross node data redundancy and method and computer readable medium for same
US10489412B2 (en) * 2012-03-29 2019-11-26 Hitachi Vantara Corporation Highly available search index with storage node addition and removal
US9104560B2 (en) * 2012-06-13 2015-08-11 Caringo, Inc. Two level addressing in storage clusters
US9256622B2 (en) * 2012-12-21 2016-02-09 Commvault Systems, Inc. Systems and methods to confirm replication data accuracy for data backup in data storage systems
US20150006846A1 (en) * 2013-06-28 2015-01-01 Saratoga Speed, Inc. Network system to distribute chunks across multiple physical nodes with disk support for object storage
US9697226B1 (en) * 2013-06-28 2017-07-04 Sanmina Corporation Network system to distribute chunks across multiple physical nodes
CN105659213B (zh) * 2013-10-18 2018-12-14 株式会社日立制作所 无共享分布式存储系统中的目标驱动独立数据完整性和冗余恢复
US9483481B2 (en) * 2013-12-06 2016-11-01 International Business Machines Corporation Files having unallocated portions within content addressable storage
US10417102B2 (en) * 2016-09-30 2019-09-17 Commvault Systems, Inc. Heartbeat monitoring of virtual machines for initiating failover operations in a data storage management system, including virtual machine distribution logic
CN109597567B (zh) * 2017-09-30 2022-03-08 网宿科技股份有限公司 一种数据处理方法和装置
CN110169040B (zh) * 2018-07-10 2021-09-28 深圳花儿数据技术有限公司 基于多层一致性哈希的分布式数据存储方法与系统
CN110169008B (zh) * 2018-07-10 2022-06-03 深圳花儿数据技术有限公司 一种基于一致性哈希算法的分布式数据冗余存储方法
CN110737658B (zh) * 2019-09-06 2020-12-18 平安国际智慧城市科技股份有限公司 数据分片存储方法、装置、终端及可读存储介质
CN110737668B (zh) * 2019-12-17 2020-12-22 腾讯科技(深圳)有限公司 数据存储方法、数据读取方法、相关设备及介质
US11663192B2 (en) * 2020-12-10 2023-05-30 Oracle International Corporation Identifying and resolving differences between datastores
US11481291B2 (en) * 2021-01-12 2022-10-25 EMC IP Holding Company LLC Alternative storage node communication channel using storage devices group in a distributed storage system
CN114816225A (zh) * 2021-01-28 2022-07-29 北京金山云网络技术有限公司 存储集群的管理方法、装置、电子设备及存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102904948A (zh) * 2012-09-29 2013-01-30 南京云创存储科技有限公司 一种超大规模低成本存储系统
CN103929500A (zh) * 2014-05-06 2014-07-16 刘跃 一种分布式存储系统的数据分片方法
CN104378447A (zh) * 2014-12-03 2015-02-25 深圳市鼎元科技开发有限公司 一种基于哈希环的非迁移分布式存储方法及系统

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312139A (zh) * 2020-02-26 2021-08-27 株式会社日立制作所 信息处理系统和方法
CN113312139B (zh) * 2020-02-26 2024-05-28 株式会社日立制作所 信息处理系统和方法
CN113055495A (zh) * 2021-03-31 2021-06-29 杭州海康威视系统技术有限公司 一种数据处理方法、装置及分布式存储系统
CN113055495B (zh) * 2021-03-31 2022-11-04 杭州海康威视系统技术有限公司 一种数据处理方法、装置及分布式存储系统
CN115297131A (zh) * 2022-08-01 2022-11-04 东北大学 一种基于一致性哈希的敏感数据分布式储存方法

Also Published As

Publication number Publication date
US11461203B2 (en) 2022-10-04
US20210208987A1 (en) 2021-07-08
CN110169040B (zh) 2021-09-28
CN110169040A (zh) 2019-08-23

Similar Documents

Publication Publication Date Title
WO2020010503A1 (zh) 基于多层一致性哈希的分布式数据存储方法与系统
US11775392B2 (en) Indirect replication of a dataset
US10860547B2 (en) Data mobility, accessibility, and consistency in a data storage system
US11288248B2 (en) Performing file system operations in a distributed key-value store
US9305072B2 (en) Information storage system and data replication method thereof
US10268593B1 (en) Block store managamement using a virtual computing system service
JP5918243B2 (ja) 分散型データベースにおいてインテグリティを管理するためのシステム及び方法
CA2913036C (en) Index update pipeline
US7788303B2 (en) Systems and methods for distributed system scanning
US8707098B2 (en) Recovery procedure for a data storage system
US20200117362A1 (en) Erasure coding content driven distribution of data blocks
US10185507B1 (en) Stateless block store manager volume reconstruction
US9411682B2 (en) Scrubbing procedure for a data storage system
JP2013544386A5 (zh)
US10387273B2 (en) Hierarchical fault tolerance in system storage
US20190236302A1 (en) Augmented metadata and signatures for objects in object stores
US11507283B1 (en) Enabling host computer systems to access logical volumes by dynamic updates to data structure rules
US10223184B1 (en) Individual write quorums for a log-structured distributed storage system
US10921991B1 (en) Rule invalidation for a block store management system
US11467908B2 (en) Distributed storage system, distributed storage node, and parity update method for distributed storage system
US10809920B1 (en) Block store management for remote storage systems
CN112965859A (zh) 一种基于ipfs集群的数据灾备方法与设备
JP6671708B2 (ja) バックアップリストアシステム及びバックアップリストア方法
WO2023197937A1 (zh) 数据处理方法及其装置、存储介质、计算机程序产品
CN108366217B (zh) 监控视频采集存储方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926318

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 03.03.2021.)

122 Ep: pct application non-entry in european phase

Ref document number: 18926318

Country of ref document: EP

Kind code of ref document: A1