WO2020083106A1 - Node expansion method in storage system and storage system - Google Patents

Node expansion method in storage system and storage system

Info

Publication number
WO2020083106A1
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
node
partition group
data
partition
Prior art date
Application number
PCT/CN2019/111888
Other languages
French (fr)
Chinese (zh)
Inventor
肖建龙
王锋
王奇
王晨
谭春华
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201811571426.7A (external priority; CN111104057B)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP19877173.5A (EP3859506B1)
Publication of WO2020083106A1
Priority to US17/239,194 (US20210278983A1)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • This application relates to the storage field, and more specifically, to a node expansion method in a storage system and to a storage system.
  • Capacity expansion is required when the storage system has insufficient free space.
  • When a new node joins the storage system, the original nodes migrate a part of their partitions and the corresponding data to the new node. Data migration between storage nodes inevitably consumes bandwidth.
  • This application provides a node expansion method and a storage system that can save bandwidth between storage nodes.
  • According to a first aspect, a node expansion method in a storage system is provided. The storage system includes one or more first nodes, and each first node stores both data and the metadata of the data. According to this method, the first node configures a data partition group and a metadata partition group for the node; the data partition group includes multiple data partitions, the metadata partition group includes multiple metadata partitions, and the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group.
  • Here, "subset" means that the number of data partitions included in the data partition group is less than the number of metadata partitions included in the metadata partition group: the metadata corresponding to one part of the metadata partitions in the metadata partition group describes the data corresponding to this data partition group, while the metadata corresponding to the other part describes the data corresponding to other data partition groups.
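  • The subset relationship above can be sketched as follows. This is a minimal illustrative model, not the patent's fixed layout: the group sizes 64 and 32 are assumed, and partition groups are modeled simply as sets of partition identifiers.

```python
# A minimal model of the "subset" relationship (group sizes 64 and 32 are
# assumed for illustration; metadata partition i describes the data of the
# data partition with the same identifier i).

def make_group(start, count):
    """Model a partition group as the set of its partition identifiers."""
    return set(range(start, start + count))

metadata_group = make_group(0, 64)   # 64 metadata partitions
data_group_1 = make_group(0, 32)     # 32 data partitions
data_group_2 = make_group(32, 32)    # another 32 data partitions

# The metadata describing either data partition group is a proper subset of
# the metadata partition group: one part of the metadata partitions serves
# data_group_1, the other part serves data_group_2.
assert data_group_1 < metadata_group
assert data_group_2 < metadata_group
assert data_group_1 | data_group_2 == metadata_group
```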
  • When a second node joins the storage system, the first node splits the metadata partition group into at least two sub-metadata partition groups, and migrates a first sub-metadata partition group among the at least two sub-metadata partition groups, together with the metadata corresponding to the first sub-metadata partition group, to the second node.
  • Because the split-off sub-metadata partition group and its corresponding metadata are migrated to the second node, and the amount of metadata is much smaller than the amount of the data it describes, bandwidth between the nodes is saved compared with migrating the data to the second node as in the prior art.
  • In addition, because the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group, even after the metadata partition group is split into at least two sub-metadata partition groups during capacity expansion, it can still be ensured to a certain extent that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to a single sub-metadata partition group. After one of the sub-metadata partition groups and its corresponding metadata are migrated to the second node, the data corresponding to the data partition group is still described by metadata stored in one node, which avoids having to modify metadata on different nodes when the data is modified, especially when garbage collection is performed.
  • The first node obtains the layout of the metadata partition groups after capacity expansion and the layout of the metadata partition groups before capacity expansion.
  • The layout after capacity expansion includes, after the second node joins the storage system, the number of sub-metadata partition groups configured on each node in the storage system and the number of metadata partitions included in each sub-metadata partition group.
  • The layout before capacity expansion includes the number of metadata partition groups configured on the first node before the second node joins the storage system and the number of metadata partitions included in each metadata partition group.
  • The first node splits the metadata partition group into the at least two sub-metadata partition groups according to the layout after capacity expansion and the layout before capacity expansion.
  • Optionally, the first node also splits the data partition group into at least two sub-data partition groups, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to a sub-metadata partition group. Splitting the data partition group into smaller sub-data partition groups prepares for the next expansion, so that the metadata of the data corresponding to a sub-data partition group is always a subset of the metadata corresponding to a sub-metadata partition group.
  • When the second node joins the storage system, the first node keeps the data partition group, and the data corresponding to the data partition group continues to be stored in the first node. Since only the metadata is migrated and the data is not, and the volume of metadata is usually much smaller than the volume of the data, bandwidth between the nodes is saved.
  • The metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups.
  • According to a second aspect, a node expansion apparatus is provided for implementing the method provided in the first aspect and any of its implementations.
  • According to a third aspect, a storage node is provided for implementing the method provided in the first aspect and any of its implementations.
  • According to a fourth aspect, a computer program product for a node expansion method is provided, including a computer-readable storage medium storing program code, where the instructions included in the program code are used to perform the method described in the first aspect and any of its implementations.
  • According to a fifth aspect, a storage system is provided that includes at least a first node and a third node.
  • In this system, data and the metadata describing the data are stored in different nodes; for example, the data is stored in the first node, and the metadata of the data is stored in the third node.
  • The first node is used to configure a data partition group, where the data partition group corresponds to the data.
  • The third node is used to configure a metadata partition group, where the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the configured metadata partition group.
  • When a second node joins the storage system, the third node splits the metadata partition group into at least two sub-metadata partition groups, and migrates a first sub-metadata partition group among them, together with the metadata corresponding to the first sub-metadata partition group, to the second node.
  • Although the data and the metadata of the data are stored in different nodes, because the data partition group and the metadata partition group are configured in the same way as in the first aspect, after the migration it can still be ensured that the metadata of the data corresponding to any one data partition group is stored in one node, and there is no need to go to two nodes to obtain or modify the metadata.
  • According to a sixth aspect, a node expansion method is provided, which is applied to the storage system provided in the fifth aspect; the first node in the storage system performs the functions provided in the fifth aspect.
  • According to a seventh aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the fifth aspect and is used to perform the functions provided in the fifth aspect.
  • According to an eighth aspect, another node expansion method in a storage system is provided. The storage system includes one or more first nodes, and each first node stores both data and the metadata of the data. The first node includes at least two metadata partition groups and at least two data partition groups, where the metadata corresponding to each metadata partition group is used to describe the data corresponding to one of the data partition groups. The first node configures the metadata partition groups and the data partition groups so that the number of metadata partitions included in a metadata partition group is equal to the number of data partitions included in a data partition group.
  • When a second node joins the storage system, a first metadata partition group of the at least two metadata partition groups, together with the metadata corresponding to the first metadata partition group, is migrated to the second node, while the data corresponding to the at least two data partition groups continues to be kept in the first node.
  • In this way, the metadata of the data corresponding to any one data partition group is stored in one node, and there is no need to go to two nodes to obtain or modify the metadata.
  • According to a ninth aspect, a node expansion method is provided, which is applied to the storage system provided in the eighth aspect; the first node in the storage system performs the functions provided in the eighth aspect.
  • According to a tenth aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the eighth aspect and is used to perform the functions provided in the eighth aspect.
  • FIG. 1 is a schematic diagram of a scenario to which the technical solutions of embodiments of the present invention can be applied.
  • FIG. 2 is a schematic diagram of a storage unit according to an embodiment of the invention.
  • FIG. 3 is a schematic diagram of a metadata partition group and a data partition group provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of another metadata partition group and data partition group provided by an embodiment of the present invention.
  • FIG. 5 is a schematic layout diagram of a metadata partition before expansion provided by an embodiment of the present invention.
  • FIG. 6 is a schematic layout diagram of a metadata partition after capacity expansion provided by an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a node expansion method according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a node expansion device provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a storage node provided by an embodiment of the present invention.
  • In the embodiments, when capacity is expanded, metadata is migrated to the newly added node while the data continues to be stored in the original node. Through configuration, it is always ensured that the metadata of the data corresponding to a data partition group is a subset of the metadata corresponding to a metadata partition group, so that the data corresponding to one data partition group is described only by metadata stored in one node, thereby saving bandwidth.
  • the technical solutions of the embodiments of the present application can be applied to various storage systems.
  • the technical solutions of the embodiments of the present application are described below by taking a distributed storage system as an example, but the embodiments of the present application are not limited thereto.
  • In a distributed storage system, data is distributed and stored on multiple storage nodes, and the storage nodes share the storage load. This storage method not only improves the reliability, availability, and access efficiency of the system, but is also easy to expand.
  • the storage device is, for example, a storage server, or a combination of a storage controller and a storage medium.
  • FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of the present application can be applied.
  • a client server 101 and a storage system 100 communicate with each other.
  • the storage system 100 includes a switch 103 and a plurality of storage nodes (or “nodes” for short) 104 and the like.
  • the switch 103 is an optional device.
  • Each storage node 104 may include multiple hard disks or other types of storage media (such as solid-state hard disks or shingled magnetic recording) for storing data.
  • a distributed hash table (Distributed Hash Table, DHT) method is generally used for routing, but the embodiment of the present application does not limit this. That is to say, in the technical solutions of the embodiments of the present application, various possible routing methods in the storage system may be adopted.
  • In the distributed hash table method, the hash ring is evenly divided into several parts; each part is called a partition, and each partition corresponds to a storage space of a set size. It can be understood that the more partitions there are, the smaller the storage space corresponding to each partition, and the fewer partitions there are, the larger the storage space corresponding to each partition.
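  • As a minimal sketch of hash-based partitioning, the following maps an address onto a fixed number of partitions. The hash function (SHA-256) and the address format are illustrative assumptions; the patent does not prescribe a particular hash.

```python
import hashlib

TOTAL_PARTITIONS = 4096  # example value used later in this description

def partition_of(virtual_address: str) -> int:
    """Map a virtual address onto one of the equally sized hash-ring partitions."""
    digest = hashlib.sha256(virtual_address.encode()).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % TOTAL_PARTITIONS

# The same address always lands in the same partition.
p = partition_of("LU7:offset=1048576")
assert 0 <= p < TOTAL_PARTITIONS
assert partition_of("LU7:offset=1048576") == p
```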
  • In this embodiment, 4096 partitions are used as an example.
  • These partitions are divided into multiple partition groups, and each partition group contains the same number of partitions.
  • For example, partition group 143 includes partition 4066 to partition 4095.
  • Each partition group has its own identifier, which uniquely identifies the partition group.
  • Similarly, each partition has its own identifier, which uniquely identifies the partition.
  • The identifier may be a number, a character string, or a combination of a number and a character string.
  • Each partition group corresponds to one storage node 104.
  • Here, "corresponds" means that all data whose hash values locate them in the same partition group will be stored in the same storage node 104.
  • the client server 101 sends a write request to any storage node 104, where the write request carries the data to be written and the virtual address of the data.
  • The virtual address refers to the identifier of the logical unit (LU) to which the data is to be written and an offset; the virtual address is an address that is visible to the client server 101.
  • The storage node 104 that receives the write request performs a hash operation on the virtual address of the data to obtain a hash value, and the hash value uniquely determines a target partition. Once the target partition is determined, the partition group where the target partition is located is also determined. According to the correspondence between partition groups and storage nodes, the storage node that received the write request forwards the write request to the storage node corresponding to the partition group.
  • One partition group corresponds to one or more storage nodes.
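  • The forwarding step above can be sketched as a two-level lookup. All values here are illustrative assumptions: 32 partitions per group and a round-robin placement of partition groups on 4 nodes, not the patent's actual layout.

```python
# Toy routing table: 4096 partitions, 32 partitions per partition group,
# partition groups placed round-robin on 4 storage nodes (all values are
# illustrative, not taken from the patent).
PARTITIONS_PER_GROUP = 32
GROUP_TO_NODE = {g: g % 4 for g in range(4096 // PARTITIONS_PER_GROUP)}

def route(target_partition: int) -> int:
    """Forward a write request: partition -> partition group -> storage node."""
    group = target_partition // PARTITIONS_PER_GROUP
    return GROUP_TO_NODE[group]

assert route(0) == 0      # partition 0 -> group 0 -> node 0
assert route(100) == 3    # partition 100 -> group 3 -> node 3
```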
  • the corresponding storage node (in order to distinguish it from other storage nodes 104, referred to herein as the first storage node) writes the write request to its cache, and performs persistent storage when the conditions are met.
  • each storage node includes at least one storage unit.
  • the storage unit is a logical space, and the actual physical space still comes from multiple storage nodes.
  • FIG. 2 is a schematic structural diagram of a storage unit provided by this embodiment.
  • a storage unit is a collection containing multiple logical blocks.
  • A logical block is a spatial concept; its size is, for example, 4 MB, but is not limited to 4 MB.
  • One storage node 104 (still taking the first storage node as an example) uses or manages the storage space of the other storage nodes 104 in the storage system 100 in the form of logical blocks. Logical blocks on hard disks from different storage nodes 104 can form a logical block set.
  • The storage node 104 then divides the logical block set into a data storage unit and a check storage unit according to the set RAID (Redundant Array of Independent Disks) type.
  • The set of logical blocks containing the data storage unit and the check storage unit is called a storage unit.
  • The data storage unit includes at least two logical blocks, which are used to store data fragments.
  • The check storage unit includes at least one check logical block, which is used to store check fragments.
  • Assuming one logical block is taken from each of six storage nodes to form a logical block set, the first storage node groups the logical blocks in the set according to the RAID type (taking RAID 6 as an example): for example, logical block 1, logical block 2, logical block 3, and logical block 4 form a data storage unit, and logical block 5 and logical block 6 form a check storage unit. It can be understood that, according to the redundancy protection mechanism of RAID 6, when any two data units or check units fail, the failed units can be reconstructed from the remaining data units or check units.
  • When the data in the cache of the first storage node reaches a set threshold, the data is divided into multiple data fragments according to the set RAID type, check fragments are calculated, and the data fragments and check fragments are stored in the storage unit. These data fragments and their corresponding check fragments constitute a stripe.
  • One storage unit can store multiple stripes, and is not limited to the three stripes shown in FIG. 2. For example, when the data to be stored in the first storage node reaches 32 KB (8 KB * 4), the data is divided into 4 data fragments of 8 KB each, and 2 check fragments of 8 KB each are then calculated.
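  • The striping step can be sketched as follows. Note this is a simplification: RAID 6 computes two independent check fragments (P and Q, the latter typically via Reed-Solomon coding), while for brevity only a single XOR parity is shown here.

```python
def xor_bytes(*chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def make_stripe(data, data_frags=4):
    """Split cached data into equal data fragments plus one XOR parity fragment.
    RAID 6 actually needs two independent check fragments (P and Q); only the
    XOR parity P is sketched here."""
    size = len(data) // data_frags
    frags = [data[i * size:(i + 1) * size] for i in range(data_frags)]
    return frags, xor_bytes(*frags)

data = bytes(range(256)) * 128            # 32 KB of cached data
frags, parity = make_stripe(data)         # four 8 KB data fragments + parity
rebuilt = xor_bytes(parity, *frags[1:])   # recover a lost fragment from the rest
assert len(frags) == 4 and all(len(f) == 8192 for f in frags)
assert rebuilt == frags[0]
```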
  • The first storage node then sends each fragment to the storage node where it is located for persistent storage.
  • Logically, the data is written into the storage unit of the first storage node; physically, the data is still stored on multiple storage nodes. For each fragment, the identifier of the storage unit where it is located and its location inside the storage unit form the fragment's logical address, and the actual address of the fragment in the storage node is its physical address.
  • After the data is stored in the storage nodes, the description information of the data also needs to be stored so that the data can be found later.
  • When receiving a read request, a storage node usually finds the metadata of the data to be read according to the virtual address carried in the read request, and then further obtains the data to be read according to the metadata.
  • The metadata includes, but is not limited to: the correspondence between the logical address and the physical address of each fragment, and the correspondence between the virtual address of the data and the logical address of each fragment contained in the data.
  • The set of logical addresses of the fragments contained in the data is the logical address of the data.
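  • The two correspondences above might be modeled as follows. All field names, identifiers, and values here are illustrative assumptions; the patent does not define a concrete metadata layout.

```python
# Hypothetical shape of the two metadata mappings (field names and values
# are illustrative, not defined by the patent).

# Correspondence between a fragment's logical address and its physical address.
fragment_metadata = {
    # (storage unit id, offset in unit) -> (node, disk, physical offset)
    ("unit-1", 0):    ("node-3", "disk-7", 4096),
    ("unit-1", 8192): ("node-4", "disk-2", 65536),
}

# Correspondence between the data's virtual address and the logical
# addresses of the fragments that make up the data.
data_metadata = {
    # (LU id, offset in LU) -> list of fragment logical addresses
    ("LU7", 1048576): [("unit-1", 0), ("unit-1", 8192)],
}

# Serving a read: virtual address -> fragment logical addresses -> physical addresses.
frag_addrs = data_metadata[("LU7", 1048576)]
physical = [fragment_metadata[a] for a in frag_addrs]
assert physical[0] == ("node-3", "disk-7", 4096)
```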
  • The partition where the metadata is located is also determined according to the virtual address carried in the read request or the write request: a hash operation is performed on the virtual address to obtain a hash value, the hash value uniquely determines a target partition, the target partition group where the target partition is located is thereby determined, and the metadata to be stored is then sent to the storage node (for example, the first storage node) corresponding to the target partition group.
  • When the metadata to be stored in the first storage node reaches a set threshold (for example, 32 KB), the metadata is divided into 4 data fragments, 2 check fragments are calculated, and these fragments are then sent to multiple storage nodes.
  • the data partition and the metadata partition are independent of each other.
  • data has its own partitioning mechanism
  • metadata also has its own partitioning mechanism.
  • the total number of data partitions and the total number of metadata partitions are the same.
  • the total number of data partitions is 4096
  • the total number of metadata partitions is also 4096.
  • a partition corresponding to data is called a data partition
  • a partition corresponding to metadata is called a metadata partition.
  • the partition group corresponding to data is called a data partition group
  • the partition group corresponding to metadata is called a metadata partition group.
  • The metadata corresponding to a metadata partition is used to describe the data corresponding to the data partition with the same identifier.
  • For example, the metadata corresponding to metadata partition 1 is used to describe the data corresponding to data partition 1.
  • The metadata corresponding to metadata partition 2 is used to describe the data corresponding to data partition 2.
  • The metadata corresponding to metadata partition N is used to describe the data corresponding to data partition N, where N is an integer greater than or equal to 2.
  • the data and the metadata of the data may be stored in the same storage node, or may be stored in different storage nodes.
  • After the metadata is stored, when a storage node receives a read request, it can learn the physical address of the data to be read by reading the metadata. Specifically, when any storage node 104 receives a read request sent by the client server 101, that node performs a hash calculation on the virtual address carried in the read request to obtain a hash value, thereby obtaining the metadata partition corresponding to the hash value and the metadata partition group where it is located. Assuming that the storage unit corresponding to the metadata partition group belongs to the first storage node, the storage node 104 that received the read request forwards the read request to the first storage node. The first storage node reads the metadata of the data to be read from the storage unit; according to the metadata, it obtains from multiple storage nodes each fragment that constitutes the data to be read, verifies the fragments, aggregates them into the data to be read, and returns the data to the client server 101.
  • When capacity is expanded, the storage system 100 migrates partitions of the old storage nodes (abbreviated: old nodes) and the data corresponding to those partitions to the new nodes. For example, assuming that the storage system 100 originally had 8 storage nodes and is then expanded to 16 storage nodes, the original 8 storage nodes need to migrate half of their partitions, and the data corresponding to these partitions, to the 8 newly added storage nodes.
  • In this way, when the client server 101 sends a read request to a new node to read the data corresponding to data partition 1, although the data corresponding to data partition 1 has not been migrated to the new node, the new node can still find the physical address of the data to be read according to the metadata corresponding to metadata partition 1, and read the data from the old node.
  • In practice, partitions and their data are usually migrated in units of partition groups. If the metadata corresponding to a metadata partition group is less than the metadata used to describe the data corresponding to a data partition group, the same storage unit will be referenced by at least two metadata partition groups, which brings management inconvenience.
  • each metadata partition group in FIG. 3 contains 32 partitions, and each data partition group contains 64 partitions.
  • the data partition group 1 includes partition 0 to partition 63.
  • the data corresponding to partition 0 to partition 63 are all stored in the storage unit 1
  • the metadata partition group 1 includes partition 0 to partition 31
  • the metadata partition group 2 includes partition 32 to partition 63. It can be seen that all the partitions included in the metadata partition group 1 and the metadata partition group 2 are used to describe the data in the storage unit 1.
  • In other words, metadata partition group 1 and metadata partition group 2 both point to storage unit 1.
  • the metadata partition group 1 in the old node and its corresponding metadata are migrated to the new storage node.
  • the metadata partition group 1 in the new node points to the storage unit 1.
  • the metadata partition group 2 in the old node is not migrated, and still points to the storage unit 1.
  • the storage unit 1 is simultaneously referenced by the metadata partition group 2 in the old node and the metadata partition group 1 in the new node.
  • To avoid this, in this embodiment the number of partitions included in a metadata partition group is set to be greater than or equal to the number of partitions included in a data partition group.
  • In this way, the metadata corresponding to one metadata partition group is greater than or equal to the metadata used to describe the data corresponding to one data partition group.
  • each metadata partition group contains 64 partitions
  • each data partition group contains 32 partitions.
  • the metadata partition group 1 includes partitions 0-63
  • the data partition group 1 includes partitions 0-31
  • the data partition group 2 includes partitions 32-63.
  • the data corresponding to the data partition group 1 is stored in the storage unit 1
  • the data corresponding to the data partition group 2 is stored in the storage unit 2.
  • the metadata partition group 1 in the old node points to storage unit 1 and storage unit 2, respectively.
  • the metadata partition group 1 and its corresponding metadata are migrated to the new storage node.
  • The metadata partition group 1 in the new node points to storage unit 1 and storage unit 2, respectively. Since metadata partition group 1 no longer exists in the old node, its pointing relationship is deleted (indicated by a dotted arrow). It can be seen that storage unit 1 and storage unit 2 are each referenced by only one metadata partition group, which reduces management complexity.
  • In this embodiment, the metadata partition group and the data partition group are configured before capacity expansion, so that the number of partitions included in the metadata partition group is greater than the number of partitions included in the data partition group.
  • During capacity expansion, the metadata partition group in the old node is split into at least two sub-metadata partition groups, and then at least one sub-metadata partition group and its corresponding metadata are migrated to the new node.
  • Optionally, the data partition group in the old node is also split into at least two sub-data partition groups, so that the number of partitions included in a sub-metadata partition group remains greater than or equal to the number of partitions included in a sub-data partition group.
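  • The split step can be sketched as follows, under assumed sizes (a 64-partition metadata group and a 32-partition data group, with identifier-matched partitions as described earlier):

```python
# Sketch of splitting partition groups during expansion (sizes assumed).

def split_group(partitions, factor=2):
    """Split a partition group (a sorted list of partition IDs) into
    `factor` contiguous sub-groups of equal size."""
    size = len(partitions) // factor
    return [partitions[i * size:(i + 1) * size] for i in range(factor)]

sub_meta = split_group(list(range(64)), 2)   # two sub-metadata groups of 32
sub_data = split_group(list(range(32)), 2)   # two sub-data groups of 16

# The subset property survives the split: each sub-data group's metadata
# still falls inside a single sub-metadata group, so after one sub-metadata
# group migrates, no data partition group is described from two nodes.
assert set(sub_data[0]) <= set(sub_meta[0])
assert set(sub_data[1]) <= set(sub_meta[0])
```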
  • FIG. 5 is a distribution diagram of the metadata partition groups of each storage node before expansion.
  • The number of partition groups allocated to each storage node may be preset.
  • Alternatively, this embodiment may set each processing unit to correspond to a certain number of partition groups.
  • Here, a processing unit refers to a CPU, as shown in Table 1:
  • FIG. 6 is a distribution diagram of metadata partition groups of each storage node after capacity expansion.
  • the storage system 100 has 5 storage nodes at this time.
  • 5 storage nodes have a total of 40 processing units, and each processing unit is configured with 6 partition groups, so 5 storage nodes have a total of 240 partition groups.
  • To reach this distribution, some of the first partition groups on the three nodes that existed before expansion can each be split into two second partition groups, and then, following the per-node partition distribution shown in FIG. 6, some first partition groups and second partition groups are migrated from the 3 nodes to node 4 and node 5.
  • As shown in FIG. 5, there are 112 first partition groups in the storage system 100 before capacity expansion, and 16 first partition groups remain after capacity expansion, so 96 of the 112 first partition groups need to be split.
  • The 96 first partition groups are split into 192 second partition groups, so after the split there are 16 first partition groups and 224 second partition groups on the 3 nodes; each node then migrates part of its first partition groups and part of its second partition groups to node 4 and node 5.
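  • The counts in this example can be re-derived as follows. Note that the pre-expansion count of second partition groups (32) is inferred from the stated totals rather than given explicitly in the text:

```python
# Re-deriving the partition-group counts from the FIG. 5 / FIG. 6 example.
first_before, first_after = 112, 16
split = first_before - first_after        # 96 first partition groups are split
second_created = split * 2                # each split yields 2 second groups
second_before = 32                        # inferred: 224 - 192
second_after = second_before + second_created
assert split == 96
assert second_created == 192
assert second_after == 224
# After expansion: 5 nodes x 8 processing units x 6 partition groups each.
assert first_after + second_after == 5 * 8 * 6 == 240
```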
  • Taking processing unit 1 of node 1 as an example: according to FIG. 5, processing unit 1 is configured with 4 first partition groups and 3 second partition groups before expansion, while according to FIG. 6 it is configured with 1 first partition group and 5 second partition groups after expansion. This shows that three first partition groups in processing unit 1 need to be migrated out, or split into multiple second partition groups and then migrated out.
  • As for which partition groups are migrated or split, this embodiment does not impose any restrictions, as long as the partition distribution after migration meets that shown in FIG. 6.
  • The processing units of the remaining nodes migrate and split in the same way.
  • In the above, the three storage nodes that existed before expansion split some of their first partition groups into second partition groups and then migrate them to the new nodes. Alternatively, part of the first partition groups can be migrated to the new nodes first and split there; the partition distribution shown in FIG. 6 can also be achieved in this way.
  • The above description and the examples of FIG. 5 and FIG. 6 concern the metadata partition groups.
  • As described above, the number of data partitions contained in each data partition group needs to be smaller than the number of metadata partitions contained in the metadata partition group. Therefore, the data partition groups need to be split after the migration is completed, and the number of partitions included in each sub-data partition group after the split must be smaller than the number of metadata partitions included in the sub-metadata partition group.
  • The purpose of the split is to ensure that the metadata corresponding to the current metadata partition group always contains the metadata describing the data corresponding to the current data partition group. Since some of the metadata partition groups in the above example contain 32 metadata partitions and some contain 16, the number of data partitions contained in a sub-data partition group obtained after the split may be 16, 8, 4, or 2; in short, it cannot exceed 16.
  • garbage collection may be started.
  • garbage collection is performed in units of storage units.
• Select a storage unit as the object of garbage collection, migrate the valid data in this storage unit to a new storage unit, and then release the storage space occupied by the original storage unit.
• The selected storage unit needs to meet certain conditions, for example: the junk data contained in the storage unit reaches a first set threshold; or the storage unit is the one containing the most junk data among the plurality of storage units; or the valid data contained in the storage unit is lower than a second set threshold; or the storage unit is the one containing the least valid data among the plurality of storage units.
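The selection conditions above can be sketched as follows (a minimal illustration; the `StorageUnit` shape, thresholds, and fallback ordering are assumptions, not from the original description):

```python
# Sketch of the storage-unit selection conditions: garbage above a threshold,
# valid data below a threshold, or (fallback) the unit with the most garbage,
# which for equal-sized units is also the unit with the least valid data.

from dataclasses import dataclass

@dataclass
class StorageUnit:
    uid: int
    garbage_bytes: int
    valid_bytes: int

def select_gc_candidate(units, garbage_threshold=None, valid_threshold=None):
    """Pick a storage unit for garbage collection using any one of the
    conditions described in the text."""
    for u in units:
        # condition: junk data reaches the first set threshold
        if garbage_threshold is not None and u.garbage_bytes >= garbage_threshold:
            return u
        # condition: valid data is lower than the second set threshold
        if valid_threshold is not None and u.valid_bytes < valid_threshold:
            return u
    # fallback condition: the unit containing the most junk data
    return max(units, key=lambda u: u.garbage_bytes)

units = [StorageUnit(1, 10, 90), StorageUnit(2, 70, 30), StorageUnit(3, 40, 60)]
print(select_gc_candidate(units).uid)  # → 2 (most junk data)
```

Either threshold condition short-circuits the scan; with no thresholds, the fallback picks the unit with the most junk data.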
  • the selected storage unit for garbage collection is called a first storage unit or a storage unit 1.
• The execution subject of garbage collection is the storage node to which the storage unit 1 belongs (continuing to take the first storage node as an example).
• The first storage node reads valid data from the storage unit 1 and writes the valid data to a new storage unit. Then, it marks all data in the storage unit 1 as invalid, and sends a delete request to the storage node where each shard is located to delete the shard. Finally, the first storage node also needs to modify the metadata used to describe the data in the storage unit 1.
• As can be seen from the figure, the metadata corresponding to the metadata partition group 2 and the metadata corresponding to the metadata partition group 1 are both used to describe the data in the storage unit 1, and the metadata partition group 2 and the metadata partition group 1 are located in different storage nodes. Therefore, the first storage node needs to modify the metadata in the two storage nodes respectively. During the modification process, multiple read requests and write requests will be generated between the nodes, which seriously consumes bandwidth resources between the nodes.
  • the garbage collection method of the embodiment of the present invention will be described by taking garbage collection on the storage unit 2 as an example.
  • the execution subject of garbage collection is the storage node to which the storage unit 2 belongs (taking the second storage node as an example).
• The second storage node reads valid data from the storage unit 2 and writes the valid data into a new storage unit. Then, it marks all data in the storage unit 2 as invalid, and sends a delete request to the storage node where each shard is located to delete the shard.
• The second storage node also needs to modify the metadata describing the data in the storage unit 2. As can be seen from the figure, all of this metadata corresponds to the metadata partition group 1.
• Therefore, the second storage node only needs to send a request to the storage node where the metadata partition group 1 is located to modify the metadata. Compared with Example 1, since the second storage node only needs to modify metadata on one storage node, the bandwidth resources between the nodes are greatly saved.
• FIG. 7 is a flowchart of a node expansion method. This method is applied to the storage system shown in FIG. 1, which includes multiple first nodes.
  • the first node refers to a node that already exists in the storage system before capacity expansion. For details, refer to the node 104 shown in FIG. 1 or FIG. 2.
  • Each first node may perform the node expansion method according to the steps shown in FIG. 7.
• S701: Configure a data partition group and a metadata partition group of the first node.
  • the data partition group includes multiple data partitions
  • the metadata partition group includes multiple metadata partitions.
  • the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group.
• The subset here has two meanings: first, the metadata corresponding to the metadata partition group contains the metadata describing the data corresponding to the data partition group; second, the number of metadata partitions contained in the metadata partition group is larger than the number of data partitions contained in the data partition group.
  • the data partition group contains M data partitions, namely data partition 1, data partition 2, ... data partition M.
• The metadata partition group includes N metadata partitions, where N is greater than M, namely metadata partition 1, metadata partition 2, ..., metadata partition N.
  • the metadata corresponding to metadata partition 1 is used to describe the data corresponding to data partition 1
  • the metadata corresponding to metadata partition 2 is used to describe the data corresponding to data partition 2
• The metadata corresponding to metadata partition M is used to describe the data corresponding to data partition M. Therefore, the metadata partition group contains all the metadata describing the data corresponding to the M data partitions.
  • the metadata partition group also contains metadata describing data corresponding to other data partition groups.
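The subset relation configured in S701 can be illustrated with concrete numbers (M, N, and the one-to-one numbering are illustrative assumptions consistent with the text, not fixed by it):

```python
# Sketch of the subset relation: metadata partitions 1..N cover data
# partitions 1..M with M < N, so all metadata for the data partition group
# lives in one metadata partition group, while the remaining metadata
# partitions describe data corresponding to other data partition groups.

M, N = 4, 8
data_partitions = set(range(1, M + 1))        # data partition 1..M
metadata_partitions = set(range(1, N + 1))    # metadata partition 1..N

# metadata partition i describes data partition i, for i <= M
assert data_partitions <= metadata_partitions  # subset relation holds

# metadata partitions that describe other data partition groups
print(sorted(metadata_partitions - data_partitions))  # → [5, 6, 7, 8]
```

Because the covering set is strictly larger, the metadata partition group can later be split while one sub-group still covers all M data partitions.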
  • the first node described in S701 is the old node described in the expansion section.
• The layout of the metadata partition group after expansion includes the number of sub-metadata partition groups configured for each node in the storage system after the second node joins the storage system, and the number of metadata partitions contained in each sub-metadata partition group; the layout of the metadata partition group before expansion includes the number of metadata partition groups configured by the first node before the second node joins the storage system, and the number of metadata partitions contained in each metadata partition group. For specific implementation, refer to the description related to FIG. 5 and FIG. 6 in the capacity expansion section.
• The action of splitting, in actual implementation, refers to a change of mapping relationships. Specifically, before the split there is a mapping relationship between the identifier of the original metadata partition group and the identifier of each metadata partition contained in that group. After the split, identifiers of at least two sub-metadata partition groups are added, the mapping relationship between the identifiers of the metadata partitions contained in the original metadata partition group and the identifier of the original metadata partition group is deleted, a mapping relationship is created between the identifiers of one part of the metadata partitions contained in the original metadata partition group and the identifier of one sub-metadata partition group, and a mapping relationship is created between the identifiers of the other part of the metadata partitions and the identifier of the other sub-metadata partition group.
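The mapping change described above can be sketched as follows (a minimal illustration; the group identifiers and the half-and-half assignment policy are assumptions, not part of the original description):

```python
# Sketch of "split as mapping change": before the split, every partition ID
# maps to the original metadata partition group ID; after the split, that
# mapping is deleted and two new sub-group IDs each take part of the
# partitions.

def split_partition_group(mapping: dict, old_group: str, sub_a: str, sub_b: str):
    """mapping: metadata partition ID -> partition group ID (updated in place)."""
    partitions = sorted(p for p, g in mapping.items() if g == old_group)
    half = len(partitions) // 2
    for p in partitions[:half]:
        mapping[p] = sub_a   # one part maps to the first sub-group
    for p in partitions[half:]:
        mapping[p] = sub_b   # the other part maps to the second sub-group
    return mapping

mapping = {p: "PG0" for p in range(8)}       # 8 metadata partitions in PG0
split_partition_group(mapping, "PG0", "PG0-a", "PG0-b")
print(mapping)
```

No metadata moves in this step; only the attribution of partitions to groups changes, which is why the split itself is cheap.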
• S703: Migrate a sub-metadata partition group and its corresponding metadata to the second node.
  • the second node is the new node described in the expansion section.
• The migration of the partition group refers to a change of the attribution relationship. Specifically, migrating the sub-metadata partition group to the second node means modifying the correspondence between the sub-metadata partition group and the first node so that the sub-metadata partition group corresponds to the second node.
• The migration of metadata refers to the actual movement of data. Specifically, migrating the metadata corresponding to a sub-metadata partition group to the second node means copying the metadata to the second node and deleting the metadata retained by the first node.
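The two kinds of "migration" above can be contrasted in a short sketch (the `Node` class and all names are illustrative assumptions): moving the partition group is an attribution change, while moving the metadata copies the bytes and deletes the source copy.

```python
# Sketch of S703: migrating a sub-metadata partition group changes which node
# the group corresponds to; migrating the metadata itself copies it to the
# second node and removes the copy retained by the first node.

class Node:
    def __init__(self, name):
        self.name = name
        self.partition_groups = {}   # group ID -> metadata dict

def migrate_group(group_id, src: Node, dst: Node):
    metadata = src.partition_groups[group_id]
    # 1. Copy the metadata to the second node (actual data movement).
    dst.partition_groups[group_id] = dict(metadata)
    # 2. Delete the metadata retained by the first node; the group now
    #    corresponds to the second node (attribution change).
    del src.partition_groups[group_id]

first = Node("first")
second = Node("second")
first.partition_groups["sub-PG1"] = {"key1": "addr1"}
migrate_group("sub-PG1", first, second)
print("sub-PG1" in second.partition_groups, "sub-PG1" in first.partition_groups)
```

After the call the group and its metadata exist only on the second node, matching the copy-then-delete semantics in the text.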
• Because the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group, even if the metadata partition group is split into at least two sub-metadata partition groups, the metadata of the data corresponding to the data partition group is still a subset of the metadata corresponding to one of the sub-metadata partition groups. Then, when one of the sub-metadata partition groups and its corresponding metadata is migrated to the second node, the data corresponding to the data partition group is still described by metadata stored in a single node, which avoids modifying metadata on different nodes when the data is modified, especially when garbage collection is performed.
• After the migration, the data partition group is split into at least two sub-data partition groups, and the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to the sub-metadata partition group.
  • the definition of split here is the same as the split in S702.
• In the above scenario, the data and the metadata of the data are stored in the same node. In another scenario, the data and the metadata of the data are stored in different nodes. Then, for a given node, although it may also include a data partition group and a metadata partition group, the metadata corresponding to the metadata partition group may not be the metadata of the data corresponding to the data partition group, but rather the metadata of data stored on other nodes.
  • each first node still needs to configure the data partition group and the metadata partition group in the node.
• It is sufficient that the configured metadata partition group contains more metadata partitions than the number of data partitions contained in the data partition group.
• After the second node joins the storage system, each first node splits the metadata partition group according to the description of S702, and migrates a split sub-metadata partition group to the second node. Since each first node has configured its data partition group and metadata partition group in this way, after migration, the data corresponding to a data partition group can still be described by metadata stored in the same node. As a specific example, data is stored in the first node, and the metadata of the data is stored in the third node. Then, the first node configures the data partition group corresponding to the data, and the third node configures the metadata partition group corresponding to the metadata. After configuration, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the configured metadata partition group.
• The third node splits the metadata partition group into at least two sub-metadata partition groups, and migrates the first sub-metadata partition group of the at least two sub-metadata partition groups and the metadata corresponding to the first sub-metadata partition group to the second node.
• In the above scenario, the number of data partitions included in the data partition group is smaller than the number of metadata partitions included in the metadata partition group.
• In another scenario, the number of data partitions included in the data partition group is equal to the number of metadata partitions included in the metadata partition group. In this scenario, if the second node joins the storage system, there is no need to split the metadata partition group; instead, a part of the multiple metadata partition groups in the first node and the corresponding metadata are directly migrated to the second node. Similarly, there are two cases for this scenario.
• Case 1: if the data and the metadata of the data are stored in the same node, then each first node needs to ensure that the metadata corresponding to each metadata partition group includes only the metadata of the data corresponding to the data partition group in the node.
• Case 2: if the data and the metadata of the data are stored in different nodes, then each first node needs to configure the number of metadata partitions contained in the metadata partition group to be equal to the number of data partitions contained in the data partition group. In either case 1 or case 2, there is no need to split the metadata partition group; a part of the multiple metadata partition groups in the node and the corresponding metadata are directly migrated to the second node. However, this scenario is not suitable for nodes that contain only one metadata partition group.
  • the data partition group and its corresponding data do not need to be migrated to the second node.
• When the second node receives a read request, it can also use its stored metadata to find the physical address of the data to be read, so that the data is read. Since the data volume of metadata is much smaller than that of data, avoiding migrating the data to the second node greatly saves the bandwidth between the nodes.
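The read path just described can be sketched as follows (a minimal illustration; the key, address, and store structures are invented for the sketch and are not from the original description):

```python
# Sketch of the post-expansion read path: the second node holds only migrated
# metadata, which resolves a key to a physical address on the node that still
# stores the data, so the data itself never had to be migrated.

data_store_on_first_node = {"addr-42": b"hello"}      # data kept on first node
metadata_on_second_node = {"file.txt": "addr-42"}     # migrated metadata

def handle_read(key: str) -> bytes:
    # The second node looks up the physical address from its metadata...
    address = metadata_on_second_node[key]
    # ...and the data is then read from the first node, which still stores it.
    return data_store_on_first_node[address]

print(handle_read("file.txt"))  # → b'hello'
```

Only the small metadata table crossed the network during expansion; the bulk data stayed in place.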
  • FIG. 8 is a schematic structural diagram of the node capacity expansion device.
  • the device includes a configuration module 801, a split module 802, and a migration module 803.
  • the configuration module 801 is configured to configure a data partition group and a metadata partition group of the first node in the storage system.
  • the data partition group includes multiple data partitions
  • the metadata partition group includes multiple metadata partitions.
  • the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. Specifically, reference may be made to the description of S701 shown in FIG. 7.
  • the splitting module 802 is configured to split the metadata partition group into at least two sub-metadata partition groups when the second node joins the storage system. Specifically, please refer to the description of S702 shown in FIG. 7 and the description related to FIG. 5 and FIG. 6 in the capacity expansion part.
  • the migration module 803 is configured to migrate one sub-metadata partition group and corresponding metadata among the at least two sub-metadata partition groups to the second node. Specifically, reference may be made to the description of S703 shown in FIG. 7.
• The apparatus further includes an obtaining module 804, configured to obtain the layout of the metadata partition group after expansion and the layout of the metadata partition group before expansion, where the layout of the metadata partition group after expansion includes the number of sub-metadata partition groups configured for each node in the storage system after the second node joins the storage system, and the number of metadata partitions included in each sub-metadata partition group.
• The layout of the metadata partition group before capacity expansion includes the number of metadata partition groups configured by the first node before the second node joins the storage system, and the number of metadata partitions included in each metadata partition group.
  • the splitting module 802 is specifically configured to split the metadata partition group into at least two sub-metadata partition groups according to the layout of the metadata partition group after capacity expansion and the layout of the metadata partition group before capacity expansion.
• The splitting module 802 is further configured to split the data partition group into at least two sub-data partition groups after the at least one sub-metadata partition group and its corresponding metadata are migrated to the second node, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to the sub-metadata partition group.
  • the configuration module 801 is further configured to keep the data corresponding to the data partition group to continue to be stored in the first node when the second node joins the storage system.
  • the storage node may be a storage array or a server.
  • the storage node includes a storage controller and a storage medium.
• When the storage node is a server, reference may also be made to the schematic structural diagram of FIG. 9. No matter what kind of device the storage node is, it includes at least the processor 901 and the memory 902.
  • a program 903 is stored in the memory 902.
  • the processor 901, the memory 902 and the communication interface are connected through a system bus and complete communication with each other.
• The processor 901 is a single-core or multi-core central processing unit, or an application-specific integrated circuit, or one or more integrated circuits configured to implement the embodiments of the present invention.
• The memory 902 may be a high-speed RAM memory or a non-volatile memory, for example, at least one hard disk memory.
• The memory 902 is used to store computer-executable instructions.
• The program 903 may be included in the computer-executable instructions. When the storage node is running, the processor 901 runs the program 903 to execute the method flow of S701-S704 shown in FIG. 7.
  • the functions of the configuration module 801, the split module 802, the migration module 803, and the acquisition module 804 shown in FIG. 8 described above may be executed by the processor 901 running the program 903, or executed by the processor 901 alone.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
• The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, storage node, or data center to another website, computer, storage node, or data center via wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
• The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a storage node or a data center, that integrates one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, Solid State Disk (SSD)), or the like.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a division of logical functions.
• In actual implementation, there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
• If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
• The technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to enable a computer device (which may be a personal computer, a storage node, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application.
• The foregoing storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

Disclosed are a node expansion method in a storage system and a storage system. The storage system comprises a first node configured with a data partition group and a metadata partition group. The data partition group comprises a plurality of data partitions, and the metadata partition group comprises a plurality of metadata partitions. The metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. When a second node joins the storage system, the first node splits the metadata partition group into at least two sub-metadata partition groups, and a first sub-metadata partition group and the metadata corresponding thereto are migrated to the second node. The bandwidth overhead between storage nodes can thereby be reduced.

Description

Node expansion method in a storage system and storage system

Technical Field

This application relates to the storage field, and more specifically, to a node expansion method in a storage system and a storage system.

Background Art
In a distributed storage system scenario, capacity expansion is required when the free space of the storage system is insufficient. When a new node joins the storage system, the original nodes migrate a part of their partitions and the data corresponding to those partitions to the new node. Data migration between storage nodes inevitably consumes bandwidth.

Summary of the Invention

This application provides a node expansion method in a storage system and a storage system, which can save bandwidth between storage nodes.
In a first aspect, a node expansion method in a storage system is provided. The storage system includes one or more first nodes. Each first node stores both data and the metadata of that data. According to this method, the first node configures a data partition group and a metadata partition group of the node, where the data partition group includes multiple data partitions, the metadata partition group includes multiple metadata partitions, and the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. The meaning of the subset is that the number of data partitions included in the data partition group is smaller than the number of metadata partitions included in the metadata partition group, and the metadata corresponding to one part of the metadata partitions included in the metadata partition group is used to describe the data corresponding to the data partition group, while the metadata corresponding to another part of the metadata partitions is used to describe the data corresponding to other data partition groups. When a second node joins the storage system, the first node splits the metadata partition group into at least two sub-metadata partition groups, and migrates a first sub-metadata partition group of the at least two sub-metadata partition groups and the metadata corresponding to the first sub-metadata partition group to the second node.

According to the method provided in the first aspect, when the second node joins, the sub-metadata partition group obtained by splitting on the first node and its corresponding metadata are migrated to the second node. Since the data volume of metadata is much smaller than that of data, compared with migrating data to the second node as in the prior art, the bandwidth between the nodes is saved.

In addition, because the data partition group and the metadata partition group of the first node are configured such that the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group, even if the metadata partition group is split into at least two sub-metadata partition groups after expansion, it can still be guaranteed to a certain extent that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to one of the sub-metadata partition groups. After one of the sub-metadata partition groups and its corresponding metadata is migrated to the second node, the data corresponding to the data partition group is still described by metadata stored in a single node, which avoids modifying metadata on different nodes when the data is modified, especially when garbage collection is performed.
With reference to the first implementation of the first aspect, in a second implementation, the first node obtains the layout of the metadata partition group after expansion and the layout of the metadata partition group before expansion. The layout of the metadata partition group after expansion includes the number of sub-metadata partition groups configured for each node in the storage system after the second node joins the storage system, and the number of metadata partitions included in each sub-metadata partition group; the layout of the metadata partition group before expansion includes the number of metadata partition groups configured by the first node before the second node joins the storage system, and the number of metadata partitions included in each metadata partition group. The first node splits the metadata partition group into at least two sub-metadata partition groups according to the layout of the metadata partition group after expansion and the layout of the metadata partition group before expansion.

With reference to any one of the above implementations of the first aspect, in a third implementation, after the migration, the first node splits the data partition group into at least two sub-data partition groups, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to the sub-metadata partition group. Splitting the data partition group into smaller-granularity sub-data partition groups prepares for the next expansion, so that the metadata of the data corresponding to the sub-data partition group is always a subset of the metadata corresponding to the sub-metadata partition group.

With reference to any one of the above implementations of the first aspect, in a fourth implementation, when the second node joins the storage system, the first node keeps the data partition group and the data corresponding to the data partition group stored in the first node. Since only the metadata is migrated and the data is not, and the data volume of metadata is usually far smaller than that of data, the bandwidth between the nodes is saved.

With reference to the first implementation of the first aspect, in a fifth implementation, it is further clarified that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups. This ensures that the data corresponding to the data partition group is still described by metadata stored in a single node, which avoids modifying metadata on different nodes when the data is modified, especially when garbage collection is performed.
In a second aspect, a node expansion apparatus is provided for implementing the method provided by the first aspect and any one of its implementations.

In a third aspect, a storage node is provided for implementing the method provided by the first aspect and any one of its implementations.

In a fourth aspect, a computer program product of a node expansion method is provided, including a computer-readable storage medium storing program code, where the instructions included in the program code are used to perform the method described in the first aspect and any one of its implementations.
第五方面,提供了一种存储系统,所述存储系统至少包括第一节点和第三节点。并且,在所述存储系统中,数据和描述该数据的元数据分别存储在不同的节点中,例如数据存储在所述第一节点中,所述数据的元数据存储在所述第二节点中。第一节点,用于配置数据分区组,所述数据分区组对应所述数据,第三节点用于配置元数据分区组,所述配置后的数据分区组对应的数据的元数据是所述配置后的元数据分区组对应的元数据的子集。当第二节点加入所述存储系统时,所述第三节点将所述元数据分区组分裂成至少两个子元数据分区组,将所述至少两个子元数据分区组中的第一子元数据分区组及其所述第一子元数据分区组对应的元数据迁移至所述第二节点。According to a fifth aspect, a storage system is provided. The storage system includes at least a first node and a third node. Moreover, in the storage system, data and metadata describing the data are stored in different nodes, for example, data is stored in the first node, and metadata of the data is stored in the second node . The first node is used to configure a data partition group, the data partition group corresponds to the data, and the third node is used to configure a metadata partition group, and the metadata of the data corresponding to the configured data partition group is the configuration A subset of metadata corresponding to the subsequent metadata partition group. When the second node joins the storage system, the third node splits the metadata partition group into at least two sub-metadata partition groups, and divides the first sub-metadata among the at least two sub-metadata partition groups The partition group and the metadata corresponding to the first sub-metadata partition group are migrated to the second node.
在第五方面提供的存储系统中,虽然数据和所述数据的元数据存储在不同节点中,但由于对其数据分区组和元数据分区组进行了与第一方面相同的配置,在迁移以后仍然能满足任意一个数据分区组对应的数据的元数据均存储在一个节点中,不需要到两个节点去获取或修改元数据。In the storage system provided in the fifth aspect, although the data and the metadata of the data are stored in different nodes, since the data partition group and the metadata partition group are configured in the same way as the first aspect, after the migration The metadata that can still satisfy the data corresponding to any one data partition group is stored in one node, and there is no need to go to two nodes to obtain or modify the metadata.
According to a sixth aspect, a node capacity expansion method is provided, which is applied to the storage system provided in the fifth aspect, where the first node in the storage system performs the functions provided in the fifth aspect.
According to a seventh aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the fifth aspect and is configured to perform the functions provided in the fifth aspect.
According to an eighth aspect, a node capacity expansion method in a storage system is provided. The storage system includes one or more first nodes. Each first node stores both data and the metadata of that data. The first node includes at least two metadata partition groups and at least two data partition groups, where the metadata corresponding to each metadata partition group describes the data corresponding to one of the data partition groups. The first node configures the metadata partition groups and the data partition groups so that the number of metadata partitions included in a metadata partition group is equal to the number of data partitions included in a data partition group. When a second node joins the storage system, a first metadata partition group of the at least two metadata partition groups, together with the metadata corresponding to the first metadata partition group, is migrated to the second node, while the data corresponding to the at least two data partition groups remains in the first node.
In the storage system provided in the eighth aspect, after the migration, the metadata of the data corresponding to any one data partition group is still stored in a single node, and there is no need to visit two nodes to obtain or modify the metadata.
According to a ninth aspect, a node capacity expansion method is provided, which is applied to the storage system provided in the eighth aspect, where the first node in the storage system performs the functions provided in the eighth aspect.
According to a tenth aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the fifth aspect and is configured to perform the functions provided in the eighth aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of the present invention can be applied.
FIG. 2 is a schematic diagram of a storage unit according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a metadata partition group and a data partition group according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of another metadata partition group and data partition group according to an embodiment of the present invention.
FIG. 5 is a schematic layout diagram of metadata partitions before capacity expansion according to an embodiment of the present invention.
FIG. 6 is a schematic layout diagram of metadata partitions after capacity expansion according to an embodiment of the present invention.
FIG. 7 is a schematic flowchart of a node capacity expansion method according to an embodiment of the present invention.
FIG. 8 is a schematic structural diagram of a node capacity expansion apparatus according to an embodiment of the present invention.
FIG. 9 is a schematic structural diagram of a storage node according to an embodiment of the present invention.
DETAILED DESCRIPTION
In the embodiments of this application, metadata is migrated to the newly added node during capacity expansion, while the data remains stored in the original nodes. Through configuration, it is always ensured that the metadata of the data corresponding to a data partition group is a subset of the metadata corresponding to a metadata partition group, so that the data corresponding to one data partition group is described only by metadata stored in a single node. This achieves the purpose of saving bandwidth. The technical solutions in this application are described below with reference to the accompanying drawings.
The technical solutions of the embodiments of this application can be applied to various storage systems. In the following, the technical solutions of the embodiments of this application are described by taking a distributed storage system as an example, but the embodiments of this application are not limited thereto. In a distributed storage system, data is distributed across multiple storage nodes, which share the storage load. This storage method not only improves the reliability, availability, and access efficiency of the system, but is also easy to expand. A storage device is, for example, a storage server, or a combination of a storage controller and a storage medium.
FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of this application can be applied. As shown in FIG. 1, a client server 101 communicates with a storage system 100. The storage system 100 includes a switch 103, multiple storage nodes (or "nodes" for short) 104, and the like, where the switch 103 is an optional device. Each storage node 104 may include multiple hard disks or other types of storage media (for example, solid-state drives or shingled magnetic recording drives) for storing data. The embodiments of this application are described below in four parts.
1. The process of storing data
To ensure that data is stored evenly across the storage nodes 104, a distributed hash table (DHT) method is generally used for routing when a storage node is selected, but the embodiments of this application are not limited thereto; any routing method possible in a storage system may be used in the technical solutions of the embodiments of this application. In the distributed hash table method, the hash ring is evenly divided into several parts, each part is called a partition, and each partition corresponds to a storage space of a set size. It can be understood that the more partitions there are, the smaller the storage space corresponding to each partition; the fewer partitions there are, the larger the storage space corresponding to each partition. In practical applications, the number of partitions is often large (this embodiment takes 4096 partitions as an example). For ease of management, these partitions are divided into multiple partition groups, each containing the same number of partitions. When an absolutely even division is impossible, it suffices that the number of partitions in each partition group is basically the same. For example, the 4096 partitions are divided into 144 partition groups, where partition group 0 contains partition 0 to partition 27, partition group 1 contains partition 28 to partition 57, ..., and partition group 143 contains partition 4066 to partition 4095. A partition group has its own identifier, which uniquely identifies the partition group; likewise, a partition has its own identifier, which uniquely identifies the partition. The identifier may be a number, a character string, or a combination of a number and a character string. In this embodiment, each partition group corresponds to one storage node 104, where "corresponds" means that all data located by a hash value to the same partition group is stored in the same storage node 104.
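The even division of partitions into partition groups can be sketched as follows. This is an illustrative sketch, not code from the patent: it divides a fixed total number of partitions into consecutive groups whose sizes differ by at most one (the exact group boundaries in the embodiment above may differ slightly).

```python
def divide_into_groups(total_partitions: int, group_count: int) -> list[range]:
    """Return one consecutive partition range per group; sizes differ by at most 1."""
    base, extra = divmod(total_partitions, group_count)
    groups, start = [], 0
    for g in range(group_count):
        size = base + (1 if g < extra else 0)  # the first `extra` groups get one more
        groups.append(range(start, start + size))
        start += size
    return groups

groups = divide_into_groups(4096, 144)
assert sum(len(g) for g in groups) == 4096
assert max(len(g) for g in groups) - min(len(g) for g in groups) <= 1
```

Because 4096 is not divisible by 144, some groups hold 29 partitions and the rest 28, which is the "basically the same" situation described above.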
The client server 101 sends a write request to any storage node 104, where the write request carries the data to be written and the virtual address of the data. The virtual address refers to the identifier and offset of the logical unit (LU) to which the data is to be written, and is an address visible to the client server 101. The storage node 104 that receives the write request performs a hash operation on the virtual address of the data to obtain a hash value, and the hash value uniquely determines a target partition. Once the target partition is determined, the partition group to which the target partition belongs is also determined. According to the correspondence between partition groups and storage nodes, the storage node that received the write request can forward the write request to the storage node corresponding to the partition group. One partition group corresponds to one or more storage nodes. The corresponding storage node (referred to here as the first storage node, to distinguish it from the other storage nodes 104) writes the write request into its cache, and performs persistent storage when the conditions are met.
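The routing step described above can be sketched as follows. All names here (the hash function choice, the group-to-node map, the equal-sized groups) are illustrative assumptions, not details from the patent: the virtual address is hashed to a partition, the partition maps to a partition group, and the group maps to the storage node that handles the request.

```python
import hashlib

TOTAL_PARTITIONS = 4096
PARTITIONS_PER_GROUP = 28      # simplified: assume equal-sized partition groups
NODE_OF_GROUP = {}             # group id -> node id, maintained by the storage system

def route(virtual_address: str) -> int:
    """Map a virtual address (LU id + offset) to a storage node id."""
    digest = hashlib.sha256(virtual_address.encode()).digest()
    partition = int.from_bytes(digest[:8], "big") % TOTAL_PARTITIONS
    group = partition // PARTITIONS_PER_GROUP
    return NODE_OF_GROUP.get(group, group % 3)  # fallback mapping for a 3-node system

# The same virtual address always routes to the same partition, group, and node,
# so reads can later locate the data written by this request.
assert route("LUN7:offset4096") == route("LUN7:offset4096")
```

The key property is determinism: any node receiving a request can compute the same target node from the virtual address alone, without consulting the node that originally stored the data.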
In this embodiment, each storage node contains at least one storage unit. A storage unit is a segment of logical space whose actual physical space still comes from multiple storage nodes. FIG. 2 is a schematic structural diagram of a storage unit provided in this embodiment. A storage unit is a set of multiple logical blocks. A logical block is a concept of space; its size is 4 MB as an example, but is not limited to 4 MB. A storage node 104 (still taking the first storage node as an example) uses or manages the storage space of the other storage nodes 104 in the storage system 100 in the form of logical blocks. Logical blocks on hard disks from different storage nodes 104 can form a logical block set, and the storage node 104 then divides this logical block set into a data storage unit and a check storage unit according to a set redundant array of independent disks (RAID) type. The data storage unit includes at least two logical blocks for storing data shards, and the check storage unit includes at least one check logical block for storing check shards. A logical block set containing a data storage unit and a check storage unit is called a storage unit. Suppose one logical block is taken from each of six storage nodes to form a logical block set, and the first storage node then groups the logical blocks in this set according to the RAID type (taking RAID 6 as an example): for example, logical block 1, logical block 2, logical block 3, and logical block 4 form the data storage unit, and logical block 5 and logical block 6 form the check storage unit. It can be understood that, under the redundancy protection mechanism of RAID 6, when any two data units or check units fail, the failed units can be reconstructed from the remaining data units and check units.
When the data in the cache of the first storage node reaches a set threshold, the data can be divided into multiple data shards according to the set RAID type, check shards are computed, and the data shards and check shards are stored in the storage unit. These data shards and the corresponding check shards constitute a stripe. One storage unit can store multiple stripes and is not limited to the three stripes shown in FIG. 2. For example, when the data to be stored in the first storage node reaches 32 KB (8 KB * 4), the data is divided into 4 data shards of 8 KB each, and 2 check shards of 8 KB each are computed. The first storage node then sends each shard to the storage node where it belongs for persistent storage. Logically, the data is written into a storage unit of the first storage node; physically, the data is ultimately still stored in multiple storage nodes. For each shard, the identifier of the storage unit where it resides and its location within that storage unit form the logical address of the shard, and the actual address of the shard in the storage node is its physical address.
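The striping step can be sketched as follows. This is a minimal illustration under stated assumptions: real RAID 6 derives two independent check shards (e.g. Reed-Solomon P and Q parities); a single XOR parity is shown here only to keep the example self-contained, and the sizes match the 32 KB example above.

```python
SHARD_SIZE = 8 * 1024   # 8 KB per shard, as in the example above
DATA_SHARDS = 4

def make_stripe(data: bytes) -> list[bytes]:
    """Cut 32 KB of cached data into 4 data shards plus one XOR check shard.

    (RAID 6 as described in the text would compute a second, independent
    check shard as well.)
    """
    assert len(data) == SHARD_SIZE * DATA_SHARDS  # the 32 KB threshold is reached
    shards = [data[i * SHARD_SIZE:(i + 1) * SHARD_SIZE] for i in range(DATA_SHARDS)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*shards))
    return shards + [parity]

stripe = make_stripe(bytes(32 * 1024))
assert len(stripe) == 5 and all(len(s) == SHARD_SIZE for s in stripe)
```

Each shard in the returned stripe would then be sent to a different storage node for persistent storage, so that the loss of one node loses at most one shard of the stripe.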
2. The process of storing metadata
After data is stored in the storage nodes, description information about that data also needs to be stored so that the data can be found later; this description information is called metadata. When a storage node receives a read request, it usually finds the metadata of the data to be read according to the virtual address carried in the read request, and then obtains the data to be read according to the metadata. The metadata includes, but is not limited to: the correspondence between the logical address and the physical address of each shard, and the correspondence between the virtual address of the data and the logical addresses of the shards that make up the data. The set of logical addresses of the shards contained in the data is the logical address of the data.
Similar to the process of storing data, the partition where metadata resides is also determined according to the virtual address carried in a read request or write request. Specifically, a hash operation is performed on the virtual address to obtain a hash value, which uniquely determines a target partition and thus the target partition group to which the target partition belongs; the metadata to be stored is then sent to the storage node corresponding to the target partition group (for example, the first storage node). When the metadata to be stored in the first storage node reaches a set threshold (for example, 32 KB), the metadata is divided into 4 data shards, 2 check shards are computed, and these shards are sent to multiple storage nodes.
In this embodiment, data partitions and metadata partitions are independent of each other. In other words, data has its own partitioning mechanism, and metadata also has its own partitioning mechanism, but the total number of data partitions is the same as the total number of metadata partitions. For example, if the total number of data partitions is 4096, the total number of metadata partitions is also 4096. For ease of description, in the embodiments of the present invention a partition corresponding to data is called a data partition, and a partition corresponding to metadata is called a metadata partition; a partition group corresponding to data is called a data partition group, and a partition group corresponding to metadata is called a metadata partition group. Since both the metadata partition and the data partition are determined according to the virtual address carried in a read or write request, the metadata corresponding to a metadata partition describes the data corresponding to the data partition with the same identifier. For example, the metadata corresponding to metadata partition 1 describes the data corresponding to data partition 1, the metadata corresponding to metadata partition 2 describes the data corresponding to data partition 2, and the metadata corresponding to metadata partition N describes the data corresponding to data partition N, where N is an integer greater than or equal to 2. The data and its metadata may be stored in the same storage node or in different storage nodes.
After the metadata is stored, when a storage node receives a read request, it can learn the physical address of the data to be read by reading the metadata. Specifically, when any storage node 104 receives a read request sent by the client server 101, the node 104 performs a hash calculation on the virtual address carried in the read request to obtain a hash value, thereby obtaining the metadata partition corresponding to the hash value and its metadata partition group. Assuming that the storage unit corresponding to the metadata partition group belongs to the first storage node, the storage node 104 that received the read request forwards the read request to the first storage node. The first storage node reads the metadata of the data to be read from the storage unit, then obtains, according to the metadata, the shards that make up the data to be read from the multiple storage nodes, verifies them, reassembles them into the data to be read, and returns it to the client server 101.
3. Capacity expansion
As more and more data is stored in the storage system 100, its free storage space gradually decreases, so the number of storage nodes in the storage system 100 needs to be increased; this process is called capacity expansion. After a new storage node ("new node" for short) joins the storage system 100, the storage system 100 migrates some partitions of the old storage nodes ("old nodes" for short) and the data corresponding to those partitions to the new node. For example, if the storage system 100 originally had 8 storage nodes and is expanded to 16 storage nodes, each of the original 8 storage nodes needs to migrate half of its partitions and the corresponding data to the 8 newly added storage nodes. To save bandwidth resources between storage nodes, one current practice is to migrate only the metadata partitions and their corresponding metadata, without migrating the data partitions. After the metadata is migrated to the new storage node, since the metadata records the correspondence between the logical address and the physical address of the data, even if the client server 101 sends a read request to the new node, the location of the data in the old node can be found through this correspondence and the data can be read. For example, if the metadata corresponding to metadata partition 1 is migrated to the new node, when the client server 101 sends a read request to the new node to read the data corresponding to data partition 1, although the data corresponding to data partition 1 has not been migrated to the new node, the physical address of the data to be read can still be found according to the metadata corresponding to metadata partition 1, and the data can be read from the old node.
In addition, during node capacity expansion, partitions and their data are usually migrated in units of partition groups. If the metadata corresponding to a metadata partition group is less than the metadata that describes the data corresponding to a data partition group, the same storage unit will be referenced by at least two metadata partition groups, which causes management inconvenience.
Usually, the number of partitions contained in a metadata partition group is smaller than the number of partitions contained in a data partition group. Referring to FIG. 3, each metadata partition group in FIG. 3 contains 32 partitions, and each data partition group contains 64 partitions. Exemplarily, data partition group 1 contains partition 0 to partition 63, and the data corresponding to partition 0 to partition 63 is all stored in storage unit 1; metadata partition group 1 contains partition 0 to partition 31, and metadata partition group 2 contains partition 32 to partition 63. It can be seen that all the partitions contained in metadata partition group 1 and metadata partition group 2 describe the data in storage unit 1. Before capacity expansion, metadata partition group 1 and metadata partition group 2 each point to storage unit 1. After the new node joins, suppose metadata partition group 1 in the old node and its corresponding metadata are migrated to the new storage node. After the migration, metadata partition group 1 no longer exists in the old node, its pointing relationship is deleted (indicated by a dotted arrow), and metadata partition group 1 in the new node points to storage unit 1. Metadata partition group 2 in the old node is not migrated and still points to storage unit 1. Therefore, after the expansion, storage unit 1 is referenced both by metadata partition group 2 in the old node and by metadata partition group 1 in the new node. When the data in storage unit 1 changes, the corresponding metadata must be found and modified in two storage nodes (the old node and the new node), which increases management complexity, especially the complexity of garbage collection operations.
To solve the above problem, in this embodiment the number of partitions contained in a metadata partition group is set to be greater than or equal to the number of partitions contained in a data partition group. In other words, the metadata corresponding to one metadata partition group is more than or equal to the metadata that describes the data corresponding to one data partition group. For example, each metadata partition group contains 64 partitions, and each data partition group contains 32 partitions. As shown in FIG. 4, metadata partition group 1 contains partition 0 to partition 63, data partition group 1 contains partition 0 to partition 31, and data partition group 2 contains partition 32 to partition 63. The data corresponding to data partition group 1 is stored in storage unit 1, and the data corresponding to data partition group 2 is stored in storage unit 2. Before capacity expansion, metadata partition group 1 in the old node points to storage unit 1 and storage unit 2. After capacity expansion, metadata partition group 1 and its corresponding metadata are migrated to the new storage node, so metadata partition group 1 in the new node points to storage unit 1 and storage unit 2; since metadata partition group 1 no longer exists in the old node, its pointing relationship is deleted (indicated by a dotted arrow). It can be seen that both storage unit 1 and storage unit 2 are referenced by only one metadata partition group, which simplifies management.
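The configuration rule above amounts to an invariant that can be checked programmatically. This is a hedged sketch with illustrative data structures (partition sets keyed by group id, matching the FIG. 4 example): every data partition group's partitions must be covered by exactly one metadata partition group, so each storage unit is referenced by a single metadata partition group only.

```python
# Metadata partition group 1 covers partitions 0-63 (as in FIG. 4).
metadata_groups = {1: set(range(0, 64))}
# Data partition group 1 covers partitions 0-31, group 2 covers 32-63.
data_groups = {1: set(range(0, 32)), 2: set(range(32, 64))}

def covering_metadata_group(data_partitions: set, metadata_groups: dict) -> int:
    """Return the id of the single metadata group covering a data group."""
    owners = [g for g, parts in metadata_groups.items() if data_partitions <= parts]
    assert len(owners) == 1, "a data group must be covered by exactly one metadata group"
    return owners[0]

# Both data groups (hence both storage units) are described by metadata group 1 only.
assert covering_metadata_group(data_groups[1], metadata_groups) == 1
assert covering_metadata_group(data_groups[2], metadata_groups) == 1
```

If the sizes were reversed (metadata groups smaller than data groups, as in FIG. 3), the subset test would fail, which is exactly the double-reference problem the configuration avoids.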
Therefore, in this embodiment, the metadata partition groups and data partition groups are configured before capacity expansion so that the number of partitions contained in a metadata partition group is greater than the number of partitions contained in a data partition group. After capacity expansion, the metadata partition group in the old node is split into at least two sub-metadata partition groups, and then at least one sub-metadata partition group and its corresponding metadata are migrated to the new node. The data partition group in the old node is then split into at least two sub-data partition groups, so that the number of partitions contained in a sub-metadata partition group is greater than or equal to the number of partitions contained in a sub-data partition group, in preparation for the next capacity expansion.
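The split-and-migrate step above can be sketched as follows (names are assumptions for illustration): the old node's 64-partition metadata group is split into two 32-partition sub-groups, and one sub-group, with its metadata, moves to the new node, while all data stays put.

```python
def split_group(partitions: list[int]) -> tuple[list[int], list[int]]:
    """Split a partition group into two equal sub-groups of consecutive partitions."""
    half = len(partitions) // 2
    return partitions[:half], partitions[half:]

old_node_meta = list(range(0, 64))        # metadata group 1: partitions 0-63
keep, migrate = split_group(old_node_meta)
new_node_meta = migrate                   # partitions 32-63 (and their metadata) move

assert keep == list(range(0, 32)) and new_node_meta == list(range(32, 64))
```

After the split, the sub-group kept by the old node still fully covers data partition group 1 (partitions 0-31), and the migrated sub-group fully covers data partition group 2 (partitions 32-63), so each storage unit remains described by metadata on exactly one node.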
A specific example is used below to describe the capacity expansion process. Referring to FIG. 5, FIG. 5 shows the distribution of metadata partition groups across the storage nodes before capacity expansion.
In this embodiment, the number of partition groups allocated to each storage node may be preset. When a storage node contains multiple processing units, in order to distribute read and write requests evenly across the processing units, the embodiments of the present invention may also set each processing unit to correspond to a certain number of partition groups, where a processing unit refers to a CPU within a node, as shown in Table 1:
Number of storage nodes | Number of processing units | Number of partition groups
3                       | 24                         | 144
4                       | 32                         | 192
5                       | 40                         | 240
6                       | 48                         | 288
7                       | 56                         | 336
8                       | 64                         | 384
9                       | 72                         | 432
10                      | 80                         | 480
11                      | 88                         | 528
12                      | 96                         | 576
13                      | 104                        | 624
14                      | 112                        | 672
15                      | 120                        | 720

Table 1
Table 1 describes the relationship among nodes, the processing units of the nodes, and partition groups. For example, if each node has 8 processing units and each processing unit is allocated 6 partition groups, then each node is allocated 48 partition groups. Suppose the storage system 100 has 3 storage nodes before capacity expansion, so it has 144 partition groups. As described above, the total number of partitions is configured when the storage system 100 is initialized; for example, the total is 4096 partitions. If the 4096 partitions were distributed evenly among the 144 partition groups, each partition group would need 4096/144 = 28.44 partitions. However, the number of partitions contained in each partition group must be an integer and a power of 2 (2^N, where N is an integer greater than or equal to 0). Therefore, the 4096 partitions cannot be distributed absolutely evenly among the 144 partition groups. What can be determined is that 28.44 is less than 32 (2 to the 5th power) and greater than 16 (2 to the 4th power). Therefore, among the 144 partition groups, X first partition groups contain 32 partitions each and Y second partition groups contain 16 partitions each, where X and Y satisfy the equations 32X + 16Y = 4096 and X + Y = 144.
Solving these two equations gives X = 112 and Y = 32. This means that among the 144 partition groups there are 112 first partition groups and 32 second partition groups, where each first partition group contains 32 partitions and each second partition group contains 16 partitions. The number of first partition groups configured per processing unit is then calculated from the total number of first partition groups and the total number of processing units (112/(3*8) = 4 remainder 16), and the number of second partition groups configured per processing unit is calculated from the total number of second partition groups and the total number of processing units (32/(3*8) = 1 remainder 8). It follows that each processing unit is configured with at least 4 first partition groups and at least 1 second partition group, and the remaining 16 first partition groups and 8 second partition groups are distributed as evenly as possible across the 3 nodes (as shown in FIG. 5).
Please refer to FIG. 6, which shows the distribution of metadata partition groups on each storage node after capacity expansion. Assume that 2 storage nodes are newly added, so the storage system 100 now has 5 storage nodes. According to Table 1, the 5 storage nodes have 40 processing units in total, and each processing unit is configured with 6 partition groups, so the 5 storage nodes hold 240 partition groups in total. The total number of partitions is still 4096. If the 4096 partitions were distributed evenly among the 240 partition groups, each partition group would need to hold 4096/240 ≈ 17.07 partitions. However, the number of partitions contained in each partition group must be an integer and a power of 2 (2^N, where N is an integer greater than or equal to 0). Therefore, the 4096 partitions cannot be distributed absolutely evenly among the 240 partition groups. What can be determined is that 17.07 is less than 32 (2 to the 5th power) and greater than 16 (2 to the 4th power), so among these 240 partition groups, X first partition groups contain 32 partitions each and Y second partition groups contain 16 partitions each, where X and Y satisfy the following equations: 32X + 16Y = 4096 and X + Y = 240.
Solving the two equations above gives X = 16 and Y = 224. This means that among the 240 partition groups there are 16 first partition groups and 224 second partition groups, where each first partition group contains 32 partitions and each second partition group contains 16 partitions. The number of first partition groups configured per processing unit is then calculated from the total number of first partition groups and the total number of processing units (16/(5×8) = 0 with a remainder of 16), and the number of second partition groups per processing unit is calculated from the total number of second partition groups and the total number of processing units (224/(5×8) = 5 with a remainder of 24). It follows that only 16 processing units are configured with a first partition group, each processing unit is configured with at least 5 second partition groups, and the remaining 24 second partition groups are distributed as evenly as possible across the 5 nodes (as shown in FIG. 6).
Based on the partition layout of the 3 nodes before expansion and the partition layout of the 5 nodes after expansion, some of the first partition groups on the 3 pre-expansion nodes can each be split into two second partition groups, and then, following the per-node partition distribution shown in FIG. 6, some first partition groups and second partition groups are migrated from the 3 nodes to node 4 and node 5. For example, as shown in FIG. 5, the storage system 100 has 112 first partition groups before expansion but only 16 after expansion, so 96 of the 112 first partition groups need to be split. The 96 first partition groups split into 192 second partition groups, so after the split the 3 nodes hold 16 first partition groups and 224 second partition groups in total; each node then migrates some of its first partition groups and second partition groups to node 4 and node 5. Taking processing unit 1 of node 1 as an example: as shown in FIG. 5, before expansion processing unit 1 is configured with 4 first partition groups and 3 second partition groups, and as shown in FIG. 6, after expansion processing unit 1 is configured with 1 first partition group and 5 second partition groups. This means that 3 first partition groups in processing unit 1 need to be migrated out, either directly or after being split into second partition groups. As for how many of these 3 first partition groups are migrated directly to the newly added nodes, and how many are split first and then migrated, this embodiment imposes no restriction, as long as the partition distribution shown in FIG. 6 is satisfied after the migration. The processing units of the remaining nodes migrate and split along the same lines.
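The rebalancing arithmetic above can be summarized in a small sketch (hypothetical helper; each 32-partition first group that must disappear yields two 16-partition second groups):

```python
def split_plan(first_before, first_after, second_before):
    """Number of 32-partition first groups to split, and the resulting
    total of 16-partition second groups (each split yields two)."""
    to_split = first_before - first_after
    return to_split, second_before + 2 * to_split

# FIG. 5 -> FIG. 6 example: 112 first groups before, 16 after,
# 32 second groups before the expansion.
print(split_plan(112, 16, 32))   # (96, 224)
```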
In the example above, the 3 pre-expansion storage nodes first split some first partition groups into second partition groups and then migrate them to the new nodes. In another implementation, some first partition groups may instead be migrated to the new nodes first and split afterwards; the partition distribution shown in FIG. 6 can be reached either way.
It should be noted that the description above and the example of FIG. 5 concern metadata partition groups. For data partition groups, however, the number of data partitions contained in each data partition group needs to be smaller than the number of metadata partitions contained in the metadata partition group. Therefore, after the migration is complete, the data partition group needs to be split, and the number of data partitions contained in each sub-data partition group after the split must be smaller than the number of metadata partitions contained in the sub-metadata partition group. The purpose of the split is to ensure that the metadata corresponding to the current metadata partition group always contains the metadata describing the data corresponding to the current data partition group. Since in the example above some metadata partition groups contain 32 metadata partitions and some contain 16, the number of data partitions contained in a sub-data partition group obtained after the split may be 16, 8, 4, or 2; in any case it cannot exceed 16.
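As a sketch of the size constraint just described, the admissible sub-data partition group sizes are the powers of two from the smallest metadata partition group size (16) down to 2 (illustrative helper, not prescribed by the patent):

```python
def sub_data_group_sizes(max_size=16):
    """Admissible power-of-two sizes for a sub-data partition group,
    from the smallest metadata partition group size down to 2."""
    sizes = []
    size = max_size
    while size >= 2:
        sizes.append(size)
        size //= 2
    return sizes

print(sub_data_group_sizes())   # [16, 8, 4, 2]
```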
4. Garbage Collection
When the storage system 100 contains a large amount of garbage data, garbage collection may be started. In this embodiment, garbage collection is performed in units of storage units. A storage unit is selected as the object of garbage collection, the valid data in this storage unit is migrated to a new storage unit, and the storage space occupied by the original storage unit is then released. The selected storage unit needs to meet certain conditions, for example: the garbage data contained in the storage unit reaches a first set threshold; or the storage unit is the one containing the most garbage data among the multiple storage units; or the valid data contained in the storage unit is below a second set threshold; or the storage unit is the one containing the least valid data among the multiple storage units. For convenience of description, this embodiment refers to the storage unit selected for garbage collection as the first storage unit, or storage unit 1.
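The selection conditions above can be sketched as a simple policy. The data model (a map from unit id to garbage and valid byte counts) and the threshold parameter are assumptions for illustration, not the patent's prescribed interface:

```python
def pick_gc_candidate(units, min_garbage=None):
    """Pick the storage unit with the most garbage bytes; if a
    threshold is given, return None when no unit is dirty enough.
    `units` maps unit id -> (garbage_bytes, valid_bytes)."""
    candidate = max(units, key=lambda u: units[u][0])
    if min_garbage is not None and units[candidate][0] < min_garbage:
        return None  # no unit worth collecting yet
    return candidate

units = {"unit1": (700, 300), "unit2": (200, 800)}
print(pick_gc_candidate(units))                    # unit1
print(pick_gc_candidate(units, min_garbage=1000))  # None
```

The "least valid data" variant is the same pattern with `key=lambda u: units[u][1]` and `min`.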
With reference to FIG. 3, garbage collection on storage unit 1 is taken as an example to illustrate the conventional garbage collection method. Garbage collection is executed by the storage node to which storage unit 1 belongs (continuing with the first storage node as the example). The first storage node reads the valid data from storage unit 1 and writes it into a new storage unit. It then marks all data in storage unit 1 as invalid and sends a delete request to the storage node where each shard is located to delete the shard. Finally, the first storage node also needs to modify the metadata describing the data in storage unit 1. As can be seen from FIG. 3, both the metadata corresponding to metadata partition group 2 and the metadata corresponding to metadata partition group 1 describe the data in storage unit 1, and metadata partition group 2 and metadata partition group 1 are located on different storage nodes. Therefore, the first storage node needs to modify metadata on two storage nodes separately; during the modification, multiple read and write requests are generated between the nodes, which heavily consumes inter-node bandwidth.
With reference to FIG. 4, garbage collection on storage unit 2 is taken as an example to illustrate the garbage collection method of this embodiment of the present invention. Garbage collection is executed by the storage node to which storage unit 2 belongs (taking the second storage node as an example). The second storage node reads the valid data from storage unit 2 and writes it into a new storage unit. It then marks all data in storage unit 2 as invalid and sends a delete request to the storage node where each shard is located to delete the shard. Finally, the second storage node also needs to modify the metadata describing the data in storage unit 2. As can be seen from FIG. 4, storage unit 2 is referenced only by metadata partition group 1; that is, only the metadata corresponding to metadata partition group 1 describes the data in storage unit 2. Therefore, the second storage node only needs to send a request to the storage node where metadata partition group 1 is located to modify the metadata. Compared with the preceding example, since the second storage node only needs to modify metadata on one storage node, inter-node bandwidth is greatly saved.
The node expansion method provided in this embodiment is described below with reference to a flowchart. Please refer to FIG. 7, which is a flowchart of the node expansion method. The method is applied to the storage system shown in FIG. 1, which includes multiple first nodes. A first node is a node that already exists in the storage system before capacity expansion; for details, refer to the nodes 104 shown in FIG. 1 or FIG. 2. Each first node may perform the node expansion method according to the steps shown in FIG. 7.
S701: Configure the data partition group and the metadata partition group of the first node. The data partition group includes multiple data partitions, and the metadata partition group includes multiple metadata partitions. After configuration, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. "Subset" here has two implications: first, the metadata corresponding to the metadata partition group contains the metadata describing the data corresponding to the data partition group; second, the number of metadata partitions contained in the metadata partition group is greater than the number of data partitions contained in the data partition group. For example, the data partition group contains M data partitions: data partition 1, data partition 2, ..., data partition M. The metadata partition group contains N metadata partitions, with N greater than M: metadata partition 1, metadata partition 2, ..., metadata partition M, ..., metadata partition N. According to the foregoing description, the metadata corresponding to metadata partition 1 describes the data corresponding to data partition 1, the metadata corresponding to metadata partition 2 describes the data corresponding to data partition 2, and the metadata corresponding to metadata partition M describes the data corresponding to data partition M. Therefore, the metadata partition group contains all the metadata describing the data corresponding to the M data partitions; in addition, it also contains metadata describing data corresponding to other data partition groups.
The first node described in S701 is the old node described in the capacity expansion section. In addition, it should be noted that the first node may have one or more data partition groups and, likewise, one or more metadata partition groups.
S702: When a second node joins the storage system, split the metadata partition group into at least two sub-metadata partition groups. When the first node contains one metadata partition group, this metadata partition group needs to be split into at least two sub-metadata partition groups. When the first node contains multiple metadata partition groups, possibly only some of them need to be split, while the remaining metadata partition groups keep their original metadata partitions. Which metadata partition groups need to be split, and how, can be determined from the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout. The post-expansion layout includes the number of sub-metadata partition groups configured on each node of the storage system after the second node joins, and the number of metadata partitions contained in each sub-metadata partition group; the pre-expansion layout includes the number of metadata partition groups configured on the first node before the second node joins, and the number of metadata partitions contained in each metadata partition group. For a specific implementation, refer to the description related to FIG. 5 and FIG. 6 in the capacity expansion section.
In actual implementation, the split is a change of mapping relationships. Specifically, before the split there is a mapping between the identifier of the original metadata partition group and the identifier of each metadata partition it contains. After the split, identifiers of at least two sub-metadata partition groups are added, the mappings between the identifiers of the metadata partitions and the identifier of the original metadata partition group are deleted, and new mappings are created: between the identifiers of one part of the metadata partitions of the original group and the identifier of one sub-metadata partition group, and between the identifiers of the remaining metadata partitions and the identifier of the other sub-metadata partition group.
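Because splitting is only a change of mapping relationships, it can be sketched as pure bookkeeping on a group-id-to-partition-ids map (hypothetical identifiers; no partition data moves):

```python
def split_metadata_group(group_map, old_id, new_ids):
    """Replace one group-id -> partition-ids entry with two entries,
    each owning half of the original partition identifiers."""
    partitions = group_map.pop(old_id)         # delete the old mapping
    half = len(partitions) // 2
    group_map[new_ids[0]] = partitions[:half]  # create the two new mappings
    group_map[new_ids[1]] = partitions[half:]
    return group_map

m = {"mpg1": [0, 1, 2, 3]}   # a 4-partition group, for brevity
print(split_metadata_group(m, "mpg1", ("mpg1a", "mpg1b")))
# {'mpg1a': [0, 1], 'mpg1b': [2, 3]}
```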
S703: Migrate one sub-metadata partition group and its corresponding metadata to the second node. The second node is the new node described in the capacity expansion section.
Migrating a partition group is a change of ownership: migrating the sub-metadata partition group to the second node means modifying the correspondence between the sub-metadata partition group and the first node into a correspondence with the second node. Migrating metadata, by contrast, is an actual movement of data: migrating the metadata corresponding to the sub-metadata partition group to the second node means copying the metadata to the second node and deleting the copy retained on the first node.
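The distinction drawn above, partition-group migration as an ownership change versus metadata migration as a copy-then-delete, can be sketched as follows (hypothetical data structures, for illustration only):

```python
def migrate_group(ownership, metadata, group_id, src, dst):
    """Group migration: re-point the group's owner to dst.
    Metadata migration: copy the entries to dst, then delete at src."""
    ownership[group_id] = dst
    metadata[dst][group_id] = metadata[src].pop(group_id)
    return ownership, metadata

own = {"mpg1a": "node1"}
meta = {"node1": {"mpg1a": {"key": "physical-address"}}, "node2": {}}
own, meta = migrate_group(own, meta, "mpg1a", "node1", "node2")
print(own["mpg1a"])   # node2
print(meta["node1"])  # {}
```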
Because the data partition group and the metadata partition group of the first node were configured in S701 such that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group, even after the metadata partition group is split into at least two sub-metadata partition groups, the metadata of the data corresponding to the data partition group is still a subset of the metadata corresponding to one of the sub-metadata partition groups. Thus, after one sub-metadata partition group and its corresponding metadata are migrated to the second node, the data corresponding to the data partition group is still described by metadata stored on a single node. This avoids modifying metadata on different nodes when the data is modified, in particular when garbage collection is performed.
To ensure that at the next capacity expansion the metadata of the data corresponding to the data partition group is still a subset of the metadata corresponding to a sub-metadata partition group, S704 may further be performed after S703: split the data partition group in the first node into at least two sub-data partition groups, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to a sub-metadata partition group. The definition of splitting here is the same as in S702.
In the node expansion method provided in FIG. 7, data and its metadata are stored on the same node. In another scenario, however, data and its metadata are stored on different nodes. For a given node, although it may also include a data partition group and a metadata partition group, the metadata corresponding to the metadata partition group may not be the metadata of the data corresponding to that data partition group, but rather the metadata of data stored on other nodes. In this scenario, each first node still needs to configure its data partition groups and metadata partition groups so that, after configuration, the number of metadata partitions contained in a metadata partition group is greater than the number of data partitions contained in a data partition group. After the second node joins the storage system, each first node splits its metadata partition group as described in S702 and migrates one of the resulting sub-metadata partition groups to the second node. Since every first node has configured its data partition groups and metadata partition groups in this way, after the migration the data corresponding to a data partition group is still described by metadata stored on a single node. As a specific example, data is stored on the first node while the metadata of that data is stored on a third node. The first node then configures the data partition group corresponding to the data, and the third node configures the metadata partition group corresponding to the metadata. After configuration, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the configured metadata partition group. When the second node joins the storage system, the third node splits the metadata partition group into at least two sub-metadata partition groups and migrates the first sub-metadata partition group among them, together with its corresponding metadata, to the second node.
In addition, in the node expansion method provided in FIG. 7, the number of data partitions contained in a data partition group is smaller than the number of metadata partitions contained in a metadata partition group. In another scenario, the two numbers are equal. In that case, when the second node joins the storage system, the metadata partition groups do not need to be split; instead, some of the multiple metadata partition groups on the first node, together with their corresponding metadata, are migrated directly to the second node. This scenario likewise divides into two cases. Case 1: if the data and its metadata are stored on the same node, each first node needs to ensure that the metadata corresponding to a metadata partition group contains only the metadata of the data corresponding to the data partition group on that node. Case 2: if the data and its metadata are stored on different nodes, each first node needs to configure the number of metadata partitions contained in a metadata partition group to equal the number of data partitions contained in a data partition group. In either case, the metadata partition groups do not need to be split; some of the multiple metadata partition groups on the node and their corresponding metadata are migrated directly to the second node. However, this scenario does not apply to a node that contains only one metadata partition group.
In addition, in all the scenarios to which the node expansion method of this embodiment applies, the data partition groups and their corresponding data do not need to be migrated to the second node. If the second node receives a read request, it can find the physical address of the data to be read from the metadata it stores, and then read the data. Since the amount of metadata is far smaller than the amount of data, avoiding migrating the data to the second node greatly saves inter-node bandwidth.
This embodiment further provides a node capacity expansion apparatus. FIG. 8 is a schematic structural diagram of the apparatus, which includes a configuration module 801, a splitting module 802, and a migration module 803.
The configuration module 801 is configured to configure the data partition group and the metadata partition group of the first node in the storage system, where the data partition group contains multiple data partitions, the metadata partition group contains multiple metadata partitions, and the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. For details, refer to the description of S701 in FIG. 7.
The splitting module 802 is configured to split the metadata partition group into at least two sub-metadata partition groups when the second node joins the storage system. For details, refer to the description of S702 in FIG. 7 and the description related to FIG. 5 and FIG. 6 in the capacity expansion section.
The migration module 803 is configured to migrate one of the at least two sub-metadata partition groups and its corresponding metadata to the second node. For details, refer to the description of S703 in FIG. 7.
Optionally, the apparatus further includes an obtaining module 804 configured to obtain the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout. The post-expansion layout includes the number of sub-metadata partition groups configured on each node of the storage system after the second node joins, and the number of metadata partitions contained in each sub-metadata partition group; the pre-expansion layout includes the number of metadata partition groups configured on the first node before the second node joins, and the number of metadata partitions contained in each metadata partition group. The splitting module 802 is specifically configured to split the metadata partition group into at least two sub-metadata partition groups according to the post-expansion and pre-expansion metadata partition group layouts.
Optionally, the splitting module 802 is further configured to split the data partition group into at least two sub-data partition groups after the at least one sub-metadata partition group and its corresponding metadata are migrated to the second node, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to a sub-metadata partition group.
Optionally, the configuration module 801 is further configured to keep the data corresponding to the data partition group stored on the first node when the second node joins the storage system.
This embodiment further provides a storage node, which may be a storage array or a server. When the storage node is a storage array, it includes a storage controller and a storage medium; for the structure of the storage controller, refer to the schematic structural diagram of FIG. 9. When the storage node is a server, FIG. 9 may also be referenced. In either form, the storage node includes at least a processor 901 and a memory 902. The memory 902 stores a program 903. The processor 901, the memory 902, and a communication interface are connected through a system bus and communicate with one another.
The processor 901 is a single-core or multi-core central processing unit, or an application-specific integrated circuit, or one or more integrated circuits configured to implement the embodiments of the present invention. The memory 902 may be a high-speed RAM memory or a non-volatile memory, for example, at least one hard disk memory. The memory 902 is used to store computer-executable instructions, which may include the program 903. When the storage node runs, the processor 901 runs the program 903 to execute the method flow of S701-S704 shown in FIG. 7.
The functions of the configuration module 801, the splitting module 802, the migration module 803, and the obtaining module 804 shown in FIG. 8 may be performed by the processor 901 running the program 903, or by the processor 901 alone.
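As an illustrative sketch only (not part of the claimed embodiments), the behavior of the splitting and migration modules can be modeled as follows. All class, function, and variable names here are hypothetical; the sketch assumes each metadata partition group is simply a list of partition IDs and shows the key property of the method: metadata moves to the new node while the data itself stays on the original node.

```python
# Hypothetical model of the split-and-migrate flow: when a second node joins,
# the metadata partition group on the first node is split into sub-groups and
# one sub-group (with its metadata) is migrated, while data partitions and the
# data they hold remain where they are, so no data-migration bandwidth is used.

def split_partition_group(partitions, num_subgroups):
    """Split a list of partition IDs into contiguous sub-partition groups."""
    size = len(partitions) // num_subgroups
    return [partitions[i * size:(i + 1) * size] for i in range(num_subgroups)]

class Node:
    def __init__(self, name):
        self.name = name
        self.metadata = {}        # partition ID -> metadata blob
        self.data_partitions = [] # data partitions: never migrated here

def expand(first_node, second_node, metadata_partition_group):
    # Split the metadata partition group into two sub-metadata partition groups.
    sub_groups = split_partition_group(metadata_partition_group, 2)
    # Migrate the first sub-group and its metadata to the new node; the data
    # corresponding to the data partition group stays on the first node.
    for pid in sub_groups[0]:
        second_node.metadata[pid] = first_node.metadata.pop(pid)
    return sub_groups

# Example: metadata partitions 0-7 form one partition group on node A.
a, b = Node("A"), Node("B")
a.metadata = {pid: f"meta-{pid}" for pid in range(8)}
groups = expand(a, b, list(range(8)))
print(groups)              # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(sorted(b.metadata))  # [0, 1, 2, 3]
```

This sketches only the bookkeeping; a real system would additionally update the partition-routing tables so that requests for the migrated metadata partitions are directed to the second node.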
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, storage node, or data center to another website, computer, storage node, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a storage node or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
It should be understood that, in the embodiments of this application, terms such as "first" merely refer to objects and do not indicate an order among the corresponding objects.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division of logical functions, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a storage node, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A node capacity expansion method in a storage system, wherein the storage system comprises a first node, and the method comprises:
    configuring a data partition group and a metadata partition group of the first node, wherein the data partition group comprises multiple data partitions, the metadata partition group comprises multiple metadata partitions, and metadata of data corresponding to the data partition group is a subset of metadata corresponding to the metadata partition group;
    when a second node joins the storage system, splitting the metadata partition group into at least two sub-metadata partition groups; and
    migrating a first sub-metadata partition group of the at least two sub-metadata partition groups and metadata corresponding to the first sub-metadata partition group to the second node.
  2. The method according to claim 1, wherein before the splitting of the metadata partition group into at least two sub-metadata partition groups, the method further comprises:
    obtaining a post-expansion metadata partition group layout and a pre-expansion metadata partition group layout, wherein
    the post-expansion metadata partition group layout comprises: the number of sub-metadata partition groups configured on each node in the storage system after the second node joins the storage system, and the number of metadata partitions comprised in each sub-metadata partition group after the second node joins the storage system, and
    the pre-expansion metadata partition group layout comprises: the number of metadata partition groups configured on the first node before the second node joins the storage system, and the number of metadata partitions comprised in the metadata partition group before the second node joins the storage system; and
    wherein the splitting of the metadata partition group into at least two sub-metadata partition groups comprises:
    splitting the metadata partition group according to the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout, wherein the number of sub-metadata partition groups after the splitting is at least two.
  3. The method according to claim 1 or 2, further comprising, after the migrating of the first sub-metadata partition group and the metadata corresponding to the first sub-metadata partition group to the second node:
    splitting the data partition group into at least two sub-data partition groups, wherein metadata of data corresponding to each sub-data partition group is a subset of metadata corresponding to any one of the at least two sub-metadata partition groups.
  4. The method according to any one of claims 1 to 3, further comprising:
    after the second node joins the storage system, keeping the data corresponding to the data partition group stored in the first node.
  5. The method according to claim 1, wherein the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups.
  6. A node capacity expansion apparatus, wherein the apparatus is located in a storage system and comprises:
    a configuration module, configured to configure a data partition group and a metadata partition group of a first node in the storage system, wherein the data partition group comprises multiple data partitions, the metadata partition group comprises multiple metadata partitions, and metadata of data corresponding to the data partition group is a subset of metadata corresponding to the metadata partition group;
    a splitting module, configured to split the metadata partition group into at least two sub-metadata partition groups when a second node joins the storage system; and
    a migration module, configured to migrate a first sub-metadata partition group of the at least two sub-metadata partition groups and metadata corresponding to the first sub-metadata partition group to the second node.
  7. The apparatus according to claim 6, further comprising an obtaining module, wherein
    the obtaining module is configured to obtain a post-expansion metadata partition group layout and a pre-expansion metadata partition group layout, wherein the post-expansion metadata partition group layout comprises: the number of sub-metadata partition groups configured on each node in the storage system after the second node joins the storage system, and the number of metadata partitions comprised in each sub-metadata partition group after the second node joins the storage system; and the pre-expansion metadata partition group layout comprises: the number of metadata partition groups configured on the first node before the second node joins the storage system, and the number of metadata partitions comprised in the metadata partition group before the second node joins the storage system; and
    the splitting module is specifically configured to split the metadata partition group according to the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout, wherein the number of sub-metadata partition groups after the splitting is at least two.
  8. The apparatus according to claim 6 or 7, wherein the splitting module is further configured to, after the first sub-metadata partition group and the metadata corresponding to the first sub-metadata partition group are migrated to the second node, split the data partition group into at least two sub-data partition groups, wherein metadata of data corresponding to each sub-data partition group is a subset of metadata corresponding to any one of the at least two sub-metadata partition groups.
  9. The apparatus according to any one of claims 6 to 8, wherein the configuration module is further configured to keep the data corresponding to the data partition group stored in the first node after the second node joins the storage system.
  10. A storage node, wherein the storage node is located in a storage system, the storage node comprises a processor and a memory, the memory stores a program, and the processor runs the program to perform the method according to any one of claims 1 to 4.
  11. A storage system, comprising a first node and a third node, wherein
    the first node is configured to configure a data partition group of the first node, the data partition group comprising multiple data partitions;
    the third node is configured to configure a metadata partition group of the third node, the metadata partition group comprising multiple metadata partitions;
    metadata of data corresponding to the configured data partition group is a subset of metadata corresponding to the configured metadata partition group; and
    when a second node joins the storage system, the third node is further configured to split the metadata partition group into at least two sub-metadata partition groups, and to migrate a first sub-metadata partition group of the at least two sub-metadata partition groups and metadata corresponding to the first sub-metadata partition group to the second node.
  12. The storage system according to claim 11, wherein
    the third node is further configured to obtain a post-expansion metadata partition group layout and a pre-expansion metadata partition group layout, wherein the post-expansion metadata partition group layout comprises: the number of sub-metadata partition groups configured on each node in the storage system after the second node joins the storage system, and the number of metadata partitions comprised in each sub-metadata partition group after the second node joins the storage system; and the pre-expansion metadata partition group layout comprises: the number of metadata partition groups configured on the third node before the second node joins the storage system, and the number of metadata partitions comprised in the metadata partition group before the second node joins the storage system; and
    when the second node joins the storage system, the third node is specifically configured to split the metadata partition group according to the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout, wherein the number of sub-metadata partition groups after the splitting is at least two.
  13. The storage system according to claim 11 or 12, wherein
    the third node is further configured to, after the first sub-metadata partition group and the metadata corresponding to the first sub-metadata partition group are migrated to the second node, split the data partition group into at least two sub-data partition groups, wherein metadata of data corresponding to each sub-data partition group is a subset of metadata corresponding to any one of the at least two sub-metadata partition groups.
  14. The storage system according to any one of claims 11 to 13, wherein
    after the second node joins the storage system, the first node is further configured to keep the data corresponding to the data partition group stored in the first node.
  15. The storage system according to claim 11, wherein the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups.
PCT/CN2019/111888 2018-10-25 2019-10-18 Node expansion method in storage system and storage system WO2020083106A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19877173.5A EP3859506B1 (en) 2018-10-25 2019-10-18 Node expansion method in storage system and storage system
US17/239,194 US20210278983A1 (en) 2018-10-25 2021-04-23 Node Capacity Expansion Method in Storage System and Storage System

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811249893 2018-10-25
CN201811249893.8 2018-10-25
CN201811571426.7A CN111104057B (en) 2018-10-25 2018-12-21 Node capacity expansion method in storage system and storage system
CN201811571426.7 2018-12-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/239,194 Continuation US20210278983A1 (en) 2018-10-25 2021-04-23 Node Capacity Expansion Method in Storage System and Storage System

Publications (1)

Publication Number Publication Date
WO2020083106A1 2020-04-30


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111756828A (en) * 2020-06-19 2020-10-09 广东浪潮大数据研究有限公司 Data storage method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102739622A (en) * 2011-04-15 2012-10-17 北京兴宇中科科技开发股份有限公司 Expandable data storage system
US20130041904A1 (en) * 2011-08-10 2013-02-14 Goetz Graefe Computer indexes with multiple representations
CN103036796A (en) * 2011-09-29 2013-04-10 阿里巴巴集团控股有限公司 Method and device for updating routing information
CN103310000A (en) * 2013-06-25 2013-09-18 曙光信息产业(北京)有限公司 Metadata management method
CN104378447A (en) * 2014-12-03 2015-02-25 深圳市鼎元科技开发有限公司 Non-migration distributed storage method and non-migration distributed storage system on basis of Hash ring
CN104636286A (en) * 2015-02-06 2015-05-20 华为技术有限公司 Data access method and equipment





Legal Events

- 121 — EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 19877173; country of ref document: EP; kind code of ref document: A1)
- NENP — Non-entry into the national phase (ref country code: DE)
- ENP — Entry into the national phase (ref document number: 2019877173; country of ref document: EP; effective date: 20210426)