WO2020083106A1 - Node expansion method in storage system and storage system - Google Patents

Node expansion method in storage system and storage system

Info

Publication number
WO2020083106A1
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
node
partition group
data
partition
Prior art date
Application number
PCT/CN2019/111888
Other languages
French (fr)
Chinese (zh)
Inventor
肖建龙
王锋
王奇
王晨
谭春华
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201811571426.7A (external priority; CN111104057B)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP19877173.5A (EP3859506B1)
Publication of WO2020083106A1
Priority to US17/239,194 (US20210278983A1)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • This application relates to the storage field, and more specifically, to a node expansion method in a storage system and to a storage system.
  • Capacity expansion is required when the storage system has insufficient free space.
  • When a new node joins the storage system, the original nodes migrate a part of their partitions and the corresponding data to the new node. Data migration between storage nodes inevitably consumes bandwidth.
  • This application provides a node expansion method and a storage system that can save bandwidth between storage nodes.
  • According to a first aspect, a node expansion method in a storage system is provided. The storage system includes one or more first nodes, and each first node stores both data and the metadata of the data. According to this method, the first node configures a data partition group and a metadata partition group for the node; the data partition group includes multiple data partitions, the metadata partition group includes multiple metadata partitions, and the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group.
  • Here, "subset" means that the number of data partitions included in the data partition group is less than the number of metadata partitions included in the metadata partition group: the metadata corresponding to one part of the metadata partitions in the metadata partition group describes the data corresponding to this data partition group, while the metadata corresponding to the other part describes the data corresponding to other data partition groups.
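  • The subset relationship above can be sketched as follows. This is a minimal illustrative model, not the patent's fixed layout: the group sizes 64 and 32 are assumed, and partition groups are modeled simply as sets of partition identifiers.

```python
# A minimal model of the "subset" relationship (group sizes 64 and 32 are
# assumed for illustration; metadata partition i describes the data of the
# data partition with the same identifier i).

def make_group(start, count):
    """Model a partition group as the set of its partition identifiers."""
    return set(range(start, start + count))

metadata_group = make_group(0, 64)   # 64 metadata partitions
data_group_1 = make_group(0, 32)     # 32 data partitions
data_group_2 = make_group(32, 32)    # another 32 data partitions

# The metadata describing either data partition group is a proper subset of
# the metadata partition group: one part of the metadata partitions serves
# data_group_1, the other part serves data_group_2.
assert data_group_1 < metadata_group
assert data_group_2 < metadata_group
assert data_group_1 | data_group_2 == metadata_group
```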
  • When a second node joins the storage system, the first node splits the metadata partition group into at least two sub-metadata partition groups, and migrates a first sub-metadata partition group among the at least two sub-metadata partition groups, together with the metadata corresponding to the first sub-metadata partition group, to the second node.
  • Because the split-off sub-metadata partition group and its corresponding metadata are migrated to the second node, and the amount of metadata is much smaller than the amount of the data it describes, bandwidth between the nodes is saved compared with migrating the data to the second node as in the prior art.
  • In addition, because the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group, even after the metadata partition group is split into at least two sub-metadata partition groups during capacity expansion, it can still be ensured to a certain extent that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to a single sub-metadata partition group. After one of the sub-metadata partition groups and its corresponding metadata are migrated to the second node, the data corresponding to the data partition group is still described by metadata stored in one node, which avoids having to modify metadata on different nodes when the data is modified, especially when garbage collection is performed.
  • The first node obtains the layout of the metadata partition groups after capacity expansion and the layout of the metadata partition groups before capacity expansion.
  • The layout after capacity expansion includes, after the second node joins the storage system, the number of sub-metadata partition groups configured on each node in the storage system and the number of metadata partitions included in each sub-metadata partition group.
  • The layout before capacity expansion includes the number of metadata partition groups configured on the first node before the second node joins the storage system and the number of metadata partitions included in each metadata partition group.
  • The first node splits the metadata partition group into the at least two sub-metadata partition groups according to the layout after capacity expansion and the layout before capacity expansion.
  • Optionally, the first node also splits the data partition group into at least two sub-data partition groups, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to a sub-metadata partition group. Splitting the data partition group into smaller sub-data partition groups prepares for the next expansion, so that the metadata of the data corresponding to a sub-data partition group is always a subset of the metadata corresponding to a sub-metadata partition group.
  • When the second node joins the storage system, the first node keeps the data partition group, and the data corresponding to the data partition group continues to be stored in the first node. Since only the metadata is migrated and the data is not, and the volume of metadata is usually much smaller than the volume of the data, bandwidth between the nodes is saved.
  • The metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups.
  • According to a second aspect, a node expansion apparatus is provided for implementing the method provided in the first aspect and any of its implementations.
  • According to a third aspect, a storage node is provided for implementing the method provided in the first aspect and any of its implementations.
  • According to a fourth aspect, a computer program product for a node expansion method is provided, including a computer-readable storage medium storing program code, where the instructions included in the program code are used to perform the method described in the first aspect and any of its implementations.
  • According to a fifth aspect, a storage system is provided that includes at least a first node and a third node.
  • In this system, data and the metadata describing the data are stored in different nodes; for example, the data is stored in the first node, and the metadata of the data is stored in the third node.
  • The first node is used to configure a data partition group, where the data partition group corresponds to the data.
  • The third node is used to configure a metadata partition group, where the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the configured metadata partition group.
  • When a second node joins the storage system, the third node splits the metadata partition group into at least two sub-metadata partition groups, and migrates a first sub-metadata partition group among them, together with the metadata corresponding to the first sub-metadata partition group, to the second node.
  • Although the data and the metadata of the data are stored in different nodes, because the data partition group and the metadata partition group are configured in the same way as in the first aspect, after the migration it can still be ensured that the metadata of the data corresponding to any one data partition group is stored in one node, and there is no need to go to two nodes to obtain or modify the metadata.
  • According to a sixth aspect, a node expansion method is provided, which is applied to the storage system provided in the fifth aspect; the first node in the storage system performs the functions provided in the fifth aspect.
  • According to a seventh aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the fifth aspect and is used to perform the functions provided in the fifth aspect.
  • According to an eighth aspect, another node expansion method in a storage system is provided. The storage system includes one or more first nodes, and each first node stores both data and the metadata of the data. The first node includes at least two metadata partition groups and at least two data partition groups, where the metadata corresponding to each metadata partition group is used to describe the data corresponding to one of the data partition groups. The first node configures the metadata partition groups and the data partition groups so that the number of metadata partitions included in a metadata partition group is equal to the number of data partitions included in a data partition group.
  • When a second node joins the storage system, a first metadata partition group of the at least two metadata partition groups, together with the metadata corresponding to the first metadata partition group, is migrated to the second node, while the data corresponding to the at least two data partition groups continues to be kept in the first node.
  • In this way, the metadata of the data corresponding to any one data partition group is stored in one node, and there is no need to go to two nodes to obtain or modify the metadata.
  • According to a ninth aspect, a node expansion method is provided, which is applied to the storage system provided in the eighth aspect; the first node in the storage system performs the functions provided in the eighth aspect.
  • According to a tenth aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the eighth aspect and is used to perform the functions provided in the eighth aspect.
  • FIG. 1 is a schematic diagram of a scenario to which the technical solutions of embodiments of the present invention can be applied.
  • FIG. 2 is a schematic diagram of a storage unit according to an embodiment of the invention.
  • FIG. 3 is a schematic diagram of a metadata partition group and a data partition group provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of another metadata partition group and data partition group provided by an embodiment of the present invention.
  • FIG. 5 is a schematic layout diagram of a metadata partition before expansion provided by an embodiment of the present invention.
  • FIG. 6 is a schematic layout diagram of a metadata partition after capacity expansion provided by an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a node expansion method according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a node expansion device provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a storage node provided by an embodiment of the present invention.
  • In the embodiments, when capacity is expanded, metadata is migrated to the newly added node while the data continues to be stored in the original node. Through configuration, it is always ensured that the metadata of the data corresponding to a data partition group is a subset of the metadata corresponding to a metadata partition group, so that the data corresponding to one data partition group is described only by metadata stored in one node, thereby saving bandwidth.
  • the technical solutions of the embodiments of the present application can be applied to various storage systems.
  • the technical solutions of the embodiments of the present application are described below by taking a distributed storage system as an example, but the embodiments of the present application are not limited thereto.
  • In a distributed storage system, data is distributed and stored on multiple storage nodes, and the storage nodes share the storage load. This storage method not only improves the reliability, availability, and access efficiency of the system, but is also easy to expand.
  • the storage device is, for example, a storage server, or a combination of a storage controller and a storage medium.
  • FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of the present application can be applied.
  • a client server 101 and a storage system 100 communicate with each other.
  • the storage system 100 includes a switch 103 and a plurality of storage nodes (or “nodes” for short) 104 and the like.
  • the switch 103 is an optional device.
  • Each storage node 104 may include multiple hard disks or other types of storage media (such as solid-state hard disks or shingled magnetic recording) for storing data.
  • a distributed hash table (Distributed Hash Table, DHT) method is generally used for routing, but the embodiment of the present application does not limit this. That is to say, in the technical solutions of the embodiments of the present application, various possible routing methods in the storage system may be adopted.
  • In the distributed hash table method, the hash ring is evenly divided into several parts; each part is called a partition, and each partition corresponds to a storage space of a set size. It can be understood that the more partitions there are, the smaller the storage space corresponding to each partition, and the fewer partitions there are, the larger the storage space corresponding to each partition.
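  • As a minimal sketch of hash-based partitioning, the following maps an address onto a fixed number of partitions. The hash function (SHA-256) and the address format are illustrative assumptions; the patent does not prescribe a particular hash.

```python
import hashlib

TOTAL_PARTITIONS = 4096  # example value used later in this description

def partition_of(virtual_address: str) -> int:
    """Map a virtual address onto one of the equally sized hash-ring partitions."""
    digest = hashlib.sha256(virtual_address.encode()).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % TOTAL_PARTITIONS

# The same address always lands in the same partition.
p = partition_of("LU7:offset=1048576")
assert 0 <= p < TOTAL_PARTITIONS
assert partition_of("LU7:offset=1048576") == p
```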
  • In this embodiment, 4096 partitions are used as an example.
  • These partitions are divided into multiple partition groups, and each partition group contains the same number of partitions.
  • For example, partition group 143 includes partition 4066 to partition 4095.
  • Each partition group has its own identifier, which uniquely identifies the partition group.
  • Similarly, each partition has its own identifier, which uniquely identifies the partition.
  • The identifier may be a number, a character string, or a combination of a number and a character string.
  • Each partition group corresponds to one storage node 104.
  • Here, "corresponds" means that all data whose hash values locate them in the same partition group will be stored in the same storage node 104.
  • the client server 101 sends a write request to any storage node 104, where the write request carries the data to be written and the virtual address of the data.
  • The virtual address refers to the identifier of the logical unit (LU) to which the data is to be written and an offset; the virtual address is an address that is visible to the client server 101.
  • The storage node 104 that receives the write request performs a hash operation on the virtual address of the data to obtain a hash value, and the hash value uniquely determines a target partition. Once the target partition is determined, the partition group where the target partition is located is also determined. According to the correspondence between partition groups and storage nodes, the storage node that received the write request forwards the write request to the storage node corresponding to the partition group.
  • One partition group corresponds to one or more storage nodes.
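  • The forwarding step above can be sketched as a two-level lookup. All values here are illustrative assumptions: 32 partitions per group and a round-robin placement of partition groups on 4 nodes, not the patent's actual layout.

```python
# Toy routing table: 4096 partitions, 32 partitions per partition group,
# partition groups placed round-robin on 4 storage nodes (all values are
# illustrative, not taken from the patent).
PARTITIONS_PER_GROUP = 32
GROUP_TO_NODE = {g: g % 4 for g in range(4096 // PARTITIONS_PER_GROUP)}

def route(target_partition: int) -> int:
    """Forward a write request: partition -> partition group -> storage node."""
    group = target_partition // PARTITIONS_PER_GROUP
    return GROUP_TO_NODE[group]

assert route(0) == 0      # partition 0 -> group 0 -> node 0
assert route(100) == 3    # partition 100 -> group 3 -> node 3
```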
  • the corresponding storage node (in order to distinguish it from other storage nodes 104, referred to herein as the first storage node) writes the write request to its cache, and performs persistent storage when the conditions are met.
  • each storage node includes at least one storage unit.
  • the storage unit is a logical space, and the actual physical space still comes from multiple storage nodes.
  • FIG. 2 is a schematic structural diagram of a storage unit provided by this embodiment.
  • a storage unit is a collection containing multiple logical blocks.
  • A logical block is a spatial concept; its size is, for example, 4 MB, but is not limited to 4 MB.
  • One storage node 104 (still taking the first storage node as an example) uses or manages the storage space of the other storage nodes 104 in the storage system 100 in the form of logical blocks. Logical blocks on hard disks from different storage nodes 104 can form a logical block set.
  • The storage node 104 then divides the logical block set into a data storage unit and a check storage unit according to the set RAID (Redundant Array of Independent Disks) type.
  • The set of logical blocks containing the data storage unit and the check storage unit is called a storage unit.
  • The data storage unit includes at least two logical blocks, which are used to store data fragments.
  • The check storage unit includes at least one check logical block, which is used to store check fragments.
  • Assuming one logical block is taken from each of six storage nodes to form a logical block set, the first storage node groups the logical blocks in the set according to the RAID type (taking RAID 6 as an example): for example, logical block 1, logical block 2, logical block 3, and logical block 4 form a data storage unit, and logical block 5 and logical block 6 form a check storage unit. It can be understood that, according to the redundancy protection mechanism of RAID 6, when any two data units or check units fail, the failed units can be reconstructed from the remaining data units or check units.
  • When the data in the cache of the first storage node reaches a set threshold, the data is divided into multiple data fragments according to the set RAID type, check fragments are calculated, and the data fragments and check fragments are stored in the storage unit. These data fragments and their corresponding check fragments constitute a stripe.
  • One storage unit can store multiple stripes, and is not limited to the three stripes shown in FIG. 2. For example, when the data to be stored in the first storage node reaches 32 KB (8 KB * 4), the data is divided into 4 data fragments of 8 KB each, and 2 check fragments of 8 KB each are then calculated.
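  • The striping step can be sketched as follows. Note this is a simplification: RAID 6 computes two independent check fragments (P and Q, the latter typically via Reed-Solomon coding), while for brevity only a single XOR parity is shown here.

```python
def xor_bytes(*chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def make_stripe(data, data_frags=4):
    """Split cached data into equal data fragments plus one XOR parity fragment.
    RAID 6 actually needs two independent check fragments (P and Q); only the
    XOR parity P is sketched here."""
    size = len(data) // data_frags
    frags = [data[i * size:(i + 1) * size] for i in range(data_frags)]
    return frags, xor_bytes(*frags)

data = bytes(range(256)) * 128            # 32 KB of cached data
frags, parity = make_stripe(data)         # four 8 KB data fragments + parity
rebuilt = xor_bytes(parity, *frags[1:])   # recover a lost fragment from the rest
assert len(frags) == 4 and all(len(f) == 8192 for f in frags)
assert rebuilt == frags[0]
```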
  • The first storage node then sends each fragment to the storage node where it is located for persistent storage.
  • Logically, the data is written into the storage unit of the first storage node; physically, the data is still stored on multiple storage nodes. For each fragment, the identifier of the storage unit where it is located and its location inside the storage unit form the fragment's logical address, and the actual address of the fragment in the storage node is its physical address.
  • After the data is stored in the storage nodes, the description information of the data also needs to be stored so that the data can be found later.
  • When receiving a read request, a storage node usually finds the metadata of the data to be read according to the virtual address carried in the read request, and then further obtains the data to be read according to the metadata.
  • The metadata includes, but is not limited to: the correspondence between the logical address and the physical address of each fragment, and the correspondence between the virtual address of the data and the logical address of each fragment contained in the data.
  • The set of logical addresses of the fragments contained in the data is the logical address of the data.
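  • The two correspondences above might be modeled as follows. All field names, identifiers, and values here are illustrative assumptions; the patent does not define a concrete metadata layout.

```python
# Hypothetical shape of the two metadata mappings (field names and values
# are illustrative, not defined by the patent).

# Correspondence between a fragment's logical address and its physical address.
fragment_metadata = {
    # (storage unit id, offset in unit) -> (node, disk, physical offset)
    ("unit-1", 0):    ("node-3", "disk-7", 4096),
    ("unit-1", 8192): ("node-4", "disk-2", 65536),
}

# Correspondence between the data's virtual address and the logical
# addresses of the fragments that make up the data.
data_metadata = {
    # (LU id, offset in LU) -> list of fragment logical addresses
    ("LU7", 1048576): [("unit-1", 0), ("unit-1", 8192)],
}

# Serving a read: virtual address -> fragment logical addresses -> physical addresses.
frag_addrs = data_metadata[("LU7", 1048576)]
physical = [fragment_metadata[a] for a in frag_addrs]
assert physical[0] == ("node-3", "disk-7", 4096)
```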
  • The partition where the metadata is located is also determined according to the virtual address carried in the read request or the write request: a hash operation is performed on the virtual address to obtain a hash value, the hash value uniquely determines a target partition, the target partition group where the target partition is located is thereby determined, and the metadata to be stored is then sent to the storage node (for example, the first storage node) corresponding to the target partition group.
  • When the metadata to be stored in the first storage node reaches a set threshold (for example, 32 KB), the metadata is divided into 4 data fragments, 2 check fragments are calculated, and these fragments are then sent to multiple storage nodes.
  • the data partition and the metadata partition are independent of each other.
  • data has its own partitioning mechanism
  • metadata also has its own partitioning mechanism.
  • the total number of data partitions and the total number of metadata partitions are the same.
  • the total number of data partitions is 4096
  • the total number of metadata partitions is also 4096.
  • a partition corresponding to data is called a data partition
  • a partition corresponding to metadata is called a metadata partition.
  • the partition group corresponding to data is called a data partition group
  • the partition group corresponding to metadata is called a metadata partition group.
  • The metadata corresponding to a metadata partition is used to describe the data corresponding to the data partition with the same identifier.
  • For example, the metadata corresponding to metadata partition 1 is used to describe the data corresponding to data partition 1.
  • The metadata corresponding to metadata partition 2 is used to describe the data corresponding to data partition 2.
  • The metadata corresponding to metadata partition N is used to describe the data corresponding to data partition N, where N is an integer greater than or equal to 2.
  • the data and the metadata of the data may be stored in the same storage node, or may be stored in different storage nodes.
  • After the metadata is stored, when a storage node receives a read request, it can learn the physical address of the data to be read by reading the metadata. Specifically, when any storage node 104 receives a read request sent by the client server 101, that node performs a hash calculation on the virtual address carried in the read request to obtain a hash value, thereby obtaining the metadata partition corresponding to the hash value and the metadata partition group where it is located. Assuming that the storage unit corresponding to the metadata partition group belongs to the first storage node, the storage node 104 that received the read request forwards the read request to the first storage node. The first storage node reads the metadata of the data to be read from the storage unit; according to the metadata, it obtains from multiple storage nodes each fragment that constitutes the data to be read, verifies the fragments, aggregates them into the data to be read, and returns the data to the client server 101.
  • When capacity is expanded, the storage system 100 migrates partitions of the old storage nodes (abbreviated: old nodes) and the data corresponding to those partitions to the new nodes. For example, assuming that the storage system 100 originally had 8 storage nodes and is then expanded to 16 storage nodes, the original 8 storage nodes need to migrate half of their partitions, and the data corresponding to these partitions, to the 8 newly added storage nodes.
  • In this way, when the client server 101 sends a read request to a new node to read the data corresponding to data partition 1, although the data corresponding to data partition 1 has not been migrated to the new node, the new node can still find the physical address of the data to be read according to the metadata corresponding to metadata partition 1, and read the data from the old node.
  • In practice, partitions and their data are usually migrated in units of partition groups. If the metadata corresponding to a metadata partition group is less than the metadata used to describe the data corresponding to a data partition group, the same storage unit will be referenced by at least two metadata partition groups, which brings management inconvenience.
  • each metadata partition group in FIG. 3 contains 32 partitions, and each data partition group contains 64 partitions.
  • the data partition group 1 includes partition 0 to partition 63.
  • the data corresponding to partition 0 to partition 63 are all stored in the storage unit 1
  • the metadata partition group 1 includes partition 0 to partition 31
  • the metadata partition group 2 includes partition 32 to partition 63. It can be seen that all the partitions included in the metadata partition group 1 and the metadata partition group 2 are used to describe the data in the storage unit 1.
  • In other words, metadata partition group 1 and metadata partition group 2 both point to storage unit 1.
  • the metadata partition group 1 in the old node and its corresponding metadata are migrated to the new storage node.
  • the metadata partition group 1 in the new node points to the storage unit 1.
  • the metadata partition group 2 in the old node is not migrated, and still points to the storage unit 1.
  • the storage unit 1 is simultaneously referenced by the metadata partition group 2 in the old node and the metadata partition group 1 in the new node.
  • To avoid this, in this embodiment the number of partitions included in a metadata partition group is set to be greater than or equal to the number of partitions included in a data partition group.
  • In this way, the metadata corresponding to one metadata partition group is greater than or equal to the metadata used to describe the data corresponding to one data partition group.
  • each metadata partition group contains 64 partitions
  • each data partition group contains 32 partitions.
  • the metadata partition group 1 includes partitions 0-63
  • the data partition group 1 includes partitions 0-31
  • the data partition group 2 includes partitions 32-63.
  • the data corresponding to the data partition group 1 is stored in the storage unit 1
  • the data corresponding to the data partition group 2 is stored in the storage unit 2.
  • the metadata partition group 1 in the old node points to storage unit 1 and storage unit 2, respectively.
  • the metadata partition group 1 and its corresponding metadata are migrated to the new storage node.
  • The metadata partition group 1 in the new node points to storage unit 1 and storage unit 2, respectively. Since metadata partition group 1 no longer exists in the old node, its pointing relationship is deleted (indicated by a dotted arrow). It can be seen that storage unit 1 and storage unit 2 are each referenced by only one metadata partition group, which reduces management complexity.
  • In this embodiment, the metadata partition group and the data partition group are configured before capacity expansion, so that the number of partitions included in the metadata partition group is greater than the number of partitions included in the data partition group.
  • During capacity expansion, the metadata partition group in the old node is split into at least two sub-metadata partition groups, and then at least one sub-metadata partition group and its corresponding metadata are migrated to the new node.
  • Optionally, the data partition group in the old node is also split into at least two sub-data partition groups, so that the number of partitions included in a sub-metadata partition group remains greater than or equal to the number of partitions included in a sub-data partition group.
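  • The split step can be sketched as follows, under assumed sizes (a 64-partition metadata group and a 32-partition data group, with identifier-matched partitions as described earlier):

```python
# Sketch of splitting partition groups during expansion (sizes assumed).

def split_group(partitions, factor=2):
    """Split a partition group (a sorted list of partition IDs) into
    `factor` contiguous sub-groups of equal size."""
    size = len(partitions) // factor
    return [partitions[i * size:(i + 1) * size] for i in range(factor)]

sub_meta = split_group(list(range(64)), 2)   # two sub-metadata groups of 32
sub_data = split_group(list(range(32)), 2)   # two sub-data groups of 16

# The subset property survives the split: each sub-data group's metadata
# still falls inside a single sub-metadata group, so after one sub-metadata
# group migrates, no data partition group is described from two nodes.
assert set(sub_data[0]) <= set(sub_meta[0])
assert set(sub_data[1]) <= set(sub_meta[0])
```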
  • FIG. 5 is a distribution diagram of the metadata partition groups of each storage node before expansion.
  • The number of partition groups allocated to each storage node may be preset.
  • Alternatively, this embodiment may set each processing unit to correspond to a certain number of partition groups.
  • Here, a processing unit refers to a CPU, as shown in Table 1:
  • FIG. 6 is a distribution diagram of metadata partition groups of each storage node after capacity expansion.
  • the storage system 100 has 5 storage nodes at this time.
  • 5 storage nodes have a total of 40 processing units, and each processing unit is configured with 6 partition groups, so 5 storage nodes have a total of 240 partition groups.
  • To reach this distribution, some of the first partition groups on the three nodes that existed before expansion can each be split into two second partition groups, and then, following the per-node partition distribution shown in FIG. 6, some first partition groups and second partition groups are migrated from the 3 nodes to node 4 and node 5.
  • As shown in FIG. 5, there are 112 first partition groups in the storage system 100 before capacity expansion, and 16 first partition groups remain after capacity expansion, so 96 of the 112 first partition groups need to be split.
  • The 96 first partition groups are split into 192 second partition groups, so after the split there are 16 first partition groups and 224 second partition groups on the 3 nodes; each node then migrates part of its first partition groups and part of its second partition groups to node 4 and node 5.
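  • The counts in this example can be re-derived as follows. Note that the pre-expansion count of second partition groups (32) is inferred from the stated totals rather than given explicitly in the text:

```python
# Re-deriving the partition-group counts from the FIG. 5 / FIG. 6 example.
first_before, first_after = 112, 16
split = first_before - first_after        # 96 first partition groups are split
second_created = split * 2                # each split yields 2 second groups
second_before = 32                        # inferred: 224 - 192
second_after = second_before + second_created
assert split == 96
assert second_created == 192
assert second_after == 224
# After expansion: 5 nodes x 8 processing units x 6 partition groups each.
assert first_after + second_after == 5 * 8 * 6 == 240
```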
  • Taking processing unit 1 of node 1 as an example: according to FIG. 5, processing unit 1 is configured with 4 first partition groups and 3 second partition groups before expansion, while according to FIG. 6 it is configured with 1 first partition group and 5 second partition groups after expansion. This shows that three first partition groups in processing unit 1 need to be migrated out, or split into multiple second partition groups and then migrated out.
  • As for which partition groups are migrated or split, this embodiment does not impose any restrictions, as long as the partition distribution after migration meets that shown in FIG. 6.
  • The processing units of the remaining nodes migrate and split in the same way.
  • In the above, the three storage nodes that existed before expansion split some of their first partition groups into second partition groups and then migrate them to the new nodes. Alternatively, part of the first partition groups can be migrated to the new nodes first and split there; the partition distribution shown in FIG. 6 can also be achieved in this way.
  • The above description and the examples of FIG. 5 and FIG. 6 concern the metadata partition groups.
  • As described above, the number of data partitions contained in each data partition group needs to be smaller than the number of metadata partitions contained in the metadata partition group. Therefore, the data partition groups need to be split after the migration is completed, and the number of partitions included in each sub-data partition group after the split must be smaller than the number of metadata partitions included in the sub-metadata partition group.
  • The purpose of the split is to ensure that the metadata corresponding to the current metadata partition group always contains the metadata describing the data corresponding to the current data partition group. Since some of the metadata partition groups in the above example contain 32 metadata partitions and some contain 16, the number of data partitions contained in a sub-data partition group obtained after the split may be 16, 8, 4, or 2; in short, it cannot exceed 16.
  • garbage collection may be started.
  • garbage collection is performed in units of storage units.
• Select a storage unit as the object of garbage collection, migrate the valid data in this storage unit to a new storage unit, and then release the storage space occupied by the original storage unit.
• The selected storage unit needs to meet certain conditions, for example: the junk data contained in the storage unit reaches a first set threshold; or the storage unit is the one containing the most junk data among the plurality of storage units; or the valid data contained in the storage unit is lower than a second set threshold; or the storage unit is the one containing the least valid data among the plurality of storage units.
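The selection conditions above can be sketched as follows (a minimal illustration; the `StorageUnit` shape, thresholds, and fallback ordering are assumptions, not from the original description):

```python
# Sketch of the storage-unit selection conditions: garbage above a threshold,
# valid data below a threshold, or (fallback) the unit with the most garbage,
# which for equal-sized units is also the unit with the least valid data.

from dataclasses import dataclass

@dataclass
class StorageUnit:
    uid: int
    garbage_bytes: int
    valid_bytes: int

def select_gc_candidate(units, garbage_threshold=None, valid_threshold=None):
    """Pick a storage unit for garbage collection using any one of the
    conditions described in the text."""
    for u in units:
        # condition: junk data reaches the first set threshold
        if garbage_threshold is not None and u.garbage_bytes >= garbage_threshold:
            return u
        # condition: valid data is lower than the second set threshold
        if valid_threshold is not None and u.valid_bytes < valid_threshold:
            return u
    # fallback condition: the unit containing the most junk data
    return max(units, key=lambda u: u.garbage_bytes)

units = [StorageUnit(1, 10, 90), StorageUnit(2, 70, 30), StorageUnit(3, 40, 60)]
print(select_gc_candidate(units).uid)  # → 2 (most junk data)
```

Either threshold condition short-circuits the scan; with no thresholds, the fallback picks the unit with the most junk data.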
  • the selected storage unit for garbage collection is called a first storage unit or a storage unit 1.
• The execution subject of garbage collection is the storage node to which the storage unit 1 belongs (continuing to take the first storage node as an example).
• The first storage node reads valid data from the storage unit 1 and writes the valid data to a new storage unit. Then, it marks all data in the storage unit 1 as invalid, and sends a delete request to the storage node where each shard is located to delete the shard. Finally, the first storage node also needs to modify the metadata used to describe the data in the storage unit 1.
• As can be seen from the figure, the metadata corresponding to the metadata partition group 2 and the metadata corresponding to the metadata partition group 1 are both used to describe the data in the storage unit 1, and the metadata partition group 2 and the metadata partition group 1 are located in different storage nodes. Therefore, the first storage node needs to modify the metadata in the two storage nodes respectively. During the modification process, multiple read requests and write requests will be generated between the nodes, which seriously consumes bandwidth resources between the nodes.
  • the garbage collection method of the embodiment of the present invention will be described by taking garbage collection on the storage unit 2 as an example.
  • the execution subject of garbage collection is the storage node to which the storage unit 2 belongs (taking the second storage node as an example).
• The second storage node reads valid data from the storage unit 2 and writes the valid data into a new storage unit. Then, it marks all data in the storage unit 2 as invalid, and sends a delete request to the storage node where each shard is located to delete the shard.
• The second storage node also needs to modify the metadata describing the data in the storage unit 2. As can be seen from the figure, all of this metadata corresponds to the metadata partition group 1.
• Therefore, the second storage node only needs to send a request to the storage node where the metadata partition group 1 is located to modify the metadata. Compared with Example 1, since the second storage node only needs to modify metadata on one storage node, the bandwidth resources between the nodes are greatly saved.
• FIG. 7 is a flowchart of a node expansion method. This method is applied to the storage system shown in FIG. 1, which includes multiple first nodes.
  • the first node refers to a node that already exists in the storage system before capacity expansion. For details, refer to the node 104 shown in FIG. 1 or FIG. 2.
  • Each first node may perform the node expansion method according to the steps shown in FIG. 7.
• S701: Configure a data partition group and a metadata partition group of the first node.
  • the data partition group includes multiple data partitions
  • the metadata partition group includes multiple metadata partitions.
  • the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group.
• The subset here has two meanings: first, the metadata corresponding to the metadata partition group contains the metadata describing the data corresponding to the data partition group; second, the number of metadata partitions contained in the metadata partition group is larger than the number of data partitions contained in the data partition group.
  • the data partition group contains M data partitions, namely data partition 1, data partition 2, ... data partition M.
• The metadata partition group includes N metadata partitions, where N is greater than M, namely metadata partition 1, metadata partition 2, ..., metadata partition N.
  • the metadata corresponding to metadata partition 1 is used to describe the data corresponding to data partition 1
  • the metadata corresponding to metadata partition 2 is used to describe the data corresponding to data partition 2
• The metadata corresponding to metadata partition M is used to describe the data corresponding to data partition M. Therefore, the metadata partition group contains all the metadata describing the data corresponding to the M data partitions.
  • the metadata partition group also contains metadata describing data corresponding to other data partition groups.
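The subset relation configured in S701 can be illustrated with concrete numbers (M, N, and the one-to-one numbering are illustrative assumptions consistent with the text, not fixed by it):

```python
# Sketch of the subset relation: metadata partitions 1..N cover data
# partitions 1..M with M < N, so all metadata for the data partition group
# lives in one metadata partition group, while the remaining metadata
# partitions describe data corresponding to other data partition groups.

M, N = 4, 8
data_partitions = set(range(1, M + 1))        # data partition 1..M
metadata_partitions = set(range(1, N + 1))    # metadata partition 1..N

# metadata partition i describes data partition i, for i <= M
assert data_partitions <= metadata_partitions  # subset relation holds

# metadata partitions that describe other data partition groups
print(sorted(metadata_partitions - data_partitions))  # → [5, 6, 7, 8]
```

Because the covering set is strictly larger, the metadata partition group can later be split while one sub-group still covers all M data partitions.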
  • the first node described in S701 is the old node described in the expansion section.
• The layout of the metadata partition group after expansion includes the number of sub-metadata partition groups configured for each node in the storage system after the second node joins the storage system, and the number of metadata partitions contained in each sub-metadata partition group; the layout of the metadata partition group before expansion includes the number of metadata partition groups configured by the first node before the second node joins the storage system, and the number of metadata partitions contained in each metadata partition group. For specific implementation, refer to the description related to FIG. 5 and FIG. 6 in the capacity expansion section.
• The action of splitting, in actual implementation, refers to a change of mapping relationships. Specifically, before the split there is a mapping relationship between the identifier of the original metadata partition group and the identifier of each metadata partition contained in that group. After the split, identifiers of at least two sub-metadata partition groups are added, the mapping relationship between the identifiers of the metadata partitions contained in the original metadata partition group and the identifier of the original metadata partition group is deleted, a mapping relationship is created between the identifiers of one part of the metadata partitions contained in the original metadata partition group and the identifier of one sub-metadata partition group, and a mapping relationship is created between the identifiers of the other part of the metadata partitions and the identifier of the other sub-metadata partition group.
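The mapping change described above can be sketched as follows (a minimal illustration; the group identifiers and the half-and-half assignment policy are assumptions, not part of the original description):

```python
# Sketch of "split as mapping change": before the split, every partition ID
# maps to the original metadata partition group ID; after the split, that
# mapping is deleted and two new sub-group IDs each take part of the
# partitions.

def split_partition_group(mapping: dict, old_group: str, sub_a: str, sub_b: str):
    """mapping: metadata partition ID -> partition group ID (updated in place)."""
    partitions = sorted(p for p, g in mapping.items() if g == old_group)
    half = len(partitions) // 2
    for p in partitions[:half]:
        mapping[p] = sub_a   # one part maps to the first sub-group
    for p in partitions[half:]:
        mapping[p] = sub_b   # the other part maps to the second sub-group
    return mapping

mapping = {p: "PG0" for p in range(8)}       # 8 metadata partitions in PG0
split_partition_group(mapping, "PG0", "PG0-a", "PG0-b")
print(mapping)
```

No metadata moves in this step; only the attribution of partitions to groups changes, which is why the split itself is cheap.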
• S703: Migrate a sub-metadata partition group and its corresponding metadata to the second node.
  • the second node is the new node described in the expansion section.
• The migration of the partition group refers to a change of the attribution relationship. Specifically, migrating the sub-metadata partition group to the second node means modifying the correspondence between the sub-metadata partition group and the first node so that the sub-metadata partition group corresponds to the second node.
• The migration of metadata refers to the actual movement of data. Specifically, migrating the metadata corresponding to a sub-metadata partition group to the second node means copying the metadata to the second node and deleting the metadata retained by the first node.
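The two kinds of "migration" above can be contrasted in a short sketch (the `Node` class and all names are illustrative assumptions): moving the partition group is an attribution change, while moving the metadata copies the bytes and deletes the source copy.

```python
# Sketch of S703: migrating a sub-metadata partition group changes which node
# the group corresponds to; migrating the metadata itself copies it to the
# second node and removes the copy retained by the first node.

class Node:
    def __init__(self, name):
        self.name = name
        self.partition_groups = {}   # group ID -> metadata dict

def migrate_group(group_id, src: Node, dst: Node):
    metadata = src.partition_groups[group_id]
    # 1. Copy the metadata to the second node (actual data movement).
    dst.partition_groups[group_id] = dict(metadata)
    # 2. Delete the metadata retained by the first node; the group now
    #    corresponds to the second node (attribution change).
    del src.partition_groups[group_id]

first = Node("first")
second = Node("second")
first.partition_groups["sub-PG1"] = {"key1": "addr1"}
migrate_group("sub-PG1", first, second)
print("sub-PG1" in second.partition_groups, "sub-PG1" in first.partition_groups)
```

After the call the group and its metadata exist only on the second node, matching the copy-then-delete semantics in the text.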
• Because the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group, even if the metadata partition group is split into at least two sub-metadata partition groups, the metadata of the data corresponding to the data partition group is still a subset of the metadata corresponding to one of the sub-metadata partition groups. Then, when one of the sub-metadata partition groups and its corresponding metadata is migrated to the second node, the data corresponding to the data partition group is still described by metadata stored in a single node, which avoids modifying metadata on different nodes when the data is modified, especially when garbage collection is performed.
• After the migration, the data partition group is split into at least two sub-data partition groups, and the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to the sub-metadata partition group.
  • the definition of split here is the same as the split in S702.
• In the above scenario, the data and the metadata of the data are stored in the same node. In another scenario, the data and the metadata of the data are stored in different nodes. Then, for a given node, although it may also include a data partition group and a metadata partition group, the metadata corresponding to the metadata partition group may not be the metadata of the data corresponding to the data partition group, but rather the metadata of data stored on other nodes.
  • each first node still needs to configure the data partition group and the metadata partition group in the node.
• It is sufficient that the configured metadata partition group contains more metadata partitions than the number of data partitions contained in the data partition group.
• After the second node joins the storage system, each first node splits the metadata partition group according to the description of S702, and migrates a split sub-metadata partition group to the second node. Since each first node has configured its data partition group and metadata partition group in this way, after migration, the data corresponding to a data partition group can still be described by metadata stored in the same node. As a specific example, data is stored in the first node, and the metadata of the data is stored in the third node. Then, the first node configures the data partition group corresponding to the data, and the third node configures the metadata partition group corresponding to the metadata. After configuration, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the configured metadata partition group.
• The third node splits the metadata partition group into at least two sub-metadata partition groups, and migrates the first sub-metadata partition group of the at least two sub-metadata partition groups and the metadata corresponding to the first sub-metadata partition group to the second node.
• In the above scenario, the number of data partitions included in the data partition group is smaller than the number of metadata partitions included in the metadata partition group.
• In another scenario, the number of data partitions included in the data partition group is equal to the number of metadata partitions included in the metadata partition group. In this scenario, if the second node joins the storage system, there is no need to split the metadata partition group; instead, a part of the multiple metadata partition groups in the first node and the corresponding metadata are directly migrated to the second node. Similarly, there are two cases for this scenario.
• Case 1: if the data and the metadata of the data are stored in the same node, then each first node needs to ensure that the metadata corresponding to each metadata partition group includes only the metadata of the data corresponding to the data partition group in the node.
• Case 2: if the data and the metadata of the data are stored in different nodes, then each first node needs to configure the number of metadata partitions contained in the metadata partition group to be equal to the number of data partitions contained in the data partition group. In either case 1 or case 2, there is no need to split the metadata partition group; a part of the multiple metadata partition groups in the node and the corresponding metadata are directly migrated to the second node. However, this scenario is not suitable for nodes that contain only one metadata partition group.
  • the data partition group and its corresponding data do not need to be migrated to the second node.
• When the second node receives a read request, it can also use its stored metadata to find the physical address of the data to be read, so that the data is read. Since the data volume of metadata is much smaller than that of data, avoiding migrating the data to the second node greatly saves the bandwidth between the nodes.
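The read path just described can be sketched as follows (a minimal illustration; the key, address, and store structures are invented for the sketch and are not from the original description):

```python
# Sketch of the post-expansion read path: the second node holds only migrated
# metadata, which resolves a key to a physical address on the node that still
# stores the data, so the data itself never had to be migrated.

data_store_on_first_node = {"addr-42": b"hello"}      # data kept on first node
metadata_on_second_node = {"file.txt": "addr-42"}     # migrated metadata

def handle_read(key: str) -> bytes:
    # The second node looks up the physical address from its metadata...
    address = metadata_on_second_node[key]
    # ...and the data is then read from the first node, which still stores it.
    return data_store_on_first_node[address]

print(handle_read("file.txt"))  # → b'hello'
```

Only the small metadata table crossed the network during expansion; the bulk data stayed in place.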
  • FIG. 8 is a schematic structural diagram of the node capacity expansion device.
  • the device includes a configuration module 801, a split module 802, and a migration module 803.
  • the configuration module 801 is configured to configure a data partition group and a metadata partition group of the first node in the storage system.
  • the data partition group includes multiple data partitions
  • the metadata partition group includes multiple metadata partitions.
  • the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. Specifically, reference may be made to the description of S701 shown in FIG. 7.
  • the splitting module 802 is configured to split the metadata partition group into at least two sub-metadata partition groups when the second node joins the storage system. Specifically, please refer to the description of S702 shown in FIG. 7 and the description related to FIG. 5 and FIG. 6 in the capacity expansion part.
  • the migration module 803 is configured to migrate one sub-metadata partition group and corresponding metadata among the at least two sub-metadata partition groups to the second node. Specifically, reference may be made to the description of S703 shown in FIG. 7.
• The apparatus further includes an obtaining module 804, configured to obtain the layout of the metadata partition group after expansion and the layout of the metadata partition group before expansion, where the layout of the metadata partition group after expansion includes the number of sub-metadata partition groups configured for each node in the storage system after the second node joins the storage system, and the number of metadata partitions included in each sub-metadata partition group.
• The layout of the metadata partition group before capacity expansion includes the number of metadata partition groups configured by the first node before the second node joins the storage system, and the number of metadata partitions included in each metadata partition group.
  • the splitting module 802 is specifically configured to split the metadata partition group into at least two sub-metadata partition groups according to the layout of the metadata partition group after capacity expansion and the layout of the metadata partition group before capacity expansion.
• The splitting module 802 is further configured to split the data partition group into at least two sub-data partition groups after the at least one sub-metadata partition group and its corresponding metadata are migrated to the second node, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to the sub-metadata partition group.
  • the configuration module 801 is further configured to keep the data corresponding to the data partition group to continue to be stored in the first node when the second node joins the storage system.
  • the storage node may be a storage array or a server.
  • the storage node includes a storage controller and a storage medium.
• When the storage node is a server, reference may also be made to the schematic structural diagram of FIG. 9. No matter what kind of device the storage node is, it includes at least the processor 901 and the memory 902.
  • a program 903 is stored in the memory 902.
  • the processor 901, the memory 902 and the communication interface are connected through a system bus and complete communication with each other.
• The processor 901 is a single-core or multi-core central processing unit, or an application-specific integrated circuit, or one or more integrated circuits configured to implement the embodiments of the present invention.
• The memory 902 may be a high-speed RAM memory or a non-volatile memory, for example, at least one hard disk memory.
• The memory 902 is used to store computer-executable instructions.
• The program 903 may be included in the computer-executable instructions. When the storage node is running, the processor 901 runs the program 903 to execute the method flow of S701-S704 shown in FIG. 7.
  • the functions of the configuration module 801, the split module 802, the migration module 803, and the acquisition module 804 shown in FIG. 8 described above may be executed by the processor 901 running the program 903, or executed by the processor 901 alone.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
• The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, storage node, or data center to another website, computer, storage node, or data center via wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
• The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a storage node or a data center, that integrates one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, Solid State Disk (SSD)), or the like.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a division of logical functions.
• In actual implementation, there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
• If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
• The technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to enable a computer device (which may be a personal computer, a storage node, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application.
• The foregoing storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

Disclosed are a node expansion method in a storage system and a storage system. The storage system comprises a first node configured with a data partition group and a metadata partition group. The data partition group comprises a plurality of data partitions, and the metadata partition group comprises a plurality of metadata partitions. The metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. When a second node joins the storage system, the first node splits the metadata partition group into at least two sub-metadata partition groups, and a first sub-metadata partition group and the metadata corresponding thereto are migrated to the second node. The bandwidth overhead between storage nodes can thereby be reduced.

Description

Node expansion method in a storage system and storage system

Technical Field

This application relates to the storage field, and more specifically, to a node expansion method in a storage system and a storage system.

Background Art
In a distributed storage system scenario, capacity expansion is required when the free space of the storage system is insufficient. When a new node joins the storage system, the original nodes migrate a part of their partitions and the data corresponding to those partitions to the new node. Data migration between storage nodes inevitably consumes bandwidth.

Summary of the Invention

This application provides a node expansion method in a storage system and a storage system, which can save bandwidth between storage nodes.
In a first aspect, a node expansion method in a storage system is provided. The storage system includes one or more first nodes. Each first node stores both data and the metadata of that data. According to this method, the first node configures a data partition group and a metadata partition group of the node, where the data partition group includes multiple data partitions, the metadata partition group includes multiple metadata partitions, and the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. The meaning of the subset is that the number of data partitions included in the data partition group is smaller than the number of metadata partitions included in the metadata partition group, and the metadata corresponding to one part of the metadata partitions included in the metadata partition group is used to describe the data corresponding to the data partition group, while the metadata corresponding to another part of the metadata partitions is used to describe the data corresponding to other data partition groups. When a second node joins the storage system, the first node splits the metadata partition group into at least two sub-metadata partition groups, and migrates a first sub-metadata partition group of the at least two sub-metadata partition groups and the metadata corresponding to the first sub-metadata partition group to the second node.

According to the method provided in the first aspect, when the second node joins, the sub-metadata partition group obtained by splitting on the first node and its corresponding metadata are migrated to the second node. Since the data volume of metadata is much smaller than that of data, compared with migrating data to the second node as in the prior art, the bandwidth between the nodes is saved.

In addition, because the data partition group and the metadata partition group of the first node are configured such that the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group, even if the metadata partition group is split into at least two sub-metadata partition groups after expansion, it can still be guaranteed to a certain extent that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to one of the sub-metadata partition groups. After one of the sub-metadata partition groups and its corresponding metadata is migrated to the second node, the data corresponding to the data partition group is still described by metadata stored in a single node, which avoids modifying metadata on different nodes when the data is modified, especially when garbage collection is performed.
With reference to the first implementation of the first aspect, in a second implementation, the first node obtains the layout of the metadata partition group after expansion and the layout of the metadata partition group before expansion. The layout of the metadata partition group after expansion includes the number of sub-metadata partition groups configured for each node in the storage system after the second node joins the storage system, and the number of metadata partitions included in each sub-metadata partition group; the layout of the metadata partition group before expansion includes the number of metadata partition groups configured by the first node before the second node joins the storage system, and the number of metadata partitions included in each metadata partition group. The first node splits the metadata partition group into at least two sub-metadata partition groups according to the layout of the metadata partition group after expansion and the layout of the metadata partition group before expansion.

With reference to any one of the above implementations of the first aspect, in a third implementation, after the migration, the first node splits the data partition group into at least two sub-data partition groups, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to the sub-metadata partition group. Splitting the data partition group into smaller-granularity sub-data partition groups prepares for the next expansion, so that the metadata of the data corresponding to the sub-data partition group is always a subset of the metadata corresponding to the sub-metadata partition group.

With reference to any one of the above implementations of the first aspect, in a fourth implementation, when the second node joins the storage system, the first node keeps the data partition group and the data corresponding to the data partition group stored in the first node. Since only the metadata is migrated and the data is not, and the data volume of metadata is usually far smaller than that of data, the bandwidth between the nodes is saved.

With reference to the first implementation of the first aspect, in a fifth implementation, it is further clarified that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups. This ensures that the data corresponding to the data partition group is still described by metadata stored in a single node, which avoids modifying metadata on different nodes when the data is modified, especially when garbage collection is performed.
In a second aspect, a node expansion apparatus is provided for implementing the method provided by the first aspect and any one of its implementations.

In a third aspect, a storage node is provided for implementing the method provided by the first aspect and any one of its implementations.

In a fourth aspect, a computer program product of a node expansion method is provided, including a computer-readable storage medium storing program code, where the instructions included in the program code are used to perform the method described in the first aspect and any one of its implementations.
第五方面,提供了一种存储系统,所述存储系统至少包括第一节点和第三节点。并且,在所述存储系统中,数据和描述该数据的元数据分别存储在不同的节点中,例如数据存储在所述第一节点中,所述数据的元数据存储在所述第二节点中。第一节点,用于配置数据分区组,所述数据分区组对应所述数据,第三节点用于配置元数据分区组,所述配置后的数据分区组对应的数据的元数据是所述配置后的元数据分区组对应的元数据的子集。当第二节点加入所述存储系统时,所述第三节点将所述元数据分区组分裂成至少两个子元数据分区组,将所述至少两个子元数据分区组中的第一子元数据分区组及其所述第一子元数据分区组对应的元数据迁移至所述第二节点。According to a fifth aspect, a storage system is provided. The storage system includes at least a first node and a third node. Moreover, in the storage system, data and metadata describing the data are stored in different nodes, for example, data is stored in the first node, and metadata of the data is stored in the second node . The first node is used to configure a data partition group, the data partition group corresponds to the data, and the third node is used to configure a metadata partition group, and the metadata of the data corresponding to the configured data partition group is the configuration A subset of metadata corresponding to the subsequent metadata partition group. When the second node joins the storage system, the third node splits the metadata partition group into at least two sub-metadata partition groups, and divides the first sub-metadata among the at least two sub-metadata partition groups The partition group and the metadata corresponding to the first sub-metadata partition group are migrated to the second node.
在第五方面提供的存储系统中,虽然数据和所述数据的元数据存储在不同节点中,但由于对其数据分区组和元数据分区组进行了与第一方面相同的配置,在迁移以后仍然能满足任意一个数据分区组对应的数据的元数据均存储在一个节点中,不需要到两个节点去获取或修改元数据。In the storage system provided in the fifth aspect, although the data and the metadata of the data are stored in different nodes, since the data partition group and the metadata partition group are configured in the same way as the first aspect, after the migration The metadata that can still satisfy the data corresponding to any one data partition group is stored in one node, and there is no need to go to two nodes to obtain or modify the metadata.
According to a sixth aspect, a node capacity expansion method is provided, which is applied to the storage system provided in the fifth aspect, where the first node in the storage system performs the functions provided in the fifth aspect.
According to a seventh aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the fifth aspect and is configured to perform the functions provided in the fifth aspect.
According to an eighth aspect, a node capacity expansion method in a storage system is provided. The storage system includes one or more first nodes. Each first node stores both data and the metadata of that data. The first node includes at least two metadata partition groups and at least two data partition groups, where the metadata corresponding to each metadata partition group describes the data corresponding to one of the data partition groups. The first node configures the metadata partition groups and the data partition groups so that the number of metadata partitions included in a metadata partition group is equal to the number of data partitions included in a data partition group. When a second node joins the storage system, a first metadata partition group of the at least two metadata partition groups, together with the metadata corresponding to the first metadata partition group, is migrated to the second node, while the data corresponding to the at least two data partition groups remains in the first node.
In the storage system provided in the eighth aspect, after the migration, the metadata of the data corresponding to any one data partition group is still stored in a single node, and there is no need to visit two nodes to obtain or modify the metadata.
According to a ninth aspect, a node capacity expansion method is provided, which is applied to the storage system provided in the eighth aspect, where the first node in the storage system performs the functions provided in the eighth aspect.
According to a tenth aspect, a node capacity expansion apparatus is provided, which is located in the storage system provided in the fifth aspect and is configured to perform the functions provided in the eighth aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of the present invention can be applied.
FIG. 2 is a schematic diagram of a storage unit according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a metadata partition group and a data partition group according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of another metadata partition group and data partition group according to an embodiment of the present invention.
FIG. 5 is a schematic layout diagram of metadata partitions before capacity expansion according to an embodiment of the present invention.
FIG. 6 is a schematic layout diagram of metadata partitions after capacity expansion according to an embodiment of the present invention.
FIG. 7 is a schematic flowchart of a node capacity expansion method according to an embodiment of the present invention.
FIG. 8 is a schematic structural diagram of a node capacity expansion apparatus according to an embodiment of the present invention.
FIG. 9 is a schematic structural diagram of a storage node according to an embodiment of the present invention.
DETAILED DESCRIPTION
In the embodiments of this application, metadata is migrated to the newly added node during capacity expansion, while the data remains stored in the original nodes. Through configuration, it is always ensured that the metadata of the data corresponding to a data partition group is a subset of the metadata corresponding to a metadata partition group, so that the data corresponding to one data partition group is described only by metadata stored in a single node. This achieves the purpose of saving bandwidth. The technical solutions in this application are described below with reference to the accompanying drawings.
The technical solutions of the embodiments of this application can be applied to various storage systems. In the following, the technical solutions of the embodiments of this application are described by taking a distributed storage system as an example, but the embodiments of this application are not limited thereto. In a distributed storage system, data is distributed across multiple storage nodes, which share the storage load. This storage method not only improves the reliability, availability, and access efficiency of the system, but is also easy to expand. A storage device is, for example, a storage server, or a combination of a storage controller and a storage medium.
FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of this application can be applied. As shown in FIG. 1, a client server 101 communicates with a storage system 100. The storage system 100 includes a switch 103, multiple storage nodes (or "nodes" for short) 104, and the like, where the switch 103 is an optional device. Each storage node 104 may include multiple hard disks or other types of storage media (for example, solid-state drives or shingled magnetic recording drives) for storing data. The embodiments of this application are described below in four parts.
1. The process of storing data
To ensure that data is stored evenly across the storage nodes 104, a distributed hash table (DHT) method is generally used for routing when a storage node is selected, but the embodiments of this application are not limited thereto; any routing method possible in a storage system may be used in the technical solutions of the embodiments of this application. In the distributed hash table method, the hash ring is evenly divided into several parts, each part is called a partition, and each partition corresponds to a storage space of a set size. It can be understood that the more partitions there are, the smaller the storage space corresponding to each partition; the fewer partitions there are, the larger the storage space corresponding to each partition. In practical applications, the number of partitions is often large (this embodiment takes 4096 partitions as an example). For ease of management, these partitions are divided into multiple partition groups, each containing the same number of partitions. When an absolutely even division is impossible, it suffices that the number of partitions in each partition group is basically the same. For example, the 4096 partitions are divided into 144 partition groups, where partition group 0 contains partition 0 to partition 27, partition group 1 contains partition 28 to partition 57, ..., and partition group 143 contains partition 4066 to partition 4095. A partition group has its own identifier, which uniquely identifies the partition group; likewise, a partition has its own identifier, which uniquely identifies the partition. The identifier may be a number, a character string, or a combination of a number and a character string. In this embodiment, each partition group corresponds to one storage node 104, where "corresponds" means that all data located by a hash value to the same partition group is stored in the same storage node 104.
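The even division of partitions into partition groups can be sketched as follows. This is an illustrative sketch, not code from the patent: it divides a fixed total number of partitions into consecutive groups whose sizes differ by at most one (the exact group boundaries in the embodiment above may differ slightly).

```python
def divide_into_groups(total_partitions: int, group_count: int) -> list[range]:
    """Return one consecutive partition range per group; sizes differ by at most 1."""
    base, extra = divmod(total_partitions, group_count)
    groups, start = [], 0
    for g in range(group_count):
        size = base + (1 if g < extra else 0)  # the first `extra` groups get one more
        groups.append(range(start, start + size))
        start += size
    return groups

groups = divide_into_groups(4096, 144)
assert sum(len(g) for g in groups) == 4096
assert max(len(g) for g in groups) - min(len(g) for g in groups) <= 1
```

Because 4096 is not divisible by 144, some groups hold 29 partitions and the rest 28, which is the "basically the same" situation described above.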
The client server 101 sends a write request to any storage node 104, where the write request carries the data to be written and the virtual address of the data. The virtual address refers to the identifier and offset of the logical unit (LU) to which the data is to be written, and is an address visible to the client server 101. The storage node 104 that receives the write request performs a hash operation on the virtual address of the data to obtain a hash value, and the hash value uniquely determines a target partition. Once the target partition is determined, the partition group to which the target partition belongs is also determined. According to the correspondence between partition groups and storage nodes, the storage node that received the write request can forward the write request to the storage node corresponding to the partition group. One partition group corresponds to one or more storage nodes. The corresponding storage node (referred to here as the first storage node, to distinguish it from the other storage nodes 104) writes the write request into its cache, and performs persistent storage when the conditions are met.
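The routing step described above can be sketched as follows. All names here (the hash function choice, the group-to-node map, the equal-sized groups) are illustrative assumptions, not details from the patent: the virtual address is hashed to a partition, the partition maps to a partition group, and the group maps to the storage node that handles the request.

```python
import hashlib

TOTAL_PARTITIONS = 4096
PARTITIONS_PER_GROUP = 28      # simplified: assume equal-sized partition groups
NODE_OF_GROUP = {}             # group id -> node id, maintained by the storage system

def route(virtual_address: str) -> int:
    """Map a virtual address (LU id + offset) to a storage node id."""
    digest = hashlib.sha256(virtual_address.encode()).digest()
    partition = int.from_bytes(digest[:8], "big") % TOTAL_PARTITIONS
    group = partition // PARTITIONS_PER_GROUP
    return NODE_OF_GROUP.get(group, group % 3)  # fallback mapping for a 3-node system

# The same virtual address always routes to the same partition, group, and node,
# so reads can later locate the data written by this request.
assert route("LUN7:offset4096") == route("LUN7:offset4096")
```

The key property is determinism: any node receiving a request can compute the same target node from the virtual address alone, without consulting the node that originally stored the data.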
In this embodiment, each storage node contains at least one storage unit. A storage unit is a segment of logical space whose actual physical space still comes from multiple storage nodes. FIG. 2 is a schematic structural diagram of a storage unit provided in this embodiment. A storage unit is a set of multiple logical blocks. A logical block is a concept of space; its size is 4 MB as an example, but is not limited to 4 MB. A storage node 104 (still taking the first storage node as an example) uses or manages the storage space of the other storage nodes 104 in the storage system 100 in the form of logical blocks. Logical blocks on hard disks from different storage nodes 104 can form a logical block set, and the storage node 104 then divides this logical block set into a data storage unit and a check storage unit according to a set redundant array of independent disks (RAID) type. The data storage unit includes at least two logical blocks for storing data shards, and the check storage unit includes at least one check logical block for storing check shards. A logical block set containing a data storage unit and a check storage unit is called a storage unit. Suppose one logical block is taken from each of six storage nodes to form a logical block set, and the first storage node then groups the logical blocks in this set according to the RAID type (taking RAID 6 as an example): for example, logical block 1, logical block 2, logical block 3, and logical block 4 form the data storage unit, and logical block 5 and logical block 6 form the check storage unit. It can be understood that, under the redundancy protection mechanism of RAID 6, when any two data units or check units fail, the failed units can be reconstructed from the remaining data units and check units.
When the data in the cache of the first storage node reaches a set threshold, the data can be divided into multiple data shards according to the set RAID type, check shards are computed, and the data shards and check shards are stored in the storage unit. These data shards and the corresponding check shards constitute a stripe. One storage unit can store multiple stripes and is not limited to the three stripes shown in FIG. 2. For example, when the data to be stored in the first storage node reaches 32 KB (8 KB * 4), the data is divided into 4 data shards of 8 KB each, and 2 check shards of 8 KB each are computed. The first storage node then sends each shard to the storage node where it belongs for persistent storage. Logically, the data is written into a storage unit of the first storage node; physically, the data is ultimately still stored in multiple storage nodes. For each shard, the identifier of the storage unit where it resides and its location within that storage unit form the logical address of the shard, and the actual address of the shard in the storage node is its physical address.
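The striping step can be sketched as follows. This is a minimal illustration under stated assumptions: real RAID 6 derives two independent check shards (e.g. Reed-Solomon P and Q parities); a single XOR parity is shown here only to keep the example self-contained, and the sizes match the 32 KB example above.

```python
SHARD_SIZE = 8 * 1024   # 8 KB per shard, as in the example above
DATA_SHARDS = 4

def make_stripe(data: bytes) -> list[bytes]:
    """Cut 32 KB of cached data into 4 data shards plus one XOR check shard.

    (RAID 6 as described in the text would compute a second, independent
    check shard as well.)
    """
    assert len(data) == SHARD_SIZE * DATA_SHARDS  # the 32 KB threshold is reached
    shards = [data[i * SHARD_SIZE:(i + 1) * SHARD_SIZE] for i in range(DATA_SHARDS)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*shards))
    return shards + [parity]

stripe = make_stripe(bytes(32 * 1024))
assert len(stripe) == 5 and all(len(s) == SHARD_SIZE for s in stripe)
```

Each shard in the returned stripe would then be sent to a different storage node for persistent storage, so that the loss of one node loses at most one shard of the stripe.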
2. The process of storing metadata
After data is stored in the storage nodes, description information about that data also needs to be stored so that the data can be found later; this description information is called metadata. When a storage node receives a read request, it usually finds the metadata of the data to be read according to the virtual address carried in the read request, and then obtains the data to be read according to the metadata. The metadata includes, but is not limited to: the correspondence between the logical address and the physical address of each shard, and the correspondence between the virtual address of the data and the logical addresses of the shards that make up the data. The set of logical addresses of the shards contained in the data is the logical address of the data.
Similar to the process of storing data, the partition where metadata resides is also determined according to the virtual address carried in a read request or write request. Specifically, a hash operation is performed on the virtual address to obtain a hash value, which uniquely determines a target partition and thus the target partition group to which the target partition belongs; the metadata to be stored is then sent to the storage node corresponding to the target partition group (for example, the first storage node). When the metadata to be stored in the first storage node reaches a set threshold (for example, 32 KB), the metadata is divided into 4 data shards, 2 check shards are computed, and these shards are sent to multiple storage nodes.
In this embodiment, data partitions and metadata partitions are independent of each other. In other words, data has its own partitioning mechanism, and metadata also has its own partitioning mechanism, but the total number of data partitions is the same as the total number of metadata partitions. For example, if the total number of data partitions is 4096, the total number of metadata partitions is also 4096. For ease of description, in the embodiments of the present invention a partition corresponding to data is called a data partition, and a partition corresponding to metadata is called a metadata partition; a partition group corresponding to data is called a data partition group, and a partition group corresponding to metadata is called a metadata partition group. Since both the metadata partition and the data partition are determined according to the virtual address carried in a read or write request, the metadata corresponding to a metadata partition describes the data corresponding to the data partition with the same identifier. For example, the metadata corresponding to metadata partition 1 describes the data corresponding to data partition 1, the metadata corresponding to metadata partition 2 describes the data corresponding to data partition 2, and the metadata corresponding to metadata partition N describes the data corresponding to data partition N, where N is an integer greater than or equal to 2. The data and its metadata may be stored in the same storage node or in different storage nodes.
After the metadata is stored, when a storage node receives a read request, it can learn the physical address of the data to be read by reading the metadata. Specifically, when any storage node 104 receives a read request sent by the client server 101, the node 104 performs a hash calculation on the virtual address carried in the read request to obtain a hash value, thereby obtaining the metadata partition corresponding to the hash value and its metadata partition group. Assuming that the storage unit corresponding to the metadata partition group belongs to the first storage node, the storage node 104 that received the read request forwards the read request to the first storage node. The first storage node reads the metadata of the data to be read from the storage unit, then obtains, according to the metadata, the shards that make up the data to be read from the multiple storage nodes, verifies them, reassembles them into the data to be read, and returns it to the client server 101.
3. Capacity expansion
As more and more data is stored in the storage system 100, its free storage space gradually decreases, so the number of storage nodes in the storage system 100 needs to be increased; this process is called capacity expansion. After a new storage node ("new node" for short) joins the storage system 100, the storage system 100 migrates some partitions of the old storage nodes ("old nodes" for short) and the data corresponding to those partitions to the new node. For example, if the storage system 100 originally had 8 storage nodes and is expanded to 16 storage nodes, each of the original 8 storage nodes needs to migrate half of its partitions and the corresponding data to the 8 newly added storage nodes. To save bandwidth resources between storage nodes, one current practice is to migrate only the metadata partitions and their corresponding metadata, without migrating the data partitions. After the metadata is migrated to the new storage node, since the metadata records the correspondence between the logical address and the physical address of the data, even if the client server 101 sends a read request to the new node, the location of the data in the old node can be found through this correspondence and the data can be read. For example, if the metadata corresponding to metadata partition 1 is migrated to the new node, when the client server 101 sends a read request to the new node to read the data corresponding to data partition 1, although the data corresponding to data partition 1 has not been migrated to the new node, the physical address of the data to be read can still be found according to the metadata corresponding to metadata partition 1, and the data can be read from the old node.
In addition, during node capacity expansion, partitions and their data are usually migrated in units of partition groups. If the metadata corresponding to a metadata partition group is less than the metadata that describes the data corresponding to a data partition group, the same storage unit will be referenced by at least two metadata partition groups, which causes management inconvenience.
Usually, the number of partitions contained in a metadata partition group is smaller than the number of partitions contained in a data partition group. Referring to FIG. 3, each metadata partition group in FIG. 3 contains 32 partitions, and each data partition group contains 64 partitions. Exemplarily, data partition group 1 contains partition 0 to partition 63, and the data corresponding to partition 0 to partition 63 is all stored in storage unit 1; metadata partition group 1 contains partition 0 to partition 31, and metadata partition group 2 contains partition 32 to partition 63. It can be seen that all the partitions contained in metadata partition group 1 and metadata partition group 2 describe the data in storage unit 1. Before capacity expansion, metadata partition group 1 and metadata partition group 2 each point to storage unit 1. After the new node joins, suppose metadata partition group 1 in the old node and its corresponding metadata are migrated to the new storage node. After the migration, metadata partition group 1 no longer exists in the old node, its pointing relationship is deleted (indicated by a dotted arrow), and metadata partition group 1 in the new node points to storage unit 1. Metadata partition group 2 in the old node is not migrated and still points to storage unit 1. Therefore, after the expansion, storage unit 1 is referenced both by metadata partition group 2 in the old node and by metadata partition group 1 in the new node. When the data in storage unit 1 changes, the corresponding metadata must be found and modified in two storage nodes (the old node and the new node), which increases management complexity, especially the complexity of garbage collection operations.
To solve the above problem, in this embodiment the number of partitions contained in a metadata partition group is set to be greater than or equal to the number of partitions contained in a data partition group. In other words, the metadata corresponding to one metadata partition group is more than or equal to the metadata that describes the data corresponding to one data partition group. For example, each metadata partition group contains 64 partitions, and each data partition group contains 32 partitions. As shown in FIG. 4, metadata partition group 1 contains partition 0 to partition 63, data partition group 1 contains partition 0 to partition 31, and data partition group 2 contains partition 32 to partition 63. The data corresponding to data partition group 1 is stored in storage unit 1, and the data corresponding to data partition group 2 is stored in storage unit 2. Before capacity expansion, metadata partition group 1 in the old node points to storage unit 1 and storage unit 2. After capacity expansion, metadata partition group 1 and its corresponding metadata are migrated to the new storage node, so metadata partition group 1 in the new node points to storage unit 1 and storage unit 2; since metadata partition group 1 no longer exists in the old node, its pointing relationship is deleted (indicated by a dotted arrow). It can be seen that both storage unit 1 and storage unit 2 are referenced by only one metadata partition group, which simplifies management.
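The configuration rule above amounts to an invariant that can be checked programmatically. This is a hedged sketch with illustrative data structures (partition sets keyed by group id, matching the FIG. 4 example): every data partition group's partitions must be covered by exactly one metadata partition group, so each storage unit is referenced by a single metadata partition group only.

```python
# Metadata partition group 1 covers partitions 0-63 (as in FIG. 4).
metadata_groups = {1: set(range(0, 64))}
# Data partition group 1 covers partitions 0-31, group 2 covers 32-63.
data_groups = {1: set(range(0, 32)), 2: set(range(32, 64))}

def covering_metadata_group(data_partitions: set, metadata_groups: dict) -> int:
    """Return the id of the single metadata group covering a data group."""
    owners = [g for g, parts in metadata_groups.items() if data_partitions <= parts]
    assert len(owners) == 1, "a data group must be covered by exactly one metadata group"
    return owners[0]

# Both data groups (hence both storage units) are described by metadata group 1 only.
assert covering_metadata_group(data_groups[1], metadata_groups) == 1
assert covering_metadata_group(data_groups[2], metadata_groups) == 1
```

If the sizes were reversed (metadata groups smaller than data groups, as in FIG. 3), the subset test would fail, which is exactly the double-reference problem the configuration avoids.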
Therefore, in this embodiment, the metadata partition groups and data partition groups are configured before capacity expansion so that the number of partitions contained in a metadata partition group is greater than the number of partitions contained in a data partition group. After capacity expansion, the metadata partition group in the old node is split into at least two sub-metadata partition groups, and then at least one sub-metadata partition group and its corresponding metadata are migrated to the new node. The data partition group in the old node is then split into at least two sub-data partition groups, so that the number of partitions contained in a sub-metadata partition group is greater than or equal to the number of partitions contained in a sub-data partition group, in preparation for the next capacity expansion.
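The split-and-migrate step above can be sketched as follows (names are assumptions for illustration): the old node's 64-partition metadata group is split into two 32-partition sub-groups, and one sub-group, with its metadata, moves to the new node, while all data stays put.

```python
def split_group(partitions: list[int]) -> tuple[list[int], list[int]]:
    """Split a partition group into two equal sub-groups of consecutive partitions."""
    half = len(partitions) // 2
    return partitions[:half], partitions[half:]

old_node_meta = list(range(0, 64))        # metadata group 1: partitions 0-63
keep, migrate = split_group(old_node_meta)
new_node_meta = migrate                   # partitions 32-63 (and their metadata) move

assert keep == list(range(0, 32)) and new_node_meta == list(range(32, 64))
```

After the split, the sub-group kept by the old node still fully covers data partition group 1 (partitions 0-31), and the migrated sub-group fully covers data partition group 2 (partitions 32-63), so each storage unit remains described by metadata on exactly one node.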
A specific example is used below to describe the capacity expansion process. Referring to FIG. 5, FIG. 5 shows the distribution of metadata partition groups across the storage nodes before capacity expansion.
In this embodiment, the number of partition groups allocated to each storage node may be preset. When a storage node contains multiple processing units, in order to distribute read and write requests evenly across the processing units, the embodiments of the present invention may also set each processing unit to correspond to a certain number of partition groups, where a processing unit refers to a CPU within a node, as shown in Table 1:
Number of storage nodes | Number of processing units | Number of partition groups
3                       | 24                         | 144
4                       | 32                         | 192
5                       | 40                         | 240
6                       | 48                         | 288
7                       | 56                         | 336
8                       | 64                         | 384
9                       | 72                         | 432
10                      | 80                         | 480
11                      | 88                         | 528
12                      | 96                         | 576
13                      | 104                        | 624
14                      | 112                        | 672
15                      | 120                        | 720

Table 1
Table 1 describes the relationship among nodes, the processing units of the nodes, and partition groups. For example, if each node has 8 processing units and each processing unit is allocated 6 partition groups, then each node is allocated 48 partition groups. Suppose the storage system 100 has 3 storage nodes before capacity expansion, so it has 144 partition groups. As described above, the total number of partitions is configured when the storage system 100 is initialized; for example, the total is 4096 partitions. If the 4096 partitions were distributed evenly among the 144 partition groups, each partition group would need 4096/144 = 28.44 partitions. However, the number of partitions contained in each partition group must be an integer and a power of 2 (2^N, where N is an integer greater than or equal to 0). Therefore, the 4096 partitions cannot be distributed absolutely evenly among the 144 partition groups. What can be determined is that 28.44 is less than 32 (2 to the 5th power) and greater than 16 (2 to the 4th power). Therefore, among the 144 partition groups, X first partition groups contain 32 partitions each and Y second partition groups contain 16 partitions each, where X and Y satisfy the equations 32X + 16Y = 4096 and X + Y = 144.
Solving these two equations gives X = 112 and Y = 32. This means that among the 144 partition groups there are 112 first partition groups and 32 second partition groups, where each first partition group contains 32 partitions and each second partition group contains 16 partitions. The number of first partition groups configured per processing unit is then calculated from the total number of first partition groups and the total number of processing units (112/(3*8) = 4 remainder 16), and the number of second partition groups configured per processing unit is calculated from the total number of second partition groups and the total number of processing units (32/(3*8) = 1 remainder 8). It follows that each processing unit is configured with at least 4 first partition groups and at least 1 second partition group, and the remaining 16 first partition groups and 8 second partition groups are distributed as evenly as possible across the 3 nodes (as shown in FIG. 5).
Please refer to FIG. 6, which shows the distribution of metadata partition groups on each storage node after capacity expansion. Assume that 2 storage nodes are newly added, so the storage system 100 now has 5 storage nodes. According to Table 1, the 5 storage nodes have 40 processing units in total, and each processing unit is configured with 6 partition groups, so the 5 storage nodes hold 240 partition groups in total. The total number of partitions is still 4096. If the 4096 partitions were distributed evenly among the 240 partition groups, each partition group would need to hold 4096/240 ≈ 17.07 partitions. However, the number of partitions contained in each partition group must be an integer and a power of 2 (2^N, where N is an integer greater than or equal to 0). Therefore, the 4096 partitions cannot be distributed absolutely evenly among the 240 partition groups. What can be determined is that 17.07 is less than 32 (2 to the 5th power) and greater than 16 (2 to the 4th power), so among these 240 partition groups, X first partition groups contain 32 partitions each and Y second partition groups contain 16 partitions each, where X and Y satisfy the following equations: 32X + 16Y = 4096 and X + Y = 240.
Solving the two equations above gives X = 16 and Y = 224. This means that among the 240 partition groups there are 16 first partition groups and 224 second partition groups, where each first partition group contains 32 partitions and each second partition group contains 16 partitions. The number of first partition groups configured per processing unit is then calculated from the total number of first partition groups and the total number of processing units (16/(5×8) = 0 with a remainder of 16), and the number of second partition groups per processing unit is calculated from the total number of second partition groups and the total number of processing units (224/(5×8) = 5 with a remainder of 24). It follows that only 16 processing units are configured with a first partition group, each processing unit is configured with at least 5 second partition groups, and the remaining 24 second partition groups are distributed as evenly as possible across the 5 nodes (as shown in FIG. 6).
Based on the partition layout of the 3 nodes before expansion and the partition layout of the 5 nodes after expansion, some of the first partition groups on the 3 pre-expansion nodes can each be split into two second partition groups, and then, following the per-node partition distribution shown in FIG. 6, some first partition groups and second partition groups are migrated from the 3 nodes to node 4 and node 5. For example, as shown in FIG. 5, the storage system 100 has 112 first partition groups before expansion but only 16 after expansion, so 96 of the 112 first partition groups need to be split. The 96 first partition groups split into 192 second partition groups, so after the split the 3 nodes hold 16 first partition groups and 224 second partition groups in total; each node then migrates some of its first partition groups and second partition groups to node 4 and node 5. Taking processing unit 1 of node 1 as an example: as shown in FIG. 5, before expansion processing unit 1 is configured with 4 first partition groups and 3 second partition groups, and as shown in FIG. 6, after expansion processing unit 1 is configured with 1 first partition group and 5 second partition groups. This means that 3 first partition groups in processing unit 1 need to be migrated out, either directly or after being split into second partition groups. As for how many of these 3 first partition groups are migrated directly to the newly added nodes, and how many are split first and then migrated, this embodiment imposes no restriction, as long as the partition distribution shown in FIG. 6 is satisfied after the migration. The processing units of the remaining nodes migrate and split along the same lines.
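The rebalancing arithmetic above can be summarized in a small sketch (hypothetical helper; each 32-partition first group that must disappear yields two 16-partition second groups):

```python
def split_plan(first_before, first_after, second_before):
    """Number of 32-partition first groups to split, and the resulting
    total of 16-partition second groups (each split yields two)."""
    to_split = first_before - first_after
    return to_split, second_before + 2 * to_split

# FIG. 5 -> FIG. 6 example: 112 first groups before, 16 after,
# 32 second groups before the expansion.
print(split_plan(112, 16, 32))   # (96, 224)
```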
In the example above, the 3 pre-expansion storage nodes first split some first partition groups into second partition groups and then migrate them to the new nodes. In another implementation, some first partition groups may instead be migrated to the new nodes first and split afterwards; the partition distribution shown in FIG. 6 can be reached either way.
It should be noted that the description above and the example of FIG. 5 concern metadata partition groups. For data partition groups, however, the number of data partitions contained in each data partition group needs to be smaller than the number of metadata partitions contained in the metadata partition group. Therefore, after the migration is complete, the data partition group needs to be split, and the number of data partitions contained in each sub-data partition group after the split must be smaller than the number of metadata partitions contained in the sub-metadata partition group. The purpose of the split is to ensure that the metadata corresponding to the current metadata partition group always contains the metadata describing the data corresponding to the current data partition group. Since in the example above some metadata partition groups contain 32 metadata partitions and some contain 16, the number of data partitions contained in a sub-data partition group obtained after the split may be 16, 8, 4, or 2; in any case it cannot exceed 16.
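As a sketch of the size constraint just described, the admissible sub-data partition group sizes are the powers of two from the smallest metadata partition group size (16) down to 2 (illustrative helper, not prescribed by the patent):

```python
def sub_data_group_sizes(max_size=16):
    """Admissible power-of-two sizes for a sub-data partition group,
    from the smallest metadata partition group size down to 2."""
    sizes = []
    size = max_size
    while size >= 2:
        sizes.append(size)
        size //= 2
    return sizes

print(sub_data_group_sizes())   # [16, 8, 4, 2]
```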
4. Garbage Collection
When the storage system 100 contains a large amount of garbage data, garbage collection may be started. In this embodiment, garbage collection is performed in units of storage units. A storage unit is selected as the object of garbage collection, the valid data in this storage unit is migrated to a new storage unit, and the storage space occupied by the original storage unit is then released. The selected storage unit needs to meet certain conditions, for example: the garbage data contained in the storage unit reaches a first set threshold; or the storage unit is the one containing the most garbage data among the multiple storage units; or the valid data contained in the storage unit is below a second set threshold; or the storage unit is the one containing the least valid data among the multiple storage units. For convenience of description, this embodiment refers to the storage unit selected for garbage collection as the first storage unit, or storage unit 1.
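The selection conditions above can be sketched as a simple policy. The data model (a map from unit id to garbage and valid byte counts) and the threshold parameter are assumptions for illustration, not the patent's prescribed interface:

```python
def pick_gc_candidate(units, min_garbage=None):
    """Pick the storage unit with the most garbage bytes; if a
    threshold is given, return None when no unit is dirty enough.
    `units` maps unit id -> (garbage_bytes, valid_bytes)."""
    candidate = max(units, key=lambda u: units[u][0])
    if min_garbage is not None and units[candidate][0] < min_garbage:
        return None  # no unit worth collecting yet
    return candidate

units = {"unit1": (700, 300), "unit2": (200, 800)}
print(pick_gc_candidate(units))                    # unit1
print(pick_gc_candidate(units, min_garbage=1000))  # None
```

The "least valid data" variant is the same pattern with `key=lambda u: units[u][1]` and `min`.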
With reference to FIG. 3, garbage collection on storage unit 1 is taken as an example to illustrate the conventional garbage collection method. Garbage collection is executed by the storage node to which storage unit 1 belongs (continuing with the first storage node as the example). The first storage node reads the valid data from storage unit 1 and writes it into a new storage unit. It then marks all data in storage unit 1 as invalid and sends a delete request to the storage node where each shard is located to delete the shard. Finally, the first storage node also needs to modify the metadata describing the data in storage unit 1. As can be seen from FIG. 3, both the metadata corresponding to metadata partition group 2 and the metadata corresponding to metadata partition group 1 describe the data in storage unit 1, and metadata partition group 2 and metadata partition group 1 are located on different storage nodes. Therefore, the first storage node needs to modify metadata on two storage nodes separately; during the modification, multiple read and write requests are generated between the nodes, which heavily consumes inter-node bandwidth.
With reference to FIG. 4, garbage collection on storage unit 2 is taken as an example to illustrate the garbage collection method of this embodiment of the present invention. Garbage collection is executed by the storage node to which storage unit 2 belongs (taking the second storage node as an example). The second storage node reads the valid data from storage unit 2 and writes it into a new storage unit. It then marks all data in storage unit 2 as invalid and sends a delete request to the storage node where each shard is located to delete the shard. Finally, the second storage node also needs to modify the metadata describing the data in storage unit 2. As can be seen from FIG. 4, storage unit 2 is referenced only by metadata partition group 1; that is, only the metadata corresponding to metadata partition group 1 describes the data in storage unit 2. Therefore, the second storage node only needs to send a request to the storage node where metadata partition group 1 is located to modify the metadata. Compared with the preceding example, since the second storage node only needs to modify metadata on one storage node, inter-node bandwidth is greatly saved.
The node expansion method provided in this embodiment is described below with reference to a flowchart. Please refer to FIG. 7, which is a flowchart of the node expansion method. The method is applied to the storage system shown in FIG. 1, which includes multiple first nodes. A first node is a node that already exists in the storage system before capacity expansion; for details, refer to the nodes 104 shown in FIG. 1 or FIG. 2. Each first node may perform the node expansion method according to the steps shown in FIG. 7.
S701: Configure the data partition group and the metadata partition group of the first node. The data partition group includes multiple data partitions, and the metadata partition group includes multiple metadata partitions. After configuration, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. "Subset" here has two implications: first, the metadata corresponding to the metadata partition group contains the metadata describing the data corresponding to the data partition group; second, the number of metadata partitions contained in the metadata partition group is greater than the number of data partitions contained in the data partition group. For example, the data partition group contains M data partitions: data partition 1, data partition 2, ..., data partition M. The metadata partition group contains N metadata partitions, with N greater than M: metadata partition 1, metadata partition 2, ..., metadata partition M, ..., metadata partition N. According to the foregoing description, the metadata corresponding to metadata partition 1 describes the data corresponding to data partition 1, the metadata corresponding to metadata partition 2 describes the data corresponding to data partition 2, and the metadata corresponding to metadata partition M describes the data corresponding to data partition M. Therefore, the metadata partition group contains all the metadata describing the data corresponding to the M data partitions; in addition, it also contains metadata describing data corresponding to other data partition groups.
The first node described in S701 is the old node described in the capacity expansion section. In addition, it should be noted that the first node may have one or more data partition groups and, likewise, one or more metadata partition groups.
S702: When a second node joins the storage system, split the metadata partition group into at least two sub-metadata partition groups. When the first node contains one metadata partition group, this metadata partition group needs to be split into at least two sub-metadata partition groups. When the first node contains multiple metadata partition groups, possibly only some of them need to be split, while the remaining metadata partition groups keep their original metadata partitions. Which metadata partition groups need to be split, and how, can be determined from the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout. The post-expansion layout includes the number of sub-metadata partition groups configured on each node of the storage system after the second node joins, and the number of metadata partitions contained in each sub-metadata partition group; the pre-expansion layout includes the number of metadata partition groups configured on the first node before the second node joins, and the number of metadata partitions contained in each metadata partition group. For a specific implementation, refer to the description related to FIG. 5 and FIG. 6 in the capacity expansion section.
In actual implementation, the split is a change of mapping relationships. Specifically, before the split there is a mapping between the identifier of the original metadata partition group and the identifier of each metadata partition it contains. After the split, identifiers of at least two sub-metadata partition groups are added, the mappings between the identifiers of the metadata partitions and the identifier of the original metadata partition group are deleted, and new mappings are created: between the identifiers of one part of the metadata partitions of the original group and the identifier of one sub-metadata partition group, and between the identifiers of the remaining metadata partitions and the identifier of the other sub-metadata partition group.
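Because splitting is only a change of mapping relationships, it can be sketched as pure bookkeeping on a group-id-to-partition-ids map (hypothetical identifiers; no partition data moves):

```python
def split_metadata_group(group_map, old_id, new_ids):
    """Replace one group-id -> partition-ids entry with two entries,
    each owning half of the original partition identifiers."""
    partitions = group_map.pop(old_id)         # delete the old mapping
    half = len(partitions) // 2
    group_map[new_ids[0]] = partitions[:half]  # create the two new mappings
    group_map[new_ids[1]] = partitions[half:]
    return group_map

m = {"mpg1": [0, 1, 2, 3]}   # a 4-partition group, for brevity
print(split_metadata_group(m, "mpg1", ("mpg1a", "mpg1b")))
# {'mpg1a': [0, 1], 'mpg1b': [2, 3]}
```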
S703: Migrate one sub-metadata partition group and its corresponding metadata to the second node. The second node is the new node described in the capacity expansion section.
Migrating a partition group is a change of ownership: migrating the sub-metadata partition group to the second node means modifying the correspondence between the sub-metadata partition group and the first node into a correspondence with the second node. Migrating metadata, by contrast, is an actual movement of data: migrating the metadata corresponding to the sub-metadata partition group to the second node means copying the metadata to the second node and deleting the copy retained on the first node.
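The distinction drawn above, partition-group migration as an ownership change versus metadata migration as a copy-then-delete, can be sketched as follows (hypothetical data structures, for illustration only):

```python
def migrate_group(ownership, metadata, group_id, src, dst):
    """Group migration: re-point the group's owner to dst.
    Metadata migration: copy the entries to dst, then delete at src."""
    ownership[group_id] = dst
    metadata[dst][group_id] = metadata[src].pop(group_id)
    return ownership, metadata

own = {"mpg1a": "node1"}
meta = {"node1": {"mpg1a": {"key": "physical-address"}}, "node2": {}}
own, meta = migrate_group(own, meta, "mpg1a", "node1", "node2")
print(own["mpg1a"])   # node2
print(meta["node1"])  # {}
```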
Because the data partition group and the metadata partition group of the first node were configured in S701 such that the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group, even after the metadata partition group is split into at least two sub-metadata partition groups, the metadata of the data corresponding to the data partition group is still a subset of the metadata corresponding to one of the sub-metadata partition groups. Thus, after one sub-metadata partition group and its corresponding metadata are migrated to the second node, the data corresponding to the data partition group is still described by metadata stored on a single node. This avoids modifying metadata on different nodes when the data is modified, in particular when garbage collection is performed.
To ensure that at the next capacity expansion the metadata of the data corresponding to the data partition group is still a subset of the metadata corresponding to a sub-metadata partition group, S704 may further be performed after S703: split the data partition group in the first node into at least two sub-data partition groups, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to a sub-metadata partition group. The definition of splitting here is the same as in S702.
In the node expansion method provided in FIG. 7, data and its metadata are stored on the same node. In another scenario, however, data and its metadata are stored on different nodes. For a given node, although it may also include a data partition group and a metadata partition group, the metadata corresponding to the metadata partition group may not be the metadata of the data corresponding to that data partition group, but rather the metadata of data stored on other nodes. In this scenario, each first node still needs to configure its data partition groups and metadata partition groups so that, after configuration, the number of metadata partitions contained in a metadata partition group is greater than the number of data partitions contained in a data partition group. After the second node joins the storage system, each first node splits its metadata partition group as described in S702 and migrates one of the resulting sub-metadata partition groups to the second node. Since every first node has configured its data partition groups and metadata partition groups in this way, after the migration the data corresponding to a data partition group is still described by metadata stored on a single node. As a specific example, data is stored on the first node while the metadata of that data is stored on a third node. The first node then configures the data partition group corresponding to the data, and the third node configures the metadata partition group corresponding to the metadata. After configuration, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the configured metadata partition group. When the second node joins the storage system, the third node splits the metadata partition group into at least two sub-metadata partition groups and migrates the first sub-metadata partition group among them, together with its corresponding metadata, to the second node.
In addition, in the node expansion method provided in FIG. 7, the number of data partitions contained in a data partition group is smaller than the number of metadata partitions contained in a metadata partition group. In another scenario, the two numbers are equal. In that case, when the second node joins the storage system, the metadata partition groups do not need to be split; instead, some of the multiple metadata partition groups on the first node, together with their corresponding metadata, are migrated directly to the second node. This scenario likewise divides into two cases. Case 1: if the data and its metadata are stored on the same node, each first node needs to ensure that the metadata corresponding to a metadata partition group contains only the metadata of the data corresponding to the data partition group on that node. Case 2: if the data and its metadata are stored on different nodes, each first node needs to configure the number of metadata partitions contained in a metadata partition group to equal the number of data partitions contained in a data partition group. In either case, the metadata partition groups do not need to be split; some of the multiple metadata partition groups on the node and their corresponding metadata are migrated directly to the second node. However, this scenario does not apply to a node that contains only one metadata partition group.
In addition, in all the scenarios to which the node expansion method of this embodiment applies, the data partition groups and their corresponding data do not need to be migrated to the second node. If the second node receives a read request, it can find the physical address of the data to be read from the metadata it stores, and then read the data. Since the amount of metadata is far smaller than the amount of data, avoiding migrating the data to the second node greatly saves inter-node bandwidth.
This embodiment further provides a node capacity expansion apparatus. FIG. 8 is a schematic structural diagram of the apparatus, which includes a configuration module 801, a splitting module 802, and a migration module 803.
The configuration module 801 is configured to configure the data partition group and the metadata partition group of the first node in the storage system, where the data partition group contains multiple data partitions, the metadata partition group contains multiple metadata partitions, and the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to the metadata partition group. For details, refer to the description of S701 in FIG. 7.
The splitting module 802 is configured to split the metadata partition group into at least two sub-metadata partition groups when the second node joins the storage system. For details, refer to the description of S702 in FIG. 7 and the description related to FIG. 5 and FIG. 6 in the capacity expansion section.
The migration module 803 is configured to migrate one of the at least two sub-metadata partition groups and its corresponding metadata to the second node. For details, refer to the description of S703 in FIG. 7.
Optionally, the apparatus further includes an obtaining module 804 configured to obtain the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout. The post-expansion layout includes the number of sub-metadata partition groups configured on each node of the storage system after the second node joins, and the number of metadata partitions contained in each sub-metadata partition group; the pre-expansion layout includes the number of metadata partition groups configured on the first node before the second node joins, and the number of metadata partitions contained in each metadata partition group. The splitting module 802 is specifically configured to split the metadata partition group into at least two sub-metadata partition groups according to the post-expansion and pre-expansion metadata partition group layouts.
Optionally, the splitting module 802 is further configured to split the data partition group into at least two sub-data partition groups after the at least one sub-metadata partition group and its corresponding metadata are migrated to the second node, where the metadata of the data corresponding to each sub-data partition group is a subset of the metadata corresponding to a sub-metadata partition group.
Optionally, the configuration module 801 is further configured to keep the data corresponding to the data partition group stored on the first node when the second node joins the storage system.
This embodiment further provides a storage node, which may be a storage array or a server. When the storage node is a storage array, it includes a storage controller and a storage medium; for the structure of the storage controller, refer to the schematic structural diagram of FIG. 9. When the storage node is a server, FIG. 9 may also be referenced. In either form, the storage node includes at least a processor 901 and a memory 902. The memory 902 stores a program 903. The processor 901, the memory 902, and a communication interface are connected through a system bus and communicate with one another.
The processor 901 is a single-core or multi-core central processing unit, or an application-specific integrated circuit, or one or more integrated circuits configured to implement the embodiments of the present invention. The memory 902 may be a high-speed RAM memory or a non-volatile memory, for example, at least one hard disk memory. The memory 902 is used to store computer-executable instructions, which may include the program 903. When the storage node runs, the processor 901 runs the program 903 to execute the method flow of S701-S704 shown in FIG. 7.
The functions of the configuration module 801, the splitting module 802, the migration module 803, and the obtaining module 804 shown in FIG. 8 may be performed by the processor 901 running the program 903, or by the processor 901 alone.
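As an illustrative sketch only (not part of the claimed embodiments), the behavior of the splitting and migration modules can be modeled as follows. All class, function, and variable names here are hypothetical; the sketch assumes each metadata partition group is simply a list of partition IDs and shows the key property of the method: metadata moves to the new node while the data itself stays on the original node.

```python
# Hypothetical model of the split-and-migrate flow: when a second node joins,
# the metadata partition group on the first node is split into sub-groups and
# one sub-group (with its metadata) is migrated, while data partitions and the
# data they hold remain where they are, so no data-migration bandwidth is used.

def split_partition_group(partitions, num_subgroups):
    """Split a list of partition IDs into contiguous sub-partition groups."""
    size = len(partitions) // num_subgroups
    return [partitions[i * size:(i + 1) * size] for i in range(num_subgroups)]

class Node:
    def __init__(self, name):
        self.name = name
        self.metadata = {}        # partition ID -> metadata blob
        self.data_partitions = [] # data partitions: never migrated here

def expand(first_node, second_node, metadata_partition_group):
    # Split the metadata partition group into two sub-metadata partition groups.
    sub_groups = split_partition_group(metadata_partition_group, 2)
    # Migrate the first sub-group and its metadata to the new node; the data
    # corresponding to the data partition group stays on the first node.
    for pid in sub_groups[0]:
        second_node.metadata[pid] = first_node.metadata.pop(pid)
    return sub_groups

# Example: metadata partitions 0-7 form one partition group on node A.
a, b = Node("A"), Node("B")
a.metadata = {pid: f"meta-{pid}" for pid in range(8)}
groups = expand(a, b, list(range(8)))
print(groups)              # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(sorted(b.metadata))  # [0, 1, 2, 3]
```

This sketches only the bookkeeping; a real system would additionally update the partition-routing tables so that requests for the migrated metadata partitions are directed to the second node.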
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, storage node, or data center to another website, computer, storage node, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a storage node or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
It should be understood that, in the embodiments of this application, terms such as "first" merely refer to objects and do not indicate an order among the corresponding objects.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division of logical functions, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a storage node, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A node capacity expansion method in a storage system, wherein the storage system comprises a first node, and the method comprises:
    configuring a data partition group and a metadata partition group of the first node, wherein the data partition group comprises multiple data partitions, the metadata partition group comprises multiple metadata partitions, and metadata of data corresponding to the data partition group is a subset of metadata corresponding to the metadata partition group;
    when a second node joins the storage system, splitting the metadata partition group into at least two sub-metadata partition groups; and
    migrating a first sub-metadata partition group of the at least two sub-metadata partition groups and metadata corresponding to the first sub-metadata partition group to the second node.
  2. The method according to claim 1, wherein before the splitting of the metadata partition group into at least two sub-metadata partition groups, the method further comprises:
    obtaining a post-expansion metadata partition group layout and a pre-expansion metadata partition group layout, wherein
    the post-expansion metadata partition group layout comprises: the number of sub-metadata partition groups configured on each node in the storage system after the second node joins the storage system, and the number of metadata partitions comprised in each sub-metadata partition group after the second node joins the storage system, and
    the pre-expansion metadata partition group layout comprises: the number of metadata partition groups configured on the first node before the second node joins the storage system, and the number of metadata partitions comprised in the metadata partition group before the second node joins the storage system; and
    wherein the splitting of the metadata partition group into at least two sub-metadata partition groups comprises:
    splitting the metadata partition group according to the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout, wherein the number of sub-metadata partition groups after the splitting is at least two.
  3. The method according to claim 1 or 2, further comprising, after the migrating of the first sub-metadata partition group and the metadata corresponding to the first sub-metadata partition group to the second node:
    splitting the data partition group into at least two sub-data partition groups, wherein metadata of data corresponding to each sub-data partition group is a subset of metadata corresponding to any one of the at least two sub-metadata partition groups.
  4. The method according to any one of claims 1 to 3, further comprising:
    after the second node joins the storage system, keeping the data corresponding to the data partition group stored in the first node.
  5. The method according to claim 1, wherein the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups.
  6. A node capacity expansion apparatus, wherein the apparatus is located in a storage system and comprises:
    a configuration module, configured to configure a data partition group and a metadata partition group of a first node in the storage system, wherein the data partition group comprises multiple data partitions, the metadata partition group comprises multiple metadata partitions, and metadata of data corresponding to the data partition group is a subset of metadata corresponding to the metadata partition group;
    a splitting module, configured to split the metadata partition group into at least two sub-metadata partition groups when a second node joins the storage system; and
    a migration module, configured to migrate a first sub-metadata partition group of the at least two sub-metadata partition groups and metadata corresponding to the first sub-metadata partition group to the second node.
  7. The apparatus according to claim 6, further comprising an obtaining module, wherein
    the obtaining module is configured to obtain a post-expansion metadata partition group layout and a pre-expansion metadata partition group layout, wherein the post-expansion metadata partition group layout comprises: the number of sub-metadata partition groups configured on each node in the storage system after the second node joins the storage system, and the number of metadata partitions comprised in each sub-metadata partition group after the second node joins the storage system; and the pre-expansion metadata partition group layout comprises: the number of metadata partition groups configured on the first node before the second node joins the storage system, and the number of metadata partitions comprised in the metadata partition group before the second node joins the storage system; and
    the splitting module is specifically configured to split the metadata partition group according to the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout, wherein the number of sub-metadata partition groups after the splitting is at least two.
  8. The apparatus according to claim 6 or 7, wherein the splitting module is further configured to, after the first sub-metadata partition group and the metadata corresponding to the first sub-metadata partition group are migrated to the second node, split the data partition group into at least two sub-data partition groups, wherein metadata of data corresponding to each sub-data partition group is a subset of metadata corresponding to any one of the at least two sub-metadata partition groups.
  9. The apparatus according to any one of claims 6 to 8, wherein the configuration module is further configured to keep the data corresponding to the data partition group stored in the first node after the second node joins the storage system.
  10. A storage node, wherein the storage node is located in a storage system, the storage node comprises a processor and a memory, the memory stores a program, and the processor runs the program to perform the method according to any one of claims 1 to 4.
  11. A storage system, comprising a first node and a third node, wherein
    the first node is configured to configure a data partition group of the first node, the data partition group comprising multiple data partitions;
    the third node is configured to configure a metadata partition group of the third node, the metadata partition group comprising multiple metadata partitions;
    metadata of data corresponding to the configured data partition group is a subset of metadata corresponding to the configured metadata partition group; and
    when a second node joins the storage system, the third node is further configured to split the metadata partition group into at least two sub-metadata partition groups, and to migrate a first sub-metadata partition group of the at least two sub-metadata partition groups and metadata corresponding to the first sub-metadata partition group to the second node.
  12. The storage system according to claim 11, wherein
    the third node is further configured to obtain a post-expansion metadata partition group layout and a pre-expansion metadata partition group layout, wherein the post-expansion metadata partition group layout comprises: the number of sub-metadata partition groups configured on each node in the storage system after the second node joins the storage system, and the number of metadata partitions comprised in each sub-metadata partition group after the second node joins the storage system; and the pre-expansion metadata partition group layout comprises: the number of metadata partition groups configured on the third node before the second node joins the storage system, and the number of metadata partitions comprised in the metadata partition group before the second node joins the storage system; and
    when the second node joins the storage system, the third node is specifically configured to split the metadata partition group according to the post-expansion metadata partition group layout and the pre-expansion metadata partition group layout, wherein the number of sub-metadata partition groups after the splitting is at least two.
  13. The storage system according to claim 11 or 12, wherein
    the third node is further configured to, after the first sub-metadata partition group and the metadata corresponding to the first sub-metadata partition group are migrated to the second node, split the data partition group into at least two sub-data partition groups, wherein metadata of data corresponding to each sub-data partition group is a subset of metadata corresponding to any one of the at least two sub-metadata partition groups.
  14. The storage system according to any one of claims 11 to 13, wherein
    after the second node joins the storage system, the first node is further configured to keep the data corresponding to the data partition group stored in the first node.
  15. The storage system according to claim 11, wherein the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two sub-metadata partition groups.
PCT/CN2019/111888 2018-10-25 2019-10-18 Node expansion method in storage system and storage system WO2020083106A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19877173.5A EP3859506B1 (en) 2018-10-25 2019-10-18 Node expansion method in storage system and storage system
US17/239,194 US20210278983A1 (en) 2018-10-25 2021-04-23 Node Capacity Expansion Method in Storage System and Storage System

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811249893 2018-10-25
CN201811249893.8 2018-10-25
CN201811571426.7A CN111104057B (en) 2018-10-25 2018-12-21 Node capacity expansion method in storage system and storage system
CN201811571426.7 2018-12-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/239,194 Continuation US20210278983A1 (en) 2018-10-25 2021-04-23 Node Capacity Expansion Method in Storage System and Storage System

Publications (1)

Publication Number Publication Date
WO2020083106A1 2020-04-30


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111756828A (en) * 2020-06-19 2020-10-09 广东浪潮大数据研究有限公司 Data storage method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102739622A (en) * 2011-04-15 2012-10-17 北京兴宇中科科技开发股份有限公司 Expandable data storage system
US20130041904A1 (en) * 2011-08-10 2013-02-14 Goetz Graefe Computer indexes with multiple representations
CN103036796A (en) * 2011-09-29 2013-04-10 阿里巴巴集团控股有限公司 Method and device for updating routing information
CN103310000A (en) * 2013-06-25 2013-09-18 曙光信息产业(北京)有限公司 Metadata management method
CN104378447A (en) * 2014-12-03 2015-02-25 深圳市鼎元科技开发有限公司 Non-migration distributed storage method and non-migration distributed storage system on basis of Hash ring
CN104636286A (en) * 2015-02-06 2015-05-20 华为技术有限公司 Data access method and equipment





Legal Events

- 121 — EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 19877173; country of ref document: EP; kind code of ref document: A1)
- NENP — Non-entry into the national phase (ref country code: DE)
- ENP — Entry into the national phase (ref document number: 2019877173; country of ref document: EP; effective date: 20210426)