WO2023125507A1 - Method, apparatus and device for generating block groups - Google Patents

Method, apparatus and device for generating block groups

Info

Publication number
WO2023125507A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
stripe
data
target
source
Prior art date
Application number
PCT/CN2022/142246
Other languages
English (en)
French (fr)
Inventor
陈飘
王锋
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111666642.1A external-priority patent/CN116414294A/zh
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023125507A1 publication Critical patent/WO2023125507A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • The present invention relates to the field of storage technology, and in particular to a method, apparatus and system for generating block groups.
  • A widely used storage method is to store data, in the form of block groups, in a storage system that includes multiple storage devices; a block group contains multiple fragments related to each other by an erasure code (EC).
  • In EC technology, the original data is split into n data fragments, m parity fragments are generated from the n data fragments, and the association between them is established in the form of a stripe.
  • The n+m fragments in a stripe form a data parity group (a minimal sketch follows below).
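To make the n+m relationship concrete, here is a minimal sketch using a single XOR parity fragment (m = 1). This is not the patent's algorithm; real systems use codes such as Reed-Solomon that tolerate m > 1 losses, and the function names here are illustrative.

```python
def encode(fragments: list[bytes]) -> bytes:
    """Compute one parity fragment as the byte-wise XOR of equal-length fragments."""
    parity = bytearray(len(fragments[0]))
    for frag in fragments:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return bytes(parity)

# n = 3 data fragments, m = 1 parity fragment
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(data)

# Lose one data fragment; the survivors plus the parity recover it,
# because XOR-ing everything except the lost fragment yields that fragment.
recovered = encode([data[0], data[2], parity])
assert recovered == data[1]
```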
  • As enterprises continuously produce new data, the storage space provided by the original storage devices in the storage system may become insufficient.
  • Enterprises therefore frequently face repeated expansion of the storage system, and after an expansion the storage system can usually adopt a more space-efficient redundancy ratio to improve the utilization of the storage devices' space.
  • However, the data stored before the expansion remains at the old redundancy ratio, so the utilization of the storage space of the entire storage system needs to be improved.
  • One method provided by the prior art is to read the data fragments of the old block groups from the storage devices and rewrite them, as new data, at the more space-efficient redundancy ratio to form new block groups.
  • In this method, however, a large number of data fragments must be read and then written back into the storage devices, which burdens the performance of the storage system.
  • In a first aspect, a method for generating a block group is provided, including: acquiring a first source stripe from a first source block group, the first source stripe including a first data stripe unit group that owns a first data fragment group and a first parity stripe unit group that owns a first parity fragment group; acquiring a second source stripe from a second source block group, the second source stripe including a second data stripe unit group that owns a second data fragment group and a second parity stripe unit group that owns a second parity fragment group; generating a target stripe, the target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the first data stripe unit group and the second data stripe unit group, and the target parity stripe unit group owns a target parity fragment group that has a parity relationship with the set formed by the first and second data fragment groups; and storing the target parity fragment group to a storage device group.
  • Among multiple fragments that have a parity relationship, when a small number of fragments are lost, the remaining fragments can recover the lost ones using the parity algorithm. The parity relationship between the target parity fragment group and the set formed by the first and second data fragment groups therefore means that, in the fragment set composed of the target parity fragment group, the first data fragment group and the second data fragment group, losing a small number of fragments does not cause real data loss.
  • The remaining fragments in the set can recover the lost fragments through data reconstruction. The number of fragments that can be reconstructed is determined by the parity algorithm: in a parity relationship set composed of n data fragments and m parity fragments, at most m fragments can be recovered at a time.
  • This scheme generates a new block group from the first and second block groups, which is equivalent to merging block groups.
  • Moreover, the first data fragment group and the second data fragment group remain protected without relying on the parity fragments of the first source block group and the second source block group, so storage space can be saved.
  • This solution applies to the scenario of expanding the capacity of a storage system: after the number of storage devices in the storage system increases, the solution improves the utilization of the storage device group.
  • In a first possible implementation of the first aspect, both the first source block group and the second source block group are located on the storage device group that existed before the storage system was expanded, and the storing step specifically includes: storing the parity fragments of the target stripe onto the storage devices newly added to the storage device group by the expansion.
  • When the data fragments all reside on the old storage devices, storing the parity fragments of the target stripe on the newly added storage devices prevents the parity fragments and the data fragments from being located on the same storage devices, which further reduces the possibility of data loss.
  • In a second possible implementation of the first aspect, the first source block group is located on the storage device group that existed before the storage system was expanded, and the second source block group is a block group that has not yet been stored to the storage device group; the step of storing the target parity fragment group specifically includes: storing the parity fragments of the target stripe onto the storage devices newly added after the storage system is expanded.
  • In a third possible implementation of the first aspect, a read request for a target stripe unit in the target stripe is received; according to the data stripe unit group pointed to by the target data stripe unit, the stripe unit corresponding to the target stripe unit is determined, the determined stripe unit belonging to the first data stripe unit group or the second data stripe unit group; and the data fragment owned by the determined stripe unit is acquired.
  • This solution provides a process for reading data fragments directly.
  • In a fourth possible implementation of the first aspect, when a data fragment in the first data fragment group or the second data fragment group is lost, the target parity fragment group and the data fragments not lost in the first and second source stripes are read, and the lost data fragment is reconstructed. This provides a process for reconstructing lost data fragments when they cannot be read directly.
  • The first aspect introduced the process of merging two source stripes to generate a new stripe.
  • In a fifth possible implementation of the first aspect, a new block group can be generated again from the target stripe generated in the first aspect and the stripes of other block groups.
  • To distinguish it from the original target stripe, the stripe generated by re-merging is called the new target stripe.
  • Naming the block group where the target stripe is located the target block group, the method further includes: acquiring a third source stripe from a third source block group, the third source stripe including a third data stripe unit group that owns a third data fragment group and a third parity stripe unit group that owns a third parity fragment group; generating a new target stripe from the target stripe and the third source stripe, the new target stripe including a new target data stripe unit group and a new target parity stripe unit group, where: the new target data stripe unit group points to the first, second and third data stripe unit groups, or the new target data stripe unit group points to the target data stripe unit group and the third data stripe unit group; the new target parity stripe unit group owns a new target parity fragment group that has a parity relationship with the data set composed of the data fragment groups pointed to by the new target data stripe unit group; and storing the new target parity fragment group to the storage device group. The "pointing" here may be direct or indirect (if stripe unit group A points directly to stripe unit group B, and stripe unit group B points directly to stripe unit group C, then stripe unit group A points to stripe unit group C indirectly).
  • In a second aspect, a management apparatus for generating block groups is provided, including: an acquisition module configured to acquire a first source stripe from a first source block group, the first source stripe including a first data stripe unit group that owns a first data fragment group and a first parity stripe unit group that owns a first parity fragment group; the acquisition module being further configured to acquire a second source stripe from a second source block group, the second source stripe including a second data stripe unit group that owns a second data fragment group and a second parity stripe unit group that owns a second parity fragment group; a generation module configured to generate a target stripe, the target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the first data stripe unit group and the second data stripe unit group, and the target parity stripe unit group owns a target parity fragment group that has a parity relationship with the set formed by the first and second data fragment groups; and a storage module configured to store the target parity fragment group to a storage device group.
  • In a third aspect, a storage management device is provided, including: a storage medium for storing program instructions; and at least one processor coupled to the storage medium, the at least one processor being configured to execute, by running the computer program, the solution of the first aspect and its various possible implementations.
  • a computer program product including instructions, which, when run on a computer, causes the computer to execute the solution of the first aspect and the solutions of various possible implementation manners of the first aspect.
  • a computer-readable storage medium including instructions, which, when run on a computer, cause the computer to execute the solution of the first aspect and the solutions of various possible implementation manners of the first aspect.
  • In a sixth aspect, a method for generating block groups is provided, including: acquiring N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes a data stripe unit group that owns a data fragment group and a parity stripe unit group that owns a parity fragment group, N being equal to 2 or greater than 2; generating a target stripe, the target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to all the source data stripe unit groups, and the target parity stripe unit group owns a target parity fragment group that has a parity relationship with the set of all the source data fragment groups; and storing the target parity fragment group to a storage device group.
  • This solution can generate a new stripe from more than two block groups at once; its processing capability is therefore stronger than that of the first aspect.
  • In a seventh aspect, a management apparatus for generating block groups is provided, including: an acquisition module configured to acquire N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes a data stripe unit group that owns a data fragment group and a parity stripe unit group that owns a parity fragment group, N being greater than 2; a generation module configured to generate a target stripe, the target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the N source data stripe unit groups, and the target parity stripe unit group owns a target parity fragment group that has a parity relationship with the set of the N data fragment groups; and a storage module configured to store the target parity fragment group to a storage device group.
  • In an eighth aspect, a management apparatus for generating block groups is provided, including: a storage medium for storing program instructions; and at least one processor coupled to the storage medium, the at least one processor being configured to perform the following steps by running the computer program: acquiring N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes a data stripe unit group that owns a data fragment group and a parity stripe unit group that owns a parity fragment group, N being greater than 2; generating a target stripe, the target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to all the source data stripe unit groups, and the target parity stripe unit group owns a target parity fragment group that has a parity relationship with the set of all the source data fragment groups; and storing the target parity fragment group to the storage device group.
  • a computer program product including instructions, which, when run on a computer, cause the computer to execute the solution of the first aspect and the solutions of various possible implementation manners of the first aspect.
  • a computer-readable storage medium including instructions, which, when run on a computer, cause the computer to execute the solution of the first aspect and the solutions of various possible implementation manners of the first aspect.
  • Figure 1 is a schematic diagram of a centralized storage system.
  • Figure 2 is a schematic diagram of a block group.
  • Figure 3 is a schematic diagram of storing block groups in a hard disk group.
  • Figure 4 is a schematic diagram of storing block groups in a hard disk group after disk expansion.
  • Figure 5 is a schematic diagram of the logical relationships for generating a block group.
  • Figure 6 is a schematic diagram of the logical relationships for generating a block group.
  • Figure 7 is a flowchart of generating a block group.
  • Figure 8 is a schematic diagram of the correspondence between the columns of block groups.
  • Figure 9 is a schematic diagram of the logical relationships for generating a block group again.
  • Figure 10 is a schematic diagram of a management apparatus for generating block groups.
  • Figure 11 is a schematic diagram of a storage system architecture.
  • The embodiments of the present invention can be applied to an append-only centralized storage system or an append-only distributed storage system. Data written by appending can be deleted, but cannot be updated by overwriting. The embodiments are also applicable to storage systems that are not append-only.
  • In a centralized or distributed storage system, a storage device (a storage server or a hard disk) provides storage space in the form of chunks, and chunks from multiple storage devices form a chunk group.
  • Each chunk group includes one or more stripes; the chunks within the same stripe have an EC parity relationship, and the chunks of the same chunk group are distributed across the same set of storage devices.
  • In distributed storage, a logical chunk group can also be regarded as an append-only log, for example a plog.
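To keep the chunk / stripe / chunk-group vocabulary straight, the following toy data model may help; the class names and fields are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    disk_id: int    # chunks of the same stripe sit on different disks
    offset: int     # location of the chunk within the disk

@dataclass
class Stripe:
    data_units: list[Chunk]     # n data stripe units
    parity_units: list[Chunk]   # m parity stripe units

@dataclass
class ChunkGroup:
    group_id: int
    stripes: list[Stripe] = field(default_factory=list)  # one or more stripes
```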
  • Figure 1 shows a centralized storage system composed of a single storage device.
  • a distributed storage system may also be formed by multiple storage devices (each storage device includes multiple hard disks).
  • the storage system described in FIG. 1 includes a controller (not shown) and several hard disks 201.
  • the controller provides management functions for stripes, blocks, and block groups, and performs address conversion at different levels.
  • the hard disks 201 provide physical storage. The actual address of the storage space provided by the hard disk 201 is not directly exposed to the application host 100 (referred to as the host 100 for short).
  • the hard disk 201 can be of any type. In this implementation, a solid state disk is used as an example for illustration, but it is also applicable to mechanical hard disks or other types of hard disks.
  • the storage space of the hard disk 201 is logically divided into chunks by the controller to form a storage pool 204 .
  • the storage pool 204 is used to provide storage space to upper layer services, and the storage space actually comes from the hard disk 201 included in the system.
  • a plurality of logical chunks from different hard disks 201 form a chunk group, and the chunk group is the minimum allocation unit of the storage pool 204 .
  • the storage pool 204 can provide one or more block groups to the storage service layer.
  • the storage service layer further virtualizes the storage space provided by the block group into a logical unit (logical unit, LU) 206 for the host 100 to use.
  • Each logical unit has a unique logical unit number (logical unit number, LUN). Since the host 100 can directly perceive the logical unit number of the logical unit, those skilled in the art usually directly use LUN to refer to the logical unit.
  • LUN has a LUN ID for identifying the LUN. The specific location of the data in the LUN can be determined by the start address and the length (length) of the data. For the start address, those skilled in the art usually call it a logical block address (logical block address, LBA). It can be understood that the three factors of LUN ID, LBA and length identify a certain address segment.
  • the data access request generated by the host 100 usually carries LUN ID, LBA and length in the request.
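The addressing just described can be captured in a small sketch; the type and the field values below are hypothetical, since the patent only fixes the form of the addresses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HostAddress:
    lun_id: int   # which logical unit
    lba: int      # logical block address of the first block
    length: int   # extent of the address segment

# A read request carries this triple; the storage system later maps it to a
# management-device address of the form (block group ID, offset).
request = HostAddress(lun_id=3, lba=0x2000, length=4096)
```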
  • the first column of the block group corresponds to a chunk.
  • data fragments 0, 4, and 8 are stored in the same chunk, and data fragments 1, 5, and 9 are stored in another chunk.
  • the chunks are located on different hard disks.
  • the number of blocks contained in a block group depends on which mechanism (also known as redundancy mode) is used to ensure data reliability.
  • the storage system can store data with an erasure coding (EC) verification mechanism.
  • The EC verification mechanism is a RAID-style technology: the data to be stored is divided into at least two data fragments, and the parity fragments of those data fragments are calculated according to a certain parity algorithm. When a fragment is lost, the other data fragments and the parity fragments can be used to restore it.
  • a block group includes at least three blocks, and each block is located on a different hard disk 201 .
  • each storage device may provide storage space in a pool as shown in FIG. 1 , or may not provide storage space in a pool.
  • FIG. 2 shows a block group in which each row represents a stripe; the block group in FIG. 2 therefore includes three stripes: stripe A, stripe B and stripe C. Each stripe includes multiple stripe units, and each stripe unit corresponds to a chunk. The data fragments and parity fragments of the same stripe form a parity group, and a parity relationship exists between the fragments in the parity group.
  • the data (such as files) is split to obtain n data fragments, and the EC algorithm is used to calculate the data fragments to obtain m verification fragments (redundant fragments).
  • Different stripes have the same n and m.
  • These n+m fragments form a verification group, and the storage space utilization rate of this verification group is n/(n+m).
  • Accordingly, the storage space utilization of the stripe, and of the block group where the parity group is located, is also n/(n+m).
  • the process of generating parity slices is called encoding, and the process of recovering lost slices is called decoding.
  • the storage device here is, for example, a hard disk, a storage server, a storage site, and the like.
  • Shards of the same verification group are associated in the form of stripes, each verification group corresponds to a stripe, and each fragment corresponds to a stripe unit in the stripe.
  • In actual use, the number of storage devices owned by the storage system may change. For example: when the existing space of the storage system is insufficient, adding a new storage device increases the number of storage devices; when a failed storage device is removed, the number decreases; and when a new storage device replaces a failed one, the number increases again.
  • the increase of storage devices enables the storage system to support wider stripes.
  • The width of a stripe is described by its stripe width, and a larger stripe width brings higher utilization of storage space.
  • In FIG. 3, block group 1 (including 2 stripes) spans all the hard disks, so the width of stripe A1 is 4.
  • The ratio of the number of parity fragments to data fragments in block group 1 is 1:1, and its storage space utilization is 50%.
  • After expansion, new data can be written to the hard disks in the form of block group 2 (including 2 stripes); the width of every stripe in block group 2 is 6, so the stripe width of block group 2 can be considered to be 6.
  • The ratio of the number of parity fragments to data fragments in block group 2 (also known as the redundancy ratio) is 1:2, lower than that of block group 1, so its storage space utilization is 66.7%. The storage space utilization of block group 2 is thus higher than that of block group 1 (see the arithmetic sketch below).
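A quick arithmetic check of the two utilization figures, using the n and m values given above for the two block groups:

```python
def utilization(n_data: int, m_parity: int) -> float:
    """Storage-space utilization n / (n + m) of a parity group."""
    return n_data / (n_data + m_parity)

print(f"block group 1 (2+2): {utilization(2, 2):.1%}")  # 50.0%
print(f"block group 2 (4+2): {utilization(4, 2):.1%}")  # 66.7%
```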
  • Block group 1 includes two stripes, each composed of a data stripe unit group and a parity stripe unit group.
  • Specifically, each stripe of block group 1 has a data stripe unit group composed of 2 data stripe units and a parity stripe unit group composed of 2 parity stripe units.
  • Block group 2 includes 2 stripes, each composed of a data stripe unit group and a parity stripe unit group.
  • Specifically, each stripe of block group 2 has a data stripe unit group composed of 4 data stripe units and a parity stripe unit group composed of 2 parity stripe units.
  • This embodiment proposes an implementation in which stripes stored at a high redundancy ratio are converted to stripes with a low redundancy ratio.
  • Afterwards, the block groups of the system manage data in the same form as block group 2, and since the redundancy of block group 2 is lower, the utilization of the storage space is improved.
  • a source block group with a small width is used to generate a target block group with a large width.
  • A stripe of the target block group includes a parity stripe unit part and a data stripe unit part, where the data stripe unit part points to the data stripe units of the source stripes, and the parity stripe unit part points to parity fragments on the hard disks (or storage devices).
  • In other words, the target block group actually owns only the parity fragments and not the data fragments; through the pointing relationship, the data fragments owned by the source block groups can be reached from the target block group.
  • Step S11: after storage devices are added to the storage system, the management device of the storage system determines the block group A and the block group B to be merged together.
  • Block group A is a block group originally stored on the storage devices.
  • Block group B is either a block group originally in the storage system or a block group that has not yet been written to disk.
  • the storage system may also include a management device with computing capability.
  • For example, the management device is a controller (not shown in FIG. 1) or program code; by running the program code, the controller executes each step of the method embodiments of the present invention.
  • the management device can be integrated with the storage device.
  • the same server is both a storage device and a management device.
  • the management device can also be separated from the storage device: for example, the management device is a dedicated management server, and the storage device is a dedicated storage server.
  • a block group includes one or more stripes, and each stripe includes a plurality of stripe units forming an EC check relationship.
  • Each stripe unit corresponds to a slice
  • the data stripe unit corresponds to the data slice
  • the check stripe unit corresponds to the check slice.
  • the check relationship between stripe units is described in the metadata of the block group.
  • each block group can correspond to a separate piece of metadata.
  • the storage device is divided into multiple partitions, each block group corresponds to a partition, and each partition has metadata.
  • each block group corresponds to the same partition, that is, multiple block groups correspond to the same metadata.
  • the combination of all storage devices in the storage system is called a storage device group.
  • a stripe is stored in a storage device group
  • different slices in the same stripe are stored in different storage devices
  • different block groups have different block group IDs.
  • Block group A and block group B have the same number of stripes (when the numbers of stripes in block group A and block group B differ, all-zero stripes can be added to the block group with fewer stripes so that the two numbers become equal). The data fragments of block group A and the data fragments of block group B may be located on different storage devices. Comparing the stripes of block group A with the stripes of block group B, the numbers of data fragments may be the same or different; FIG. 5 takes the same number as an example.
  • Data fragments of block group A + data fragments of block group B = data fragments of block group C.
  • The dotted arrows in Figure 6 describe the generation principle of the parity fragments of block group C: the parity fragments of the target block group (block group C) are computed from the parity fragments of block group A and the parity fragments of block group B.
  • The numbers of parity fragments in block group A and block group B may be the same or different, as long as the parity fragments of block group C can be calculated (see the sketch below).
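The reason the source parities suffice is that erasure codes are linear. The sketch below demonstrates the identity with a single XOR parity: the parity over the merged data set equals the XOR of the two source parities. Multi-parity linear codes such as Reed-Solomon admit the same shortcut when the coding matrices are chosen consistently; the helper name is ours.

```python
def xor_all(frags: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length fragments."""
    out = bytearray(len(frags[0]))
    for f in frags:
        for i, b in enumerate(f):
            out[i] ^= b
    return bytes(out)

data_a = [b"\x01\x02", b"\x03\x04"]           # data fragments of stripe A1
data_b = [b"\x05\x06", b"\x07\x08"]           # data fragments of stripe B1
p_a, p_b = xor_all(data_a), xor_all(data_b)   # parities already on disk

p_c_cheap = xor_all([p_a, p_b])               # from the source parities only
p_c_full  = xor_all(data_a + data_b)          # recomputed from all the data
assert p_c_cheap == p_c_full                  # no data reads were needed
```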
  • the block group B and the block group A can be selected according to the block group C, and the block group C can also be determined according to the block group A and the block group B.
  • The block group B may be determined by selecting from existing block groups in the storage system. For example: when block group C requires that data fragments of the same stripe must not be located on the same storage device, then when block group B is selected, the storage devices holding the data fragments of block group B must not overlap with the storage devices holding the data fragments of block group A.
  • Alternatively, a block group B meeting the requirement on the redundancy ratio is generated in real time.
  • In this case, the data of block group B can come from the following situations: (1) when this step is executed, block group B is generated on demand from newly received write IOs; (2) write IOs received at any time, whose data is still temporarily held in the management device and has not yet been persistently stored on the hard disks of the storage device group; (3) data produced by garbage collection (GC) of the existing data of the storage device group.
  • For a block group B generated from such data, in this step the data fragments and parity fragments (or only the parity fragments) of block group B have not yet been stored on the hard disks and exist only in the memory of the management device.
  • For example, when the storage device group of the storage system is expanded from 6 storage devices to 8, in order to generate a 6+2 block group C from the existing 4+2 block group A, a 2+2 block group B is generated so that block group C can be formed exactly.
  • In addition, the data fragments of block group B may be required to be stored on storage devices different from those where the data fragments of block group A are located, to improve reliability.
  • A stripe of block group C is obtained jointly from a stripe of block group A and a stripe of block group B.
  • This process can be described as merging a stripe of block group A and a stripe of block group B to obtain a stripe of block group C.
  • In the embodiment of the present invention, block group C is obtained from block group A and block group B as follows: the first stripe (stripe C1) of block group C is obtained from the first stripe (stripe A1) of block group A and the first stripe (stripe B1) of block group B; the second stripe (stripe C2) of block group C is obtained from the second stripe (stripe A2) of block group A and the second stripe (stripe B2) of block group B.
  • Step S12: the management device calculates and generates the parity fragments of block group C, and stores the calculated parity fragments into the storage device group.
  • After step S12, there are three block groups in the storage system: block group A, block group B and block group C.
  • Block group A and block group B own data fragments but may no longer actually own parity fragments; block group C may actually own only parity fragments and no data fragments.
  • Block group C includes multiple stripes; only the generation of stripe C1 is introduced here, since stripe C2 is generated in the same way, the details are not repeated.
  • There are several ways to calculate the parity fragments (parity fragment 41 and parity fragment 42) of stripe C1, including: (1) computing them from the parity fragments of stripe A1 (parity fragment 13 and parity fragment 14) and the parity fragments of stripe B1 (parity fragment 33 and parity fragment 34); (2) computing them from the parity fragments of stripe A1 (parity fragment 13 and parity fragment 14) and the data fragments of stripe B1 (data fragment 31 and data fragment 32); (3) computing them from the data fragments of stripe A1 (data fragment 11 and data fragment 12) and the parity fragments of stripe B1 (parity fragment 33 and parity fragment 34); (4) computing them from the data fragments of stripe A1 (data fragment 11 and data fragment 12) and the data fragments of stripe B1 (data fragment 31 and data fragment 32).
  • When stripe A1 is data that already exists on the storage devices and the data of stripe B1 comes from write IOs or GC that have not yet been written to disk, method (2) or (4) above can be used to generate the parity fragments of stripe C1 directly from the data fragments of stripe B1, instead of first calculating the parity fragments of stripe B1, thereby reducing the computation load of the management device.
  • The data fragments in the existing stripes A1 and B1, together with the parity fragments in the newly generated stripe C1, jointly constitute the fragment sources of stripe C1.
  • It is not the case that stripes C1, A1 and B1 each own complete data fragments and parity fragments.
  • Stripe C1 actually owns only the parity fragments and not the data fragments; stripes A1 and B1, on the contrary, own only the data fragments and not the parity fragments.
  • However, stripe C1 "indirectly" owns the data fragments, and stripes A1 and B1 "indirectly" own the parity fragments.
  • In this process, the logical address of each fragment can be described by the management device as "block group ID + offset" (the management device address).
  • For example, the parity fragments of stripe A1 and stripe B1 can be located on the hard disks through metadata and read into the memory of the management device, and the processor of the management device then calculates and generates the parity fragments of stripe C1.
  • the arrangement of the stripe units conforms to the matrix.
  • A row in the matrix is a stripe, and a column in the matrix consists of the stripe units in the same position of different stripes.
  • the concepts of row and column are specifically introduced below with reference to FIG. 2 and FIG. 5 .
  • the block group shown in FIG. 2 includes a total of 12 strip units distributed in 3 rows and 4 columns.
  • For example, data slice 11 and data slice 15 are owned by the stripe units in the first column of block group A; parity slice 43 is owned by the stripe unit in the second row and fifth column of block group C, while the stripe units in row 2, columns 1-4 of block group C own no fragments.
  • Step S12 also includes generating the metadata of block group C.
  • The metadata of block group C records the IDs of the block groups associated with block group C, that is, the block group ID of block group A and the block group ID of block group B.
  • The metadata of block group C also records: the correspondence between the columns of block group C and the columns of block group A, and the correspondence between the columns of block group C and the columns of block group B.
  • In other words, the metadata of block group C records the correspondence between the columns of block group C and the columns of block groups A and B.
  • Such a correspondence may involve only the data columns, not the parity columns.
  • Figure 8 is a schematic diagram of the correspondence between the columns recorded in the metadata.
  • Specifically, columns 1 and 2 of block group C correspond to columns 1 and 2 of block group A respectively, and columns 3 and 4 of block group C correspond to columns 1 and 2 of block group B respectively.
  • The parity columns of block group C are calculated from the parity columns of block group A and the parity columns of block group B.
  • This calculation method can be recorded in the metadata, or the corresponding relationship can be found through the metadata of the block groups; the calculation can be described by a formula.
  • Columns can be marked with chunk IDs: different chunk IDs represent different columns, and the correspondence between chunk IDs is used to describe the correspondence between columns.
  • The n rows of block group C correspond to the n rows of block group A and the n rows of block group B. It should be noted that each row of a block group is an independent stripe, so row and stripe are the same concept.
  • The correspondence between any stripe unit in block group C and a stripe unit in block group A or block group B is described by the metadata of block group C, or can be found through the metadata of block group C.
  • In Figure 8, the solid arrows exemplarily introduce the correspondence between the stripe units of the first row of block group C, the stripe units of the first row of block group A, and the stripe units of the first row of block group B.
  • The correspondences between block group C and the other stripes of block group A and block group B follow the same logic and are not repeated.
  • Stripe C1 in block group C corresponds to stripe A1 in block group A and stripe B1 in block group B.
  • The stripe unit in the first column of stripe C1 corresponds to the stripe unit in the first column of stripe A1.
  • The stripe unit in the second column of stripe C1 corresponds to the stripe unit in the second column of stripe A1.
  • The stripe unit in the third column of stripe C1 corresponds to the stripe unit in the first column of stripe B1.
  • The stripe unit in the fourth column of stripe C1 corresponds to the stripe unit in the second column of stripe B1.
  • Block group C does not actually own data fragments, but this correspondence can be used for subsequent reads of data fragments. For example, to read the data in the second data stripe unit of stripe C1 in block group C:
  • the correspondence is used to find the stripe unit that actually owns the data slice, namely the second stripe unit of stripe A1 in block group A;
  • then, through the mapping between the management device address of the stripe unit and its on-disk logical address recorded in the metadata of block group A, the corresponding data slice (data slice 12) is read out.
  • The metadata further records, for each parity stripe unit of block group C, the mapping between its management device address and its on-disk logical address, so the corresponding parity slices (parity slice 41 and parity slice 42) can be read directly when necessary (see the sketch below).
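The column correspondence and the address mappings can be pictured as a small metadata table. The dict layout and the disk addresses below are invented for illustration; only the mappings themselves come from the text.

```python
metadata = {
    # on-disk logical addresses of the columns each block group actually owns
    "A": {"columns": {1: ("disk1", 0x100), 2: ("disk2", 0x200)}},
    "B": {"columns": {1: ("disk3", 0x300), 2: ("disk4", 0x400)}},
    "C": {
        # data columns of C point at columns of A and B (Figure 8)
        "data_columns": {1: ("A", 1), 2: ("A", 2), 3: ("B", 1), 4: ("B", 2)},
        # parity columns of C are owned directly and map straight to disk
        "parity_columns": {5: ("disk5", 0x500), 6: ("disk6", 0x600)},
    },
}

def resolve_data_column(column: int) -> tuple[str, int]:
    """Find the on-disk address of a data column of block group C."""
    group, src_col = metadata["C"]["data_columns"][column]
    return metadata[group]["columns"][src_col]

assert resolve_data_column(2) == ("disk2", 0x200)  # column 2 of C -> column 2 of A
```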
  • In step S12, the metadata of block group A and block group B is also updated.
  • After the merge, the parity fragments owned by block group A and block group B can be deleted to save storage space (to preserve the appearance of the stripes, the original parity stripe units in block group A and block group B can be retained); correspondingly, the parts of the metadata of block group A and block group B concerning the parity fragments need to be updated.
  • As mentioned above, the metadata of block group A and block group B records the correspondence between the management device address of each stripe unit and its on-disk logical address.
  • The updated metadata indicates that block group C provides the parity for the data stripe units of block group A and block group B.
  • In other words, each stripe of block group C provides the parity of the data stripe unit group in the corresponding stripe of block group A and the parity of the data stripe unit group in the corresponding stripe of block group B. Therefore, when the storage system receives a read request from the host and needs to verify data of block group A or block group B, it can learn from the metadata of block group A or block group B that the parity capability of the two block groups is provided by block group C.
  • Step S13 When the management device receives the read request from the host, it reads the corresponding data fragment.
  • the following is a brief introduction to the implementation of reading data.
  • There are four possible cases, introduced one by one below, taking data fragments located in stripe A1 and stripe C1 as examples. If the data slice the host wants to read is located in stripe A2, stripe B1 or stripe B2, the principle is the same and is not repeated here.
  • In each case, the host uses the host address (LUN ID + LBA + length) to access the storage system, and the storage system converts the host address into the management device address (block group ID + offset).
  • Using the in-disk logical address (disk ID + in-disk LBA) corresponding to the management device address recorded in the metadata of block group A, the data is read directly from the corresponding hard disk.
  • That is, the storage system obtains the on-disk logical address directly mapped from the management device address of the first-column stripe unit of stripe A1, uses this in-disk logical address to read data slice 11 directly, and returns it to the host (see the sketch below).
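The two translations of the direct-read path amount to two lookups; a sketch with placeholder values:

```python
host_to_mgmt = {
    # (LUN ID, LBA) -> management-device address (block group ID, offset)
    (3, 0x2000): ("A", 0),
}
mgmt_to_disk = {
    # management-device address -> in-disk logical address (disk ID, in-disk LBA)
    ("A", 0): (1, 0x8000),
}

def direct_read_address(lun_id: int, lba: int) -> tuple[int, int]:
    """Host address -> management address -> in-disk address, as in a direct read."""
    return mgmt_to_disk[host_to_mgmt[(lun_id, lba)]]

assert direct_read_address(3, 0x2000) == (1, 0x8000)  # where data slice 11 lives
```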
  • If data slice 11 is lost, the read fails, and the process enters a degraded read.
  • As described above, the redundancy protection capability of block group A is provided by block group C.
  • Specifically, the data protection capability of stripe A1 of block group A is provided by stripe C1 of block group C. Therefore, reading data slice 11 is changed into restoring the data slice corresponding to the first-column stripe unit of stripe C1.
  • Specifically, the storage system reads the fragments in columns 2-6 of stripe C1 according to the metadata of block group C, uses the parity algorithm to obtain data slice 11, and returns data slice 11 to host 100. Since the content involved is similar to the situation described in (4) below, the process can also refer to the description in (4).
  • As mentioned above, block group C does not directly record the on-disk logical addresses of the fragments in columns 2-4 of stripe C1, so these three columns of data fragments are obtained through stripe A1 and stripe B1; the fragments in columns 5-6 of stripe C1 (the parity fragments) are read directly using the on-disk logical addresses recorded in the metadata of block group C.
  • Block group C does not directly own data slices, but through the metadata of block group C the stripe units of block group C can be mapped to the stripe units that own the data slices, so the data slices can be read out.
  • Suppose what the host requests to read is the stripe unit in the first column of stripe C1 in block group C.
  • The metadata of block group C records that the first row of block group C corresponds to the first row of block group A, and that the first column of block group C corresponds to the first column of block group A.
  • The storage system can then learn from the metadata of block group A the on-disk logical address corresponding to the management device address of the first-column stripe unit of stripe A1, and obtain data slice 11 by relying on this on-disk logical address.
  • Suppose the storage system reads data fragments according to the first-column stripe unit in stripe A1 of block group A, but the read fails due to the loss of data slice 11.
  • The storage system then performs a data recovery operation.
  • Specifically, the data fragments in stripe C1 of block group C, namely data slice 12, data slice 31 and data slice 32, are read in a way similar to reading the first-column stripe unit of stripe C1 in block group C; then, parity slice 41 and parity slice 42 are read out through the on-disk logical addresses recorded in the metadata of block group C.
  • Data slice 11 is recovered from data slice 12, data slice 31, data slice 32, parity slice 41 and parity slice 42, and returned to the host (see the sketch below). Additionally, the data slice 11 obtained through data recovery can also be stored back to the hard disk.
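This degraded read relies on the same parity relationship; a simplified sketch with a single XOR parity standing in for the real two-parity code of stripe C1:

```python
def xor_all(frags: list[bytes]) -> bytes:
    out = bytearray(len(frags[0]))
    for f in frags:
        for i, b in enumerate(f):
            out[i] ^= b
    return bytes(out)

slice11, slice12 = b"\x11\xaa", b"\x12\xbb"   # data slices of stripe A1
slice31, slice32 = b"\x31\xcc", b"\x32\xdd"   # data slices of stripe B1
parity = xor_all([slice11, slice12, slice31, slice32])  # parity of stripe C1

# Data slice 11 is unreadable: XOR the surviving slices with the parity.
rebuilt = xor_all([slice12, slice31, slice32, parity])
assert rebuilt == slice11
```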
  • Through the above design, read requests from the host for any data slice in block group A, block group B or block group C can be executed successfully.
  • Step S14 referring to FIG. 9 , the block group C generated in step S12 can be merged with the block group D again to generate a new block group E.
  • Block group D may be a block group generated by the method provided by the embodiment of the present invention, a pre-existing block group not generated by this method, or a newly generated block group that has not yet been stored to the storage devices.
  • For the specific process of generating block group E from block group C and block group D, and for the process of reading data in block group E, refer to steps S11-S13.
  • A difference is that, when reading the data fragment of the first stripe unit of the first row of block group E (stripe E1), it is necessary to first look up, according to the metadata of block group E, the block group C corresponding to block group E; then, according to the metadata of block group C, look up the block group A corresponding to block group C; and then obtain the corresponding data slice according to the on-disk logical address recorded in the metadata of block group A.
  • For reading the other data fragments the principle is the same, with only a few adaptive adjustments (such as an additional lookup of the correspondence between management device addresses); for brevity, the details are not given here.
  • Optionally, the lookup of block group C can also be avoided by having block group E point directly to block group A and block group B.
  • In that case, when reading the first row of block group E (stripe E1), the block group A corresponding to block group E is looked up directly, without going through block group C, and the corresponding data slice is obtained according to the on-disk logical address recorded in the metadata of block group A.
  • After block group E is generated in this way, the parity fragments of block group C can be deleted, thereby reducing lookup steps and saving storage space (see the sketch below).
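The two read designs for the re-merged block group E (following the chain E to C to A, versus flattening E to point at A directly) can be compared in a short sketch; the pointer table is hypothetical.

```python
pointers = {
    "E": {1: ("C", 1)},   # column 1 of E points at column 1 of C
    "C": {1: ("A", 1)},   # column 1 of C points at column 1 of A
    "A": {},              # A actually owns its slices: the chain ends here
}

def resolve(group: str, column: int) -> tuple[str, int]:
    """Follow pointer metadata until reaching the group that owns the slice."""
    while pointers[group]:
        group, column = pointers[group][column]
    return group, column

assert resolve("E", 1) == ("A", 1)   # chained lookup: E -> C -> A

# Flattening: rewrite E's pointers to the terminal owners once; afterwards
# reads skip block group C entirely and its parity fragments can be deleted.
pointers["E"] = {col: resolve(*dst) for col, dst in pointers["E"].items()}
assert pointers["E"][1] == ("A", 1)
```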
  • the present invention also provides a management device for generating block groups.
  • The management device may be a hardware device (such as a controller or a dedicated management server) or a program product run by at least one processor.
  • The management device includes: an acquisition module 71, a generation module 72 and a storage module 73; when external data-read services are provided, the management device also includes a receiving module 74.
  • The management device can execute the above method; since the method has been described in detail in the method embodiments, it is only briefly described here.
  • The acquisition module 71 is configured to acquire a first source stripe from a first source block group, the first source stripe including: a first data stripe unit group owning a first data fragment group, and a first parity stripe unit group owning a first parity fragment group; the acquisition module 71 is further configured to acquire a second source stripe from a second source block group, the second source stripe including: a second data stripe unit group owning a second data fragment group, and a second parity stripe unit group owning a second parity fragment group.
  • The generation module 72 is configured to generate a target stripe, the target stripe including: a target data stripe unit group and a target parity stripe unit group, where: the target data stripe unit group points to the first data stripe unit group and the second data stripe unit group, and the target parity stripe unit group owns a target parity fragment group, a parity relationship existing between the target parity fragment group and the set formed by the first and second data fragment groups.
  • a storage module 73 configured to store the target verification slice group in a storage device group.
  • The storage device group is, for example, a combination of hard disks or a combination of storage servers.
  • the target data stripe group does not own data fragments.
  • Optionally, the storage module 73 is specifically configured to: store the parity fragments of the target stripe on the storage devices newly added to the storage device group by the expansion.
  • Optionally, the second source block group is a block group that has not yet been stored to the storage devices, and the storage module 73 is specifically configured to: store the parity fragments of the target stripe on the storage devices newly added after the storage system is expanded.
  • the second data stripe group is generated by newly written data, or the second data stripe group is generated by garbage collected data; the data fragments of the second source block group are stored in the storage device.
  • redundancy ratios of the first source stripe and the second source stripe are the same.
  • the target stripe belongs to a target block group, and the first source block group, the second source block group and the target block group have the same number of stripes.
  • Optionally, the storage module 73 is further configured to: according to the correspondence between the target stripe and the first and second source stripes, ensure that the data fragments of the first source stripe and the data fragments of the second source stripe do not need to be written into the target stripe.
  • the management device also includes a receiving module 74 .
  • The receiving module 74 is configured to receive a read request for a target stripe unit in the target stripe; according to the data stripe unit group pointed to by the target data stripe unit group, the stripe unit corresponding to the target stripe unit is determined; correspondingly, the acquisition module 71 is further configured to acquire the data fragment owned by the determined stripe unit.
  • When a data fragment in the first data fragment group or the second data fragment group is lost, the acquisition module 71 is configured to: read the target parity fragment group, read the data fragments not lost in the first source stripe and the second source stripe, and reconstruct the lost data fragment.
  • Optionally, the block group where the target stripe is located is a third block group, where: the acquisition module 71 is further configured to acquire a fourth source stripe from a fourth source block group, the fourth source stripe including: a fourth data stripe unit group owning a fourth data fragment group, and a fourth parity stripe unit group owning a fourth parity fragment group; the generation module 72 is further configured to generate a fourth stripe, the fourth stripe including: a data stripe unit group and a parity stripe unit group, where: the data stripe unit group of the fourth stripe points to the target data stripe unit group and the fourth data stripe unit group, and the parity stripe unit group of the fourth stripe owns a fourth parity fragment group, a parity relationship existing between the fourth parity fragment group and the set formed by the target data fragment group and the fourth data fragment group.
  • The acquisition module 71 is configured to acquire N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes: a data stripe unit group owning a data fragment group, and a parity stripe unit group owning a parity fragment group, N being equal to 2 or greater than 2.
  • The generation module 72 is configured to generate a target stripe, the target stripe including: a target data stripe unit group and a target parity stripe unit group, where: the target data stripe unit group points to the N source data stripe unit groups, and the target parity stripe unit group owns a target parity fragment group, a parity relationship existing between the target parity fragment group and the set of the N data fragment groups.
  • The storage module 73 is configured to store the target parity fragment group to a storage device group.
  • the storage device introduced in FIG. 1 also has the function of a management device.
  • In this case, the storage device includes a storage medium (not shown in the figure), a processor (not shown in the figure), and a hard disk group for persistently storing the parity fragments of block group A, the parity fragments of block group B, and the parity fragments of block group C.
  • FIG. 11 is a schematic diagram of a storage system architecture including a management device.
  • The management device 81 includes a storage medium 811 and a processor 812; the storage medium 811 (such as a ROM or an SSD) is used to store program instructions, and at least one processor 812 is coupled to the storage medium 811 and is configured to execute the aforementioned method steps by running the computer program.
  • Optionally, the role of the management device 81 may be served by the storage device 82; that is to say, the same device plays both the role of the management device and the role of the storage device.
  • the present invention also provides an embodiment of a computer program product containing instructions, which when run on a computer, causes the computer to execute the solution of the first aspect and the solutions of various possible implementations of the first aspect.
  • the present invention also provides an embodiment of a computer-readable storage medium, including instructions, which, when run on a computer, cause the computer to execute the solution of the first aspect and the solutions of various possible implementations of the first aspect.
  • One or more of the above modules or units can be realized by software, hardware, or a combination of both.
  • the software exists in the form of computer program instructions and is stored in the storage device, and the processor can be used to execute the program instructions and implement the above method flow.
  • The processor may include, but is not limited to, at least one of the following: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller unit (MCU), or an artificial intelligence processor, that is, computing devices that run software. Each computing device may include one or more cores for executing software instructions to perform calculations or processing.
  • The processor may be embedded in a system on chip (SoC) or an application-specific integrated circuit (ASIC), or may be an independent semiconductor chip.
  • In addition to the core used to execute software instructions for calculation or processing, the processor may further include necessary hardware accelerators, such as a field programmable gate array (FPGA), a programmable logic device (PLD), or a logic circuit implementing dedicated logic operations.
  • When realized by hardware, the hardware can be any one or any combination of a CPU, a microprocessor, a DSP, an MCU, an artificial intelligence processor, an ASIC, an SoC, an FPGA, a PLD, a dedicated digital circuit, a hardware accelerator, or a non-integrated discrete device, and it can run the necessary software, or not depend on software, to perform the above method flows.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Stored Programmes (AREA)

Abstract

A method for generating block groups: after a storage system is expanded, multiple existing small-width source block groups in the storage system are used to generate a large-width target block group. In a stripe of the target block group, the parity stripe unit part points to parity fragments, while the data stripe unit part does not point to data fragments directly but to the data stripe units of the source stripes; the target block group therefore actually owns only the parity fragments and does not actually own the data fragments.

Description

Method, apparatus and device for generating block groups

Technical Field

The present invention relates to the field of storage technology, and in particular to a method, apparatus and system for generating block groups.

Background

A widely used storage method is to store data, in the form of block groups, in a storage system that includes multiple storage devices; a block group contains multiple fragments related to each other by an erasure code (EC). In EC technology, the original data is split into n data fragments, m parity fragments are generated from the n data fragments, and the association between them is established in the form of a stripe; the n+m fragments in the stripe form a data parity group. The different fragments of a stripe are stored on different storage devices; when fewer than m data fragments are lost, the lost data fragments can be recovered from the fragments that were not lost.

As enterprises continuously produce new data, the storage space that the original storage devices in a storage system can provide may become insufficient, and repeatedly expanding the storage system is work that enterprises routinely face. After an expansion, the storage system can usually adopt a more space-efficient redundancy ratio to improve the utilization of the storage space of the storage devices. However, the data stored before the expansion remains at the old redundancy ratio.

In order to improve the utilization of the storage space of the entire storage system, one approach provided by the prior art is to read the data fragments of the block groups at the old redundancy ratio out of the storage devices and rewrite them to the storage devices, as new data, at the new redundancy ratio, forming new block groups. In this approach, however, a large number of data fragments must be read out and then written into the storage devices again, which burdens the performance of the storage system.
发明内容
第一方面,提供一种生成块组的方法,包括:从第一源块组中获取第一源分条,所述第一源分条包括:拥有第一数据分片组的第一数据分条单元组,以及拥有第一校验分片组的第一校验分条单元组;从第二源块组中获取第二源分条,所述第二源分条包括:拥有第二数据分片组的第二数据分条单元组,以及拥有第二校验分片组的第二校验分条单元组;生成目标分条,所述目标分条包括:目标数据分条单元组和目标校验分条单元组,其中:所述目标数据分条单元组指向所述第一数据分条单元组和所述第二数据分条单元组,所述目标校验分条单元组拥有目标校验分片组,其中,所述目标校验分片组与所述第一、第二数据分片组所组成的集合之间存在校验关系;存储所述目标校验分片组到存储设备组。
分条单元组是分条单元的集合,同一个分条中的所有数据分条单元构成这个分条的数据分条单元组,同一个分条中的所有校验分条单元构成这个分条的数据分条单元组。因此,数据分条单元组+校验分条单元组=分条。
存在校验关系的多个分片中,当少数分片丢失时,可以利用余下的分条使用校验算法对丢失的分片进行恢复。因此,目标校验分片组与所述第一、第二数据分片组所组成的集合之间存在校验关系意味着,由目标校验分片组、第一数据分片组以及第二数据分片组所组成的分片集合中,有少量分片丢失时并不会造成数据的真正的丢失,可以由集合中其余分片进行通过数据重构对丢失的分片进行恢复。能够重构的数量由校验算法决定,在n个数据分片和m个校验分片组成的校验关系集合中,至多可以一次性恢复处不超过m个分片。
该方案通过第一、第二块组生成新的块组,从而相当于实现了块组的合并。此外,可以在不再依赖第一源块组的和第二源块组的校验分片的前提下,保持对第一数据分片组与第二数据分片组的保护,因此可以节约存储空间。
该方案可以适用于对存储系统进行扩容的场景,在存储系统中的存储设备组增加后,通过该方案扩容后提高存储设备组的利用率。
In a first possible implementation of the first aspect, the first source block group and the second source block group are both located in the storage device group as it existed before the expansion of the storage system, and the step of storing the target parity shard group specifically includes: storing the parity shards of the target stripe to the storage devices newly added to the storage device group by the expansion.

When all the data shards reside in the old storage devices, storing the parity shards of the target stripe to the newly added storage devices keeps the parity shards from sharing a storage device with the data shards, thereby further reducing the possibility of data loss.

In a second possible implementation of the first aspect, the first source block group is located in the storage device group before the expansion, and the second source block group is a block group not yet written to the storage device group; the step of storing the target parity shard group specifically includes: storing the parity shards of the target stripe to the storage devices newly added by the expansion.

When some data shards are in the old storage devices while the others (generated from newly written data or from garbage-collected data) have not yet been flushed to disk, storing the parity shards of the target stripe to the newly added storage devices likewise keeps parity shards and data shards from sharing a storage device, thereby further reducing the possibility of data loss.

In a third possible implementation of the first aspect: a read request for a target stripe unit in the target stripe is received; according to the data stripe unit group pointed to by the target data stripe unit, the stripe unit corresponding to the target stripe unit is determined, the determined stripe unit belonging to the first data stripe unit group or the second data stripe unit group; and the data shard owned by the determined stripe unit is acquired.

This implementation provides a flow for reading data shards directly.

In a fourth possible implementation of the first aspect, when a data shard in the first data shard group or the second data shard group is lost, the target parity shard group is read, the surviving data shards of the first source stripe and the second source stripe are read, and the lost data shard is reconstructed.

This implementation provides a flow for reconstructing a lost data shard when it cannot be read directly.

The first aspect describes merging two source stripes into a new stripe. In a fifth possible implementation of the first aspect, the target stripe generated in the first aspect can be combined with a stripe of another block group to generate yet another block group. To distinguish it from the original target stripe, the stripe produced by this further merge is called the new target stripe, and the block group containing the target stripe is called the target block group. The method further includes: acquiring a third source stripe from a third source block group, the third source stripe including a third data stripe unit group that owns a third data shard group, and a third parity stripe unit group that owns a third parity shard group; generating a new target stripe from the target stripe and the third source stripe, the new target stripe including a new target data stripe unit group and a new target parity stripe unit group, where: the new target data stripe unit group points to the first data stripe unit group, the second data stripe unit group and the third data stripe unit group, or the new target data stripe unit group points to the target data stripe unit group and the third data stripe unit group; the new target parity stripe unit group owns a new target parity shard group, a parity relationship existing between the new target parity shard group and the data set formed by the data shard groups pointed to by the new target data stripe unit group; and storing the new target parity shard group to the storage device group. 'Points to' here can be direct or indirect (if stripe unit group A points directly to stripe unit group B, and B points directly to C, then A points to C indirectly).
In a second aspect, a management apparatus for generating a block group is provided, the apparatus including: an acquiring module configured to acquire a first source stripe from a first source block group, the first source stripe including a first data stripe unit group that owns a first data shard group and a first parity stripe unit group that owns a first parity shard group; the acquiring module further configured to acquire a second source stripe from a second source block group, the second source stripe including a second data stripe unit group that owns a second data shard group and a second parity stripe unit group that owns a second parity shard group; a generating module configured to generate a target stripe, the target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the first and second data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set formed by the first and second data shard groups; and a storage module that stores the target parity shard group to a storage device group.

In a third aspect, a storage management device is provided, including: a storage medium configured to store program instructions; and at least one processor coupled to the storage medium, the at least one processor being configured to execute, by running the program instructions, the method of the first aspect and of its various possible implementations.

In a fourth aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the solution of the first aspect and of its various possible implementations.

In a fifth aspect, a computer-readable storage medium is provided, including instructions which, when run on a computer, cause the computer to execute the solution of the first aspect and of its various possible implementations.

The second to fifth aspects above have the beneficial effects of the first aspect and of its various implementations.

In a sixth aspect, a method for generating a block group is provided, the method including: acquiring N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes a data stripe unit group that owns a data shard group and a parity stripe unit group that owns a parity shard group, N being equal to 2 or greater than 2; generating a target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to all the source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of all the source data shard groups; and storing the target parity shard group to a storage device group.

This solution can generate a new stripe from more than two block groups at once, so its processing capability is stronger than that of the first aspect.

In a seventh aspect, a management apparatus for generating a block group is provided, including: an acquiring module configured to acquire N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes a data stripe unit group that owns a data shard group and a parity stripe unit group that owns a parity shard group, N being greater than 2; a generating module configured to generate a target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the N source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of the N data shard groups; and a storage module configured to store the target parity shard group to a storage device group.

In an eighth aspect, a management apparatus for generating a block group is provided, including a storage medium configured to store program instructions, and at least one processor coupled to the storage medium, the at least one processor being configured to perform the following steps by running the program instructions: acquiring N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes a data stripe unit group that owns a data shard group and a parity stripe unit group that owns a parity shard group, N being greater than 2; generating a target stripe including a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to all the source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of all the source data shard groups; and storing the target parity shard group to the storage device group.
In a ninth aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the solution of the sixth aspect and of its various possible implementations.

In a tenth aspect, a computer-readable storage medium is provided, including instructions which, when run on a computer, cause the computer to execute the solution of the sixth aspect and of its various possible implementations.
The seventh to tenth aspects above have the beneficial effects of the sixth aspect and of its various implementations.
Brief Description of the Drawings

Figure 1 is a schematic diagram of a centralized storage system.
Figure 2 is a schematic diagram of a block group.
Figure 3 is a schematic diagram of storing block groups in a disk group.
Figure 4 is a schematic diagram of storing block groups in the disk group after disk expansion.
Figure 5 is a schematic diagram of the logical relationships in generating a block group.
Figure 6 is a schematic diagram of the logical relationships in generating a block group.
Figure 7 is a flowchart of generating a block group.
Figure 8 is a schematic diagram of the correspondence between columns of block groups.
Figure 9 is a schematic diagram of the logical relationships in generating a block group again.
Figure 10 is a schematic diagram of a management apparatus for generating block groups.
Figure 11 is a schematic diagram of a storage system architecture.
Detailed Description

Embodiments of the present invention can be applied to append-only centralized storage systems or append-only distributed storage systems. Append-only data can be deleted but cannot be updated by overwriting. The embodiments are equally applicable to storage systems that are not append-only.

In a centralized or distributed storage system, storage devices (for example storage servers or hard disks) provide storage space in the form of chunks, and chunks from multiple storage devices form a chunk group (block group). Each block group includes one or more stripes; the chunks within one stripe have an EC parity relationship, and the chunks of one block group are distributed over the same set of storage devices. In distributed storage, a logical block group can also be regarded as an append-only log, for example a plog.

Figure 1 shows a centralized storage system composed of a single storage device. In other embodiments, multiple storage devices (each containing multiple hard disks) can instead form a distributed storage system. The storage system of Figure 1 contains a controller (not shown) and several hard disks 201: the controller manages stripes, chunks and block groups and translates between the address layers, while the hard disks 201 provide the physical storage space. The actual addresses of the space provided by the hard disks 201 are not exposed directly to the application host 100 (host 100 for short). The hard disks 201 can be of any type; this embodiment takes solid-state drives as an example but applies equally to mechanical hard disks or other disk types.

The controller organizes the storage space of the hard disks 201 into logical chunks that form a storage pool 204. The storage pool 204 provides storage space to upper-layer services; the space actually comes from the hard disks 201 contained in the system. Multiple logical chunks from different hard disks 201 form a chunk group, the minimum allocation unit of the storage pool 204. When the storage service layer requests storage space from the storage pool 204, the pool can provide one or more chunk groups. The storage service layer further virtualizes the space provided by the chunk groups into logical units (LU) 206 for the host 100. Each logical unit has a unique logical unit number (LUN); since the host 100 directly perceives the logical unit number, those skilled in the art usually refer to a logical unit simply as a LUN. Each LUN has a LUN ID that identifies it. The exact location of data within a LUN is determined by a start address and the length of the data; the start address is usually called the logical block address (LBA). The three factors LUN ID, LBA and length thus identify a determinate address segment, and a data access request generated by the host 100 usually carries the LUN ID, LBA and length.
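The two address layers just described can be sketched as follows; HostAddress, ManagementAddress and the lun_map table are hypothetical names for illustration, not structures defined by this disclosure:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HostAddress:
    lun_id: int   # which logical unit
    lba: int      # logical block address inside the LUN
    length: int   # length of the addressed data

@dataclass(frozen=True)
class ManagementAddress:
    chunk_group_id: int  # which chunk group in the storage pool
    offset: int          # offset inside that chunk group

def to_management_address(addr: HostAddress,
                          lun_map: dict[int, list[tuple[int, int]]]) -> ManagementAddress:
    """Resolve (LUN ID, LBA) to (chunk group ID, offset).

    lun_map is a hypothetical per-LUN list of (chunk_group_id, group_size)
    extents laid out back to back, standing in for the storage service
    layer's real mapping."""
    remaining = addr.lba
    for chunk_group_id, group_size in lun_map[addr.lun_id]:
        if remaining < group_size:
            return ManagementAddress(chunk_group_id, remaining)
        remaining -= group_size
    raise ValueError("LBA beyond the space allocated to this LUN")

lun_map = {7: [(100, 1 << 20), (101, 1 << 20)]}   # LUN 7 = two 1 MiB chunk groups
print(to_management_address(HostAddress(7, (1 << 20) + 4096, 512), lun_map))
# ManagementAddress(chunk_group_id=101, offset=4096)
```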
In Figure 1, each column of the block group corresponds to one chunk: for example, data shards 0, 4 and 8 are stored in one chunk and data shards 1, 5 and 9 in another, the two chunks residing on different hard disks.

How many chunks a block group contains depends on the mechanism (also called the redundancy mode) used to guarantee data reliability. Usually, to guarantee reliability, the storage system stores data under an erasure coding (EC) parity mechanism. The EC parity mechanism is a RAID technology in which the data to be stored is divided into at least two data shards and parity shards of those data shards are computed with a certain parity algorithm; when a data shard is lost, the other data shards together with the parity shards can recover the data. Under the EC parity mechanism, a block group contains at least three chunks, each on a different hard disk 201.

The above describes the centralized scenario; the invention applies equally to distributed scenarios. When the hard disks 201 of Figure 1 are distributed over multiple storage devices (for example storage servers) or sites, the collection of storage devices becomes a distributed storage system. In a distributed storage system, different stripe units of one stripe reside on different storage devices; or some stripe units of one stripe reside on different hard disks of the same storage device, but never do all the stripe units concentrate in a single storage device. In a distributed storage system, each storage device may provide storage space as a pool, as in Figure 1, or without a pool.

Different block groups are distinguished by different block group IDs, and one block group may own a single stripe or multiple stripes. Figure 2 shows a block group in which every row represents a stripe, so the block group of Figure 2 includes three stripes: stripe A, stripe B and stripe C. Each stripe includes multiple stripe units, each corresponding to one shard. The data shards and parity shards of one stripe form a parity group, and a parity relationship exists among the shards within the parity group.

In EC technology, data (for example a file) is split into n data shards, and m parity shards (redundancy shards) are obtained by running the EC algorithm over the data shards; different stripes of one block group share the same n and m. These n+m shards form a parity group whose storage space utilization is n/(n+m); correspondingly, the stripe and block group containing this parity group have the same utilization. Among the n+m shards of one parity group, any shard that goes wrong (data shard or redundancy shard) can be recovered by the corresponding reconstruction algorithm. Generating the parity shards is called encoding, and recovering lost shards is called decoding.

Usually, different shards of one stripe are stored on different storage devices, so that when some storage devices fail and data is lost, the shards on the surviving devices can recover the lost shards. The storage devices here are, for example, hard disks, storage servers or storage sites. The shards of one parity group are associated in the form of a stripe: each parity group corresponds to one stripe, and each shard corresponds to one stripe unit in the stripe.

While storage devices are used to store data, the number of storage devices owned by the storage system may change. For example, when the system's existing space is insufficient, adding new storage devices increases the count; when a device in the system fails, only the surviving devices are usable (the count decreases); when a failed device is replaced with a new one, the count increases again. More storage devices allow the storage system to support wider stripes; how wide a stripe is can be described by its stripe width, and a larger stripe width brings higher storage space utilization.

Taking hard disks as the storage devices, see Figure 3: in a disk group of 4 disks, block group 1 (containing 2 stripes) spans all the disks, so the width of stripe A1 is 4. In block group 1 the ratio of parity shards to data shards is 1:1, for a storage space utilization of 50%. See Figure 4: if disks 5 and 6 are added to the disk group, new data can be written as block group 2 (containing 2 stripes), all of whose stripes have width 6, so the block group width of block group 2 is 6. In block group 2 the ratio of parity shards to data shards (also called the redundancy ratio) is 1:2, lower than in block group 1, so the storage space utilization is 66.7%. Clearly block group 2 utilizes storage space better than block group 1.
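The utilizations quoted here follow directly from the n/(n+m) formula; a quick check (a sketch, not part of the disclosure):

```python
def utilization(n_data: int, m_parity: int) -> float:
    """Storage space utilization of an n+m erasure-coded stripe."""
    return n_data / (n_data + m_parity)

print(f"block group 1 (2 data + 2 parity): {utilization(2, 2):.1%}")  # 50.0%
print(f"block group 2 (4 data + 2 parity): {utilization(4, 2):.1%}")  # 66.7%
```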
Describing the block groups of Figure 4 in terms of stripe unit groups: block group 1 includes 2 stripes, each consisting of a data stripe unit group and a parity stripe unit group; concretely, each stripe of block group 1 has a data stripe unit group of 2 data stripe units and a parity stripe unit group of 2 parity stripe units. Block group 2 likewise includes 2 stripes, each of which has a data stripe unit group of 4 data stripe units and a parity stripe unit group of 2 parity stripe units.

This embodiment proposes an implementation that converts stripes stored at a high redundancy ratio into stripes stored at a low redundancy ratio. Taking Figure 4 as an example, block group 1 in the storage system is joined with another block group whose redundancy ratio is 1:1 (a block group protected at mirroring strength) to generate a new block group as wide as block group 2. The system's block groups then manage data in the same form as block group 2, and because block group 2's redundancy ratio is lower, the storage space utilization rises.

In this embodiment, narrow source block groups are used to generate a wide target block group. A stripe of the target block group includes a parity stripe unit part and a data stripe unit part: the data stripe unit part points to the data stripe units of the source stripes, while the parity stripe unit part points to parity shards on the hard disks (or storage devices). In other words, the target block group actually owns only parity shards and no data shards; the data shards owned by the source block groups are reachable from the target block group through the pointing relationships.

The block group generation process and the process of reading data shards from a block group are described below, with reference to the block group relationship diagram of Figure 5, the stripe relationship diagram of Figure 6 and the flowchart of Figure 7.
Step S11: after storage devices are added to the storage system, the management device of the storage system determines the block group A and block group B to be merged, where block group A is a block group already existing on the storage devices, and block group B is either an existing block group in the storage system or a block group not yet flushed to disk.

The storage system may further include a management device with computing capability. In a centralized storage system the management device is the controller (not shown in Figure 1) together with program code; by running the code, the controller performs the steps of the method embodiments. In a distributed storage system the management device may be integrated with a storage device, the same server acting as both, or separate from it, for example a dedicated management server alongside dedicated storage servers.

A block group includes one or more stripes, and each stripe includes multiple stripe units that form an EC parity relationship. Each stripe unit corresponds to one shard: data stripe units correspond to data shards, and parity stripe units to parity shards. The parity relationship among stripe units is described in the block group's metadata.

In centralized storage, each block group can correspond to its own metadata. In a distributed scenario, the storage devices are divided into multiple partitions, each block group corresponds to one partition, and each partition owns metadata; when multiple block groups correspond to the same partition, they correspond to the same metadata.

The combination of all storage devices in the storage system is called the storage device group. When a stripe is stored into the storage device group, different shards of one stripe are stored on different storage devices, and different block groups have different block group IDs. Block group A and block group B own the same number of stripes (if their stripe counts differ, the block group with fewer stripes can be padded with all-zero stripes until the counts match), and the data shards of block group A and of block group B may reside on different storage devices. Compared with the stripes of block group A, the stripes of block group B may have the same or a different number of data shards; Figure 5 illustrates the case where they are the same. Data shards of block group A + data shards of block group B = data shards of block group C.
The dashed arrows of Figure 6 illustrate how the parity shards of block group C are generated: computing over the parity shards of block group A and those of block group B yields the parity shards of the target block group (block group C). Block groups A and B may have the same or different numbers of parity shards, as long as the parity shards of block group C can be computed.

In this embodiment, block groups A and B can be selected according to block group C, or block group C can be determined from block groups A and B.

One option is to determine block group A first, then determine block group B jointly from block group A (for example its stripe count, its EC redundancy configuration (2+2) and its partition view) and from the block group C to be assembled (for example block group C's EC redundancy configuration). Block group B may be picked from the storage system's existing block groups; for example, when block group C requires that no two data shards of one stripe reside on the same storage device, block group B must be chosen so that the storage devices holding its data shards do not overlap with those holding block group A's data shards.

Another option is to generate, on the fly, a block group B that satisfies the redundancy configuration implied by block group C and block group A. In other words: number of data chunks per stripe in block group C - number of data chunks per stripe in block group A = number of data chunks per stripe in block group B.

The data of block group B can come from several sources: (1) write IOs newly received while this step executes, from which block group B is generated on demand; (2) write IOs received at any time whose data is still buffered in the management device and has not yet been persisted to the disks of the storage device group; (3) garbage collection (GC) of data already in the storage device group, the reclaimed data forming block group B. In this step, the data shards and parity shards of block group B (or only its parity shards) have not yet been stored to disk; they exist in the memory of the management device.

For example, when the storage device group of the storage system is expanded from 6 to 8 storage devices, to generate a 6+2 block group C from an existing 4+2 block group A, a 2+2 block group B is generated specifically so that block group C can be formed exactly. When the data shards of block group B are later stored to the storage devices, they can be required to go to devices different from those holding block group A's data shards, to improve reliability. Besides reducing or avoiding block groups A and B sharing storage devices, reliability can be improved further by storing the parity shards of block group C on the newly added storage devices, so that they share no device with block groups A and B. Note that distributing the data shards of one stripe across different storage devices is not mandatory: for example, when the storage device is a storage server, even if several data shards of one stripe sit on different hard disks of the same server, a certain level of reliability is still provided; this behavior is called folding the stripe.
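The sizing rule and the placement preference described above can be sketched as follows; the function names and device labels are hypothetical, and a real implementation would also consult stripe counts and partition views:

```python
def plan_block_group_b(c_data: int, a_data: int, b_parity: int) -> tuple[int, int]:
    """Derive block group B's per-stripe data chunk count so that A + B = C.

    Data chunks per stripe: C = A + B, so B gets c_data - a_data.
    The parity count of B is chosen independently (2 in the 2+2 example)."""
    b_data = c_data - a_data
    assert b_data > 0, "block group C must be wider than block group A"
    return b_data, b_parity

def devices_for_b(b_width: int, all_devices: set[str],
                  used_by_a: set[str]) -> set[str]:
    """Pick devices for B's shards that block group A's data shards do not use."""
    free = all_devices - used_by_a
    assert len(free) >= b_width, "not enough non-overlapping devices"
    return set(sorted(free)[:b_width])

print(plan_block_group_b(c_data=6, a_data=4, b_parity=2))   # (2, 2)
print(devices_for_b(4, {f"dev{i}" for i in range(8)},
                    used_by_a={"dev0", "dev1", "dev2", "dev3"}))
```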
One stripe of block group C is obtained jointly from one stripe of block group A and one stripe of block group B; this process can be described as a stripe of block group C being merged from a stripe of A and a stripe of B. Referring to Figure 5, block group C is obtained from block groups A and B in this embodiment. Specifically, the 1st stripe of block group C (stripe C1) is obtained from the 1st stripe of A (stripe A1) and the 1st stripe of B (stripe B1), and the 2nd stripe of C (stripe C2) from the 2nd stripe of A (stripe A2) and the 2nd stripe of B (stripe B2).

Step S12: the management device computes the parity shards of block group C and stores the computed parity shards into the storage device group.

After step S12 is executed, three block groups exist in the storage system: block groups A, B and C. Block groups A and B own data shards and need not actually own parity shards, while block group C may actually own only parity shards and no data shards.

Since block group C includes multiple stripes, only the generation of stripe C1 is described here; stripe C2 is generated in the same way, so the description is not repeated.
The parity shards of stripe C1 (parity shard 41 and parity shard 42) can be computed in several ways, specifically: (1) from the parity shards of stripe A1 (parity shards 13 and 14) and the parity shards of stripe B1 (parity shards 33 and 34); (2) from the parity shards of stripe A1 (parity shards 13 and 14) and the data shards of stripe B1 (data shards 31 and 32); (3) from the data shards of stripe A1 (data shards 11 and 12) and the parity shards of stripe B1 (parity shards 33 and 34); (4) from the data shards of stripe A1 (data shards 11 and 12) and the data shards of stripe B1 (data shards 31 and 32).
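Method (1) is attractive because, for a linear code, the parity of the merged stripe can be derived from the source parities alone, without reading any data shards back from disk. A minimal sketch for the simplest such code, a single XOR parity per stripe (this disclosure does not fix the EC algorithm; for other linear codes an analogous combination of the source parities applies):

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical 4-byte shards of stripes A1 and B1.
d11, d12 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"   # data shards of A1
d31, d32 = b"\xaa\xbb\xcc\xdd", b"\x11\x22\x33\x44"   # data shards of B1

p_a = xor_bytes(d11, d12)   # parity of source stripe A1
p_b = xor_bytes(d31, d32)   # parity of source stripe B1

# Method (1): combine the source parities only.
p_c_from_parities = xor_bytes(p_a, p_b)
# Method (4): recompute over all data shards.
p_c_from_data = reduce(xor_bytes, [d11, d12, d31, d32])

assert p_c_from_parities == p_c_from_data   # identical for a linear code
```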
When stripe A1 is data already resident on the storage devices while stripe B1's data comes from write IOs not yet flushed to disk or from GC, method (2) or (4) above can be used: stripe C1's parity shards are generated directly from stripe B1's data shards, without first computing stripe B1's parity shards, which reduces the management device's computation.

The data shards of the existing stripes A1 and B1, together with the newly generated parity shards of stripe C1, jointly form the shard sources of stripe C1. To the user, stripe C1, like stripes A1 and B1, owns complete data shards and parity shards. In fact, in the management device's records, stripe C1 owns only parity shards and does not really own data shards (stripes A1 and B1 are the opposite: they own only data shards and no parity shards). Through the relationships established among the three stripes, stripe C1 'indirectly' owns data shards, and stripes A1 and B1 'indirectly' own parity shards. Note that in this embodiment, when a stripe unit's management device address corresponds directly to an on-disk logical address, that stripe unit, the stripe containing it and the block group containing it are considered to own the data shard stored at that on-disk logical address; this direct correspondence is usually recorded in the block group's metadata.

The logical address of each shard can be described by the management device as 'block group ID + offset' (the management device address). For stripes already persisted, the management device address corresponds to an on-disk logical address (hard disk ID + LBA), and the management device records this correspondence in metadata. Through the metadata, the parity shards of stripes A1 and B1 can therefore be obtained from the hard disks, read into the management device's memory, and used by the management device's processor to compute the parity shards of stripe C1.
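A minimal sketch of this address bookkeeping, with hypothetical IDs, field names and addresses:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DiskAddress:
    disk_id: int
    lba: int

# Management device address (block group ID, offset) -> on-disk logical address.
# Only persisted stripe units appear in the table.
address_map: dict[tuple[str, int], DiskAddress] = {
    ("A", 0): DiskAddress(disk_id=1, lba=0x1000),  # data shard 11
    ("A", 1): DiskAddress(disk_id=2, lba=0x1000),  # data shard 12
    ("A", 2): DiskAddress(disk_id=3, lba=0x1000),  # parity shard 13
    ("A", 3): DiskAddress(disk_id=4, lba=0x1000),  # parity shard 14
}

def resolve(block_group_id: str, offset: int) -> DiskAddress:
    """Look up the on-disk address recorded in the block group's metadata."""
    return address_map[(block_group_id, offset)]

print(resolve("A", 2))   # DiskAddress(disk_id=3, lba=4096): parity shard 13
```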
For a block group owning multiple stripes, the stripe units are arranged as a matrix: a row of the matrix is a stripe, and a column consists of the stripe units at the same position in different stripes. The concepts of row and column are introduced with reference to Figures 2 and 5. The block group of Figure 2 includes 12 stripe units spread over 3 rows and 4 columns. In Figure 5, data shard 11 and data shard 15 are owned by the stripe units of the 1st column of block group A; parity shard 43 is owned by the stripe unit at row 2, column 5 of block group C, while row 2, columns 1-4 of block group C own no shards.

Step S12 also includes generating the metadata of block group C. The metadata of block group C records the IDs of the block groups associated with C, namely the block group IDs of A and B. It further records the correspondence between C's columns and A's columns and between C's columns and B's columns; in other words, the metadata of block group C records the correspondence between C's columns and the columns of A and B. Optionally, this correspondence may cover only the data columns, not the parity columns. Figure 8 illustrates the column correspondence recorded in the metadata: in this embodiment, columns 1 and 2 of block group C correspond to columns 1 and 2 of block group A, and columns 3 and 4 of block group C correspond to columns 1 and 2 of block group B. The recorded correspondence may cover only data columns and omit parity columns. In another implementation, the columns are not in one-to-one correspondence: the parity columns of block group C are computed from the parity columns of block groups A and B, and this computation rule can be recorded in the metadata, or be discoverable through the block group's metadata; it can be described by the following formula.

f(column 3 of block group A, column 4 of block group A, column 3 of block group B, column 4 of block group B) = (column 5 of block group C, column 6 of block group C)

In practical implementations a column can be labeled with a chunk ID: different chunk IDs represent different columns, and the correspondence between columns is described by the correspondence between chunk IDs.

The row-to-row correspondence between block groups can be recorded in block group C's metadata, or left unrecorded with the default that row n of block group C corresponds to row n of block group A and row n of block group B. Note that each row of a block group is an independent stripe, so row and stripe are the same concept.

The correspondence between any stripe unit of block group C and a stripe unit of block group A or block group B is described by, or discoverable through, the metadata of block group C. In Figure 6, the solid arrows illustrate the correspondence between the stripe units of the first row of block group C and the stripe units of the first rows of block groups A and B; the correspondences for the remaining stripes of C with A and B follow the same logic and are not repeated.

As Figure 6 shows, stripe C1 of block group C corresponds to stripe A1 of block group A and stripe B1 of block group B. Specifically, the column-1 stripe unit of stripe C1 corresponds to the column-1 stripe unit of stripe A1; the column-2 stripe unit of C1 to the column-2 stripe unit of A1; the column-3 stripe unit of C1 to the column-1 stripe unit of B1; and the column-4 stripe unit of C1 to the column-2 stripe unit of B1.

Block group C does not really own data shards, but the correspondence supports subsequent reads of data shards. For example, to read the data shard of the 2nd data stripe unit of stripe C1 in block group C, the correspondence locates the stripe unit that really owns that data shard (the 2nd stripe unit of stripe A1 in block group A), and then the corresponding data shard (data shard 12) is read out according to block group A's metadata, which records the correspondence between stripe units' management device addresses and on-disk logical addresses.
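The redirected read can be sketched as follows; unit_map, column_map_c and the shard contents are hypothetical stand-ins for the metadata and disk I/O just described:

```python
# Hypothetical per-block-group metadata: (row, col) -> (disk_id, lba).
# A and B record their data columns; C records only its parity columns.
unit_map = {
    "A": {(0, 0): (1, 0x1000), (0, 1): (2, 0x1000)},  # data shards 11, 12
    "B": {(0, 0): (5, 0x2000), (0, 1): (6, 0x2000)},  # data shards 31, 32
    "C": {(0, 4): (7, 0x3000), (0, 5): (8, 0x3000)},  # parity shards 41, 42
}
# From C's metadata: C's data column -> (owning block group, its column).
column_map_c = {0: ("A", 0), 1: ("A", 1), 2: ("B", 0), 3: ("B", 1)}

def read_from_c(row: int, col: int, read_disk) -> bytes:
    """Read the shard behind (row, col) of block group C; rows map 1:1."""
    if col in column_map_c:                       # data column: follow the pointer
        owner, owner_col = column_map_c[col]
        return read_disk(*unit_map[owner][(row, owner_col)])
    return read_disk(*unit_map["C"][(row, col)])  # parity column: C owns it

fake_disk = {(1, 0x1000): b"shard11", (2, 0x1000): b"shard12",
             (5, 0x2000): b"shard31", (6, 0x2000): b"shard32",
             (7, 0x3000): b"shard41", (8, 0x3000): b"shard42"}
assert read_from_c(0, 1, lambda d, l: fake_disk[(d, l)]) == b"shard12"
assert read_from_c(0, 4, lambda d, l: fake_disk[(d, l)]) == b"shard41"
```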
The metadata of block group C further records the correspondence between the management device addresses of C's parity stripe units and their on-disk logical addresses, so the corresponding parity shards (parity shard 41 and parity shard 42) can be read out directly when needed.

Step S12 also updates the metadata of block groups A and B. Once the solution of this embodiment is applied, the parity shards owned by block groups A and B can be deleted (to preserve the appearance of a stripe, the original parity stripe units of A and B can be retained), saving storage space; accordingly, the parts of A's and B's metadata concerning parity shards must be updated. Before the update, the metadata of A and B records the correspondence between each stripe unit's management device address and on-disk logical address. After the update, the parts concerning data stripe units remain unchanged, while the parts concerning parity stripe units indicate that block group C provides parity for the data stripe unit groups of A and B; in other words, a stripe of C provides the parity for the data stripe group of the corresponding stripe of A and for that of the corresponding stripe of B. Thus, when the storage system receives a host read request that needs parity for block group A or B, the metadata of A or B reveals that the parity capability of these two block groups is provided by block group C.

Step S13: when the management device receives a read request from the host, it reads the corresponding data shard.

The read implementation is briefly introduced below. Depending on which block group the read data unit belongs to and on whether the data shard has been lost, there are four cases, introduced one by one below for data shards located in stripe A1 or stripe C1. If the data shard the host wants to read is in stripe A2, B1 or B2, the principle is the same and is not repeated. The host accesses the storage system with a host address (LUN ID + LBA + length), which the storage system translates into a management device address (block group ID + offset).
(1) Reading a data shard of stripe A1 when the shard is not lost.

According to the on-disk logical address (disk ID + in-disk LBA) that block group A's metadata records for the management device address (block group ID + offset), the shard is read directly from the corresponding hard disk.

For example, to read the data of the first-column stripe unit of stripe A1 in block group A: from block group A's metadata, the storage system directly obtains the on-disk logical address mapped from the management device address of the first-column stripe unit of stripe A1, uses it to read data shard 11 directly, and returns the shard to the host.

(2) Reading a data shard of stripe A1 when the shard is lost.

Take the data shard of the first-column stripe unit of stripe A1 as the example: because data shard 11 is lost, the read fails and the flow enters a degraded read. The metadata of block group A records that A's redundancy protection is provided by block group C; at stripe granularity, stripe C1 of block group C provides the data protection for stripe A1 of block group A. The read of data shard 11 is therefore converted into recovering the data shard corresponding to the column-1 stripe unit of stripe C1. The storage system reads the shards of columns 2-6 of stripe C1 according to block group C's metadata, applies the parity algorithm to obtain data shard 11, and returns data shard 11 to host 100. Because the details resemble case (4) described below, that description applies here as well.

Referring to the correspondence shown by the solid arrows in Figure 6, block group C does not directly record the on-disk logical addresses of the column 2-4 shards of stripe C1, so these three data shards are obtained through stripes A1 and B1 respectively; the column 5-6 shards of stripe C1 (the parity shards) are obtained directly through block group C's own metadata.
(3) Reading a data shard of stripe C1 when the shard is not lost.

Block group C does not own data shards directly; however, through block group C's metadata it is known which shard-owning stripe units correspond to C's stripe units, and the data shards can be read out through them.

For example, the host requests the first-column stripe unit of stripe C1 in block group C. Block group C's metadata records that the first-row stripe of C corresponds to the first-row stripe of A, and the first column of C corresponds to the first column of A. The first-column stripe unit of stripe C1 is therefore uniquely determined to correspond to the first-column stripe unit of stripe A1. Through block group A's metadata the storage system learns the on-disk logical address corresponding to the management device address of that stripe unit, and then obtains data shard 11 through that on-disk logical address.

(4) Reading a data shard of stripe C1 when the shard is lost.

For example, the host requests the first-column stripe unit of stripe C1 in block group C. Following the process described in case (3), the storage system attempts to read the data shard through the first-column stripe unit of stripe A1, but the read fails because data shard 11 is lost. The storage system then performs data recovery.

First, by the same method used to read the first-column stripe unit of stripe C1, the remaining data shards of stripe C1 are read, namely data shards 12, 31 and 32; next, parity shards 41 and 42 are read through the on-disk logical addresses recorded in block group C's metadata. Data shard 11 is recovered from data shards 12, 31 and 32 and parity shards 41 and 42, and returned to the host. Additionally, the data shard 11 obtained through recovery can be stored back to a hard disk.
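For a single-XOR-parity code this degraded read reduces to XORing the survivors; a minimal sketch with hypothetical one-byte shards (a real 4+2 code such as Reed-Solomon would instead solve a small linear system over a finite field):

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d11 = b"\x11"                                  # the shard that will be lost
d12, d31, d32 = b"\x12", b"\x31", b"\x32"      # surviving data shards
p41 = reduce(xor_bytes, [d11, d12, d31, d32])  # parity shard 41 of stripe C1

def degraded_read(survivors: list[bytes]) -> bytes:
    """Recover the single lost shard as the XOR of all surviving shards."""
    return reduce(xor_bytes, survivors)

assert degraded_read([d12, d31, d32, p41]) == d11   # data shard 11 rebuilt
```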
The four cases above show that, as long as no data shard is lost, or the lost shards do not exceed the parity capability of the EC algorithm, a host read request for any data shard of block group A, B or C can be executed successfully.

Step S14: referring to Figure 9, the block group C generated in step S12 can be merged again, with a block group D, to generate a new block group E. Block group D may be a block group generated by the method of this embodiment, a block group that already existed and was not generated by this method, or a newly generated block group not yet stored to the storage devices. The process of generating block group E from block groups C and D, and of reading data from block group E, can follow steps S11-S13 with only a few adaptive changes. For example, when reading the data shard of the first stripe unit of the first row of block group E (stripe E1), the metadata of block group E is consulted first to find the block group C corresponding to E; next, the metadata of block group C is consulted to find the block group A corresponding to C; finally, the data shard is obtained through the on-disk logical address recorded in block group A's metadata. Compared with steps S11-S13 the principle is identical, with only minor adaptations such as one extra lookup of the correspondence between management device addresses, so it is not detailed here for brevity. In another implementation, the lookup through block group C can be avoided by making block group E point directly to block groups A and B; then, when reading the data shard of the first stripe unit of stripe E1, block group A corresponding to block group E is found directly without consulting block group C, and the data shard is obtained through the on-disk logical address recorded in block group A's metadata. In that implementation the parity shards of block group C can be deleted, which shortens the lookup chain and saves storage space.
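The cascaded lookup, and the flattening variant that skips block group C, can be sketched as follows with hypothetical metadata tables:

```python
# Hypothetical metadata: a data column either points at (group, column) of
# another block group, or is owned locally at an on-disk (disk_id, lba).
pointers = {
    ("E", 0): ("C", 0),   # column 1 of E -> column 1 of C
    ("C", 0): ("A", 0),   # column 1 of C -> column 1 of A
}
owned = {("A", 0): (1, 0x1000)}   # column 1 of A really sits on disk 1

def resolve_column(group: str, col: int) -> tuple[int, int]:
    """Follow pointer metadata until a block group that owns the shard."""
    while (group, col) in pointers:
        group, col = pointers[(group, col)]
    return owned[(group, col)]

assert resolve_column("E", 0) == (1, 0x1000)   # E -> C -> A -> disk
pointers[("E", 0)] = ("A", 0)                  # flatten: E points straight at A
assert resolve_column("E", 0) == (1, 0x1000)   # one hop fewer, same shard
```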
Referring to Figure 10, the invention further provides a management apparatus for generating block groups. The management apparatus can be a hardware device (for example a controller, or the management device) or a program product run by at least one processor. The management apparatus includes an acquiring module 71, a generating module 72 and a storage module 73; when it externally offers a data-read service, it further includes a receiving module 74. The apparatus can perform the method described above; since the method embodiments explained it in detail, only a brief description follows.

The acquiring module 71 is configured to acquire a first source stripe from a first source block group, the first source stripe including: a first data stripe unit group that owns a first data shard group, and a first parity stripe unit group that owns a first parity shard group. The acquiring module 71 is further configured to acquire a second source stripe from a second source block group, the second source stripe including: a second data stripe unit group that owns a second data shard group, and a second parity stripe unit group that owns a second parity shard group.

The generating module 72 is configured to generate a target stripe, the target stripe including: a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the first data stripe unit group and the second data stripe unit group, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set formed by the first and second data shard groups.

The storage module 73 stores the target parity shard group to a storage device group; the storage device group is, for example, a collection of hard disks or of storage servers.

The target data stripe unit group owns no data shards.

When the first source block group and the second source block group are both located on the storage devices existing before the expansion of the storage system, the storage module 73 is specifically configured to store the parity shards of the target stripe to the storage devices newly added by the expansion.

When the first source block group is located on the storage devices existing before the expansion of the storage system and the second source block group is a block group not yet stored to the storage devices, the storage module 73 is specifically configured to store the parity shards of the target stripe to the storage devices newly added by the expansion. In this case, the second data stripe unit group is generated from newly written data, or from garbage-collected data, and the data shards of the second source block group are stored to the storage devices.
Optionally, the first source stripe and the second source stripe have the same redundancy ratio.

Optionally, the target stripe belongs to a target block group, and the first source block group, the second source block group and the target block group own the same number of stripes.

The storage module 73 is further configured to use the correspondence between the target stripe and the first and second source stripes, so that the data shards of the first source stripe and of the second source stripe need not be written into the target stripe.

The management apparatus further includes a receiving module 74 configured to receive a read request for a target stripe unit in the target stripe. Correspondingly, the acquiring module 71 is further configured to determine, according to the data stripe unit group pointed to by the target data stripe unit, the stripe unit corresponding to the target stripe unit, and to acquire the data shard owned by the determined stripe unit.

The acquiring module 71 is configured to: when a data shard in the first data shard group or the second data shard group is lost, read the target parity shard group, read the surviving data shards of the first source stripe and the second source stripe, and reconstruct the lost data shard.
The block group containing the target stripe is the target block group, where: the acquiring module 71 is further configured to acquire a third source stripe from a third source block group, the third source stripe including: a third data stripe unit group that owns a third data shard group, and a third parity stripe unit group that owns a third parity shard group; the acquiring module 71 is further configured to generate a new target stripe from the target stripe and the third source stripe, the new target stripe including: a new target data stripe unit group and a new target parity stripe unit group, where the new target data stripe unit group points to the target data stripe unit group and the third data stripe unit group, the new target parity stripe unit group owns a new target parity shard group, and a parity relationship exists between the new target parity shard group and the set formed by the data shard groups pointed to by the new target data stripe unit group; and the storage module 73 is further configured to store the new target parity shard group to the storage devices.
The functions of the modules in the management apparatus can also be described in another way. The acquiring module 71 is configured to acquire N source stripes from N source block groups, different source stripes coming from different source block groups, where each source stripe includes: a data stripe unit group that owns a data shard group, and a parity stripe unit group that owns a parity shard group, N being equal to 2 or greater than 2. The generating module 72 is configured to generate a target stripe, the target stripe including: a target data stripe unit group and a target parity stripe unit group, where the target data stripe unit group points to the N source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of the N data shard groups. The storage module 73 is configured to store the target parity shard group to the storage devices. The storage device introduced in Figure 1 doubles as the management device; the storage device includes a storage medium (not shown), a processor (not shown) and the disk group that persistently stores the parity shards of block group A, block group B and block group C.

Figure 11 is a schematic diagram of a storage system architecture that includes a management device. The management device 81 includes a storage medium 811 and a processor 812: the storage medium 811 (for example ROM or an SSD) stores program instructions, and at least one processor 812 is coupled to the storage medium 811, the at least one processor 812 being configured to perform the preceding method steps by running the program instructions. In some cases the management device 81 can double as the storage device 82, that is, the same device plays both the management device role and the storage device role.
The invention further provides an embodiment of a computer program product containing instructions which, when run on a computer, cause the computer to execute the solution of the first aspect and of its various possible implementations.

The invention further provides an embodiment of a computer-readable storage medium including instructions which, when run on a computer, cause the computer to execute the solution of the first aspect and of its various possible implementations.
One or more of the above modules or units can be implemented in software, hardware or a combination of the two. When any of the above modules or units is implemented in software, the software exists as computer program instructions stored in a storage device, and a processor can be used to execute the program instructions and implement the above method flows. The processor may include, but is not limited to, at least one of the following computing devices that run software: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller unit (MCU), an artificial intelligence processor, and so on, each of which may include one or more cores for executing software instructions to perform operations or processing. The processor may be built into an SoC (system on chip) or an application-specific integrated circuit (ASIC), or may be an independent semiconductor chip. Beyond the cores for executing software instructions, the processor may further include necessary hardware accelerators, such as a field programmable gate array (FPGA), a PLD (programmable logic device), or a logic circuit implementing dedicated logic operations.

When the above modules or units are implemented in hardware, the hardware may be any one or any combination of a CPU, a microprocessor, a DSP, an MCU, an artificial intelligence processor, an ASIC, an SoC, an FPGA, a PLD, a dedicated digital circuit, a hardware accelerator or a non-integrated discrete device, which may run the necessary software or operate without depending on software to perform the above method flows.

Claims (27)

  1. A method for generating a block group, characterized in that the method comprises:
    acquiring a first source stripe from a first source block group, the first source stripe comprising: a first data stripe unit group that owns a first data shard group, and a first parity stripe unit group that owns a first parity shard group;
    acquiring a second source stripe from a second source block group, the second source stripe comprising: a second data stripe unit group that owns a second data shard group, and a second parity stripe unit group that owns a second parity shard group;
    generating a target stripe, the target stripe comprising: a target data stripe unit group and a target parity stripe unit group, wherein: the target data stripe unit group points to the first data stripe unit group and the second data stripe unit group, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the shard set formed by the first and second data shard groups;
    storing the target parity shard group to a storage device group.
  2. The block group generation method according to claim 1, characterized in that:
    the target data stripe unit group owns no data shards.
  3. The method according to claim 1, characterized in that:
    the first source block group and the second source block group are both located in the storage device group existing before expansion of the storage system; or
    the first source block group is located in the storage device group existing before expansion of the storage system, and the second source block group is a block group not yet stored into the storage device group.
  4. The method according to claim 3, characterized in that, when the second source block group is a block group not yet stored into the storage device group, the second data stripe unit group is generated from newly written data, or the second data stripe unit group is generated from garbage-collected data, and the method further comprises:
    storing the data shards of the second source block group to the storage device group.
  5. The method according to claim 1, characterized in that the step of storing the target parity shard group specifically comprises:
    storing the parity shards of the target stripe to the storage device group newly added after expansion of the storage system.
  6. The method according to claim 1, characterized in that the first source stripe and the second source stripe have the same redundancy ratio.
  7. The method according to claim 1, characterized in that:
    the target stripe belongs to a target block group, and the first source block group, the second source block group and the target block group own the same number of stripes.
  8. The method according to claim 1, characterized by further comprising:
    using the correspondence between the target stripe and the first and second source stripes so that the data shards of the first source stripe and of the second source stripe need not be written into the target stripe.
  9. The method according to claim 1, characterized by further comprising:
    receiving a read request for a target stripe unit in the target stripe;
    determining, according to the data stripe unit group pointed to by the target data stripe unit, the stripe unit corresponding to the target stripe unit, wherein the determined stripe unit belongs to the first data stripe unit group or the second data stripe unit group;
    acquiring the data shard owned by the determined stripe unit.
  10. The method according to claim 1, characterized by further comprising:
    when a data shard in the first data shard group or the second data shard group is lost, reading the target parity shard group, reading the surviving data shards of the first source stripe and the second source stripe, and reconstructing the lost data shard.
  11. The method according to claim 1, characterized in that the target stripe belongs to a target block group, and the method further comprises:
    acquiring a third source stripe from a third source block group, the third source stripe comprising: a third data stripe unit group that owns a third data shard group, and a third parity stripe unit group that owns a third parity shard group;
    generating a new target stripe from the target stripe and the third source stripe, the new target stripe comprising: a new target data stripe unit group and a new target parity stripe unit group, wherein:
    the new target data stripe unit group points to the first data stripe unit group, the second data stripe unit group and the third data stripe unit group; or the new target data stripe unit group points to the target data stripe unit group and the third data stripe unit group;
    the new target parity stripe unit group owns a new target parity shard group, wherein a parity relationship exists between the new target parity shard group and the data set formed by the data shard groups pointed to by the new target data stripe unit group;
    storing the new target parity shard group to the storage device group.
  12. A management apparatus for generating a block group, characterized in that the management apparatus comprises:
    an acquiring module, configured to acquire a first source stripe from a first source block group, the first source stripe comprising: a first data stripe unit group that owns a first data shard group, and a first parity stripe unit group that owns a first parity shard group;
    the acquiring module, further configured to acquire a second source stripe from a second source block group, the second source stripe comprising: a second data stripe unit group that owns a second data shard group, and a second parity stripe unit group that owns a second parity shard group;
    a generating module, configured to generate a target stripe, the target stripe comprising: a target data stripe unit group and a target parity stripe unit group, wherein: the target data stripe unit group points to the first data stripe unit group and the second data stripe unit group, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set formed by the first and second data shard groups;
    a storage module, which stores the target parity shard group to a storage device group.
  13. The block group generation management apparatus according to claim 12, characterized in that:
    the target data stripe unit group owns no data shards.
  14. The management apparatus according to claim 12, characterized in that:
    the first source block group and the second source block group are both located in the storage device group existing before expansion of the storage system; or
    the first source block group is located in the storage device group existing before expansion of the storage system, and the second source block group is a block group not yet stored into the storage device group.
  15. The management apparatus according to claim 12, characterized in that the storage module is specifically configured to:
    store the parity shards of the target stripe to the storage device group newly added after expansion of the storage system.
  16. The management apparatus according to claim 14, wherein, when the second source block group is a block group not yet stored into the storage device group, the second data stripe unit group is generated from newly written data, or the second data stripe unit group is generated from garbage-collected data, and the storage module is further configured to:
    store the data shards of the second source block group to the storage device group.
  17. The management apparatus according to claim 12, characterized in that the first source stripe and the second source stripe have the same redundancy ratio.
  18. The management apparatus according to claim 12, characterized in that:
    the target stripe belongs to a target block group, and the first source block group, the second source block group and the target block group own the same number of stripes.
  19. The management apparatus according to claim 12, characterized in that the storage module is further configured to:
    use the correspondence between the target stripe and the first and second source stripes so that the data shards of the first source stripe and of the second source stripe need not be written into the target stripe.
  20. The management apparatus according to claim 12, characterized in that the management apparatus further comprises:
    a receiving module, configured to receive a read request for a target stripe unit in the target stripe;
    the acquiring module, further configured to determine, according to the data stripe unit group pointed to by the target data stripe unit, the stripe unit corresponding to the target stripe unit, wherein the determined stripe unit belongs to the first data stripe unit group or the second data stripe unit group;
    the acquiring module, further configured to acquire the data shard owned by the determined stripe unit.
  21. The management apparatus according to claim 12, characterized in that the acquiring module is configured to:
    when a data shard in the first data shard group or the second data shard group is lost, read the target parity shard group, read the surviving data shards of the first source stripe and the second source stripe, and reconstruct the lost data shard.
  22. The management apparatus according to claim 12, characterized in that the target stripe belongs to a target block group, wherein:
    the acquiring module is further configured to acquire a third source stripe from a third source block group, the third source stripe comprising: a third data stripe unit group that owns a third data shard group, and a third parity stripe unit group that owns a third parity shard group;
    the acquiring module is further configured to generate a new target stripe from the target stripe and the third source stripe, the new target stripe comprising: a new target data stripe unit group and a new target parity stripe unit group, wherein:
    the new target data stripe unit group points to the first data stripe unit group, the second data stripe unit group and the third data stripe unit group; or the new target data stripe unit group points to the target data stripe unit group and the third data stripe unit group;
    the new target parity stripe unit group owns a new target parity shard group, wherein a parity relationship exists between the new target parity shard group and the data set formed by the data shard groups pointed to by the new target data stripe unit group;
    the storage module is further configured to store the new target parity shard group to the storage device group.
  23. The management apparatus according to claim 12, characterized in that the storage device group comprises:
    multiple hard disks or multiple storage servers.
  24. A storage management device, comprising:
    a storage medium, configured to store program instructions;
    at least one processor, coupled to the storage medium, the at least one processor being configured to execute the method according to any one of claims 1-11 by running the program instructions.
  25. A method for generating a block group, characterized in that the method comprises:
    acquiring N source stripes from N source block groups, different source stripes coming from different source block groups, wherein each source stripe comprises: a data stripe unit group that owns a data shard group, and a parity stripe unit group that owns a parity shard group, N being greater than 2;
    generating a target stripe, the target stripe comprising: a target data stripe unit group and a target parity stripe unit group, wherein: the target data stripe unit group points to all the source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of all the source data shard groups;
    storing the target parity shard group to a storage device group.
  26. A management apparatus for generating a block group, characterized in that the management apparatus comprises:
    an acquiring module, configured to acquire N source stripes from N source block groups, different source stripes coming from different source block groups, wherein each source stripe comprises: a data stripe unit group that owns a data shard group, and a parity stripe unit group that owns a parity shard group, N being greater than 2;
    a generating module, configured to generate a target stripe, the target stripe comprising: a target data stripe unit group and a target parity stripe unit group, wherein: the target data stripe unit group points to the N source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of the N data shard groups;
    a storage module, configured to store the target parity shard group to a storage device group.
  27. A storage management device, comprising:
    a storage medium, configured to store program instructions;
    at least one processor, coupled to the storage medium, the at least one processor being configured to perform the following steps by running the program instructions: acquiring N source stripes from N source block groups, different source stripes coming from different source block groups, wherein each source stripe comprises: a data stripe unit group that owns a data shard group, and a parity stripe unit group that owns a parity shard group, N being greater than 2;
    generating a target stripe, the target stripe comprising: a target data stripe unit group and a target parity stripe unit group, wherein: the target data stripe unit group points to all the source data stripe unit groups, the target parity stripe unit group owns a target parity shard group, and a parity relationship exists between the target parity shard group and the set of all the source data shard groups;
    storing the target parity shard group to a storage device group.
PCT/CN2022/142246 2021-12-29 2022-12-27 Method, apparatus and device for generating block groups WO2023125507A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202111643049 2021-12-29
CN202111643049.5 2021-12-29
CN202111666642.1 2021-12-31
CN202111666642.1A CN116414294A (zh) 2021-12-29 2021-12-31 Method, apparatus and device for generating block groups

Publications (1)

Publication Number Publication Date
WO2023125507A1 true WO2023125507A1 (zh) 2023-07-06

Family

ID=86997897

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142246 WO2023125507A1 (zh) 2021-12-29 2022-12-27 生成块组的方法、装置和设备

Country Status (1)

Country Link
WO (1) WO2023125507A1 (zh)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016180055A1 (zh) * 2015-05-12 2016-11-17 ZTE Corporation Data storage and reading method, apparatus and system
CN105630423A (zh) * 2015-12-25 2016-06-01 Huazhong University of Science and Technology Erasure-code cluster storage capacity expansion method based on data caching
CN107436733A (zh) * 2017-06-29 2017-12-05 Huawei Technologies Co., Ltd. Shard management method and shard management apparatus
CN109995813A (zh) * 2017-12-29 2019-07-09 Hangzhou Huawei Digital Technologies Co., Ltd. Partition expansion method, data storage method and apparatus
CN112130768A (zh) * 2020-09-18 2020-12-25 Suzhou Inspur Intelligent Technology Co., Ltd. Online disk array capacity expansion method and apparatus, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
JP6294518B2 (ja) Synchronous mirroring in a non-volatile memory system
US10372537B2 (en) Elastic metadata and multiple tray allocation
US10019317B2 (en) Parity protection for data chunks in an object storage system
US9298386B2 (en) System and method for improved placement of blocks in a deduplication-erasure code environment
US7814273B2 (en) Dynamically expandable and contractible fault-tolerant storage system permitting variously sized storage devices and method
US6996689B2 (en) Systems and methods for striped storage migration
US9990263B1 (en) Efficient use of spare device(s) associated with a group of devices
CN114415976B (zh) Distributed data storage system and method
CN112988067B (zh) Data update technology
US11003558B2 (en) Systems and methods for sequential resilvering
WO2019080370A1 (zh) Data reading and writing method and apparatus, and storage server
CN110427156B (zh) Parallel MBR reading method based on sharding
US11537330B2 (en) Selectively improving raid operations latency
WO2023197937A1 (zh) Data processing method and apparatus, storage medium and computer program product
WO2023125507A1 (zh) Method, apparatus and device for generating block groups
CN116414294A (zh) Method, apparatus and device for generating block groups
CN117370067B (zh) Data layout and encoding method for a large-scale object storage system
US20210034472A1 (en) Method and system for any-point-in-time recovery within a continuous data protection software-defined storage
CN115809011A (zh) Data reconstruction method and apparatus in a storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914760

Country of ref document: EP

Kind code of ref document: A1