CN117370067B

CN117370067B - Data layout and coding method of large-scale object storage system

Info

Publication number: CN117370067B
Application number: CN202311669991.8A
Authority: CN
Inventors: 徐光平; 郑峰; 田毅; 杨磊; 杨洪章; 郭江谱
Original assignee: Raycom Joint Creation Tianjin Information Technology Co ltd; Tianjin University of Technology
Current assignee: Raycom Joint Creation Tianjin Information Technology Co ltd; Tianjin University of Technology
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-04-12
Anticipated expiration: 2043-12-07
Also published as: CN117370067A

Abstract

The invention provides a data layout and coding method for a large-scale object storage system. The method logically adjusts the size of an object file to an even number and cuts the file into a series of data blocks whose sizes follow a geometric sequence; distributes the data blocks to different data nodes and records their placement; and performs parity check coding on the data blocks to generate local check blocks and global check blocks. When the client executes the Get command, the data on the corresponding data nodes is read and merged into the original object file; if the node holding a data block has failed, the data block is recovered through the local check block and then sent to the client, and if it cannot be recovered through the local check block, it is recovered through the global check block. The beneficial effects of the invention are: the method avoids the additional overhead caused by hybrid coding, fully exploits the high recovery performance of the regenerating code, has a lower repair cost during degraded reads, reduces degraded-read latency, and improves recovery efficiency.

Description

Data layout and coding method of large-scale object storage system

Technical Field

The invention belongs to the field of distributed storage, and particularly relates to a data layout and coding method of a large-scale object storage system.

Background

Data reliability is a very important research topic in distributed storage systems. Systems such as Haystack and Amazon S3 handle large amounts of data every day. These data, referred to as BLOBs (Binary Large Objects), present a significant challenge to the scalability and stability of distributed storage systems. Most existing solutions introduce data redundancy, including multi-copy (replication) technology and erasure code technology. Multi-copy technology copies the same data to different nodes to improve data reliability; systems such as Google's GFS and HDFS adopt it. The most common strategy at present, three-copy replication, requires three times the storage of the original data and is therefore not suitable for the ZB-scale data volumes of the current big data era.

Erasure coding is now widely used in existing distributed storage systems, such as Ceph and F4, to improve system reliability. Erasure codes such as RS codes can provide reliability comparable to replication at a lower storage overhead. However, to repair one data block, a conventional erasure code must read k complete blocks (data blocks and check blocks), which incurs a large repair overhead. To reduce the repair cost of erasure codes, regenerating codes based on network coding have been proposed. A regenerating code is a special erasure code designed to solve the poor repair performance caused by the high repair cost of traditional erasure codes. When a node fails, a regenerating code can repair the lost data by downloading only a small amount of data from other nodes, which effectively reduces the repair cost. MSR codes, for example, can theoretically achieve optimal repair performance while offering storage efficiency and reliability equivalent to RS codes. However, because a regenerating code uses the block as its repair granularity, it cannot exploit its low repair cost when applied directly to an object storage system.

In an object storage system, an RS code uses bytes as coding units, while a regenerating code uses data blocks as coding units. Because a regenerating code uses the block as its repair granularity, the choice of block size has a fundamental influence on actual performance. As noted in the work on geometric partitioning, regenerating codes reduce repair I/O by introducing finer-grained sub-blocks and more complex relationships between sub-blocks. A data block is organized into a plurality of sub-blocks when it is encoded. When a node fails, data repair only needs to read a small portion of non-contiguous sub-blocks from other disks for decoding. This scattered disk access pattern compromises recovery efficiency; therefore, large blocks are required to realize the recovery performance of a regenerating code. However, unlike RS codes, which repair at byte granularity, regenerating codes require data to be stored in large blocks. This causes serious read amplification: redundant data is read during a degraded read of an object, which seriously harms degraded-read performance.

Disclosure of Invention

In view of the foregoing, the present invention aims to propose a data layout and encoding method for a large-scale object storage system, so as to solve at least one of the above-mentioned technical problems.

In order to achieve the above purpose, the technical scheme of the invention is realized as follows:

a data layout and encoding method for a large-scale object storage system, comprising:

the client executes a Put command to upload the object file, the storage system receives the object file, logically modifies the size of the object file into an even number, and cuts the object file into a series of data blocks with geometric sequence sizes;

distributing the data blocks to different data nodes, and recording the distribution condition of the data blocks;

the storage system divides all data nodes into a plurality of local groups, and performs regenerative code encoding on data blocks in the data nodes in the local groups to generate local check blocks;

the storage system executes the regeneration code coding to all the data blocks to generate a global check block;

the client executes the Get command, reads data on the corresponding data node, and merges the data into the original object file; if the node where the data block is located has a fault, the data block is recovered through the local check block and then sent to the client;

and recovering through the global check block when the local check block cannot be recovered.

Further, the process of logically modifying the object file size to an even number includes:

the storage system calculates the size of the object file and judges whether the value of the size of the object file is even;

if yes, dividing the object file into a series of data blocks with geometric sequence size according to a geometric dividing algorithm;

if not, the nearest even number larger than the size of the object file is taken as the logical size of the object file.
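The even-size rule above can be sketched in a few lines of Python. This is only a minimal illustration, assuming sizes are counted in bytes; the function name `logical_size` is hypothetical and does not come from the patent.

```python
def logical_size(size: int) -> int:
    """Return the logical size used for partitioning.

    An even size is kept as is; an odd size is rounded up to the
    nearest larger even number (assumption: sizes are in bytes).
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    return size if size % 2 == 0 else size + 1
```

Only the logical size changes here; the object's real content is untouched by this step.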

Further, the process of segmenting the object file into a plurality of geometrically sized data blocks includes:

the object file is filled according to the logical size and segmented into a series of geometrically sequence sized data blocks according to a geometric segmentation algorithm.

Further, the process of distributing the data blocks to different data nodes and recording the distribution condition of the data blocks comprises the following steps:

creating an idle disk list, distributing corresponding disk space for each data block according to a load balancing algorithm, and writing the sequence of the disk space into an index;

after the writing of the index is completed, updating the free disk list, and deleting the sequence of the allocated disk space;

each data block is stored into a corresponding disk space according to the index.
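The allocation steps above can be sketched as follows. This is a hedged illustration, not the patented load balancing algorithm: the `Slot` record, the `build_index` function, and the selection policy (prefer nodes not yet used by this object, then the tightest-fitting free slot) are all assumptions made for the example.

```python
from typing import List, NamedTuple, Set


class Slot(NamedTuple):
    """One entry of the free disk list (hypothetical structure)."""
    node_id: str
    disk_id: str
    capacity: int  # bytes available in this free slot


def build_index(block_sizes: List[int], free_slots: List[Slot]) -> List[Slot]:
    """Choose one free slot per data block and return the index.

    The free list is updated only after the whole index has been built,
    mirroring the order of the steps described in the text.
    """
    index: List[Slot] = []
    used_nodes: Set[str] = set()
    remaining = list(free_slots)
    for size in block_sizes:
        candidates = [s for s in remaining if s.capacity >= size]
        if not candidates:
            raise RuntimeError(f"no free slot can hold a block of {size} bytes")
        # Assumed policy: unused nodes first, then the smallest slot that fits.
        chosen = min(candidates, key=lambda s: (s.node_id in used_nodes, s.capacity))
        index.append(chosen)
        used_nodes.add(chosen.node_id)
        remaining.remove(chosen)
    # The index is complete: delete the allocated slots from the free list.
    for slot in index:
        free_slots.remove(slot)
    return index
```

Position i of the returned index gives the slot for block i, so a writer can then store each block by walking the index in order.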

Further, the storage system performs a regeneration code encoding on the data blocks in the data node, and the process of generating the local check block includes:

the client sends a command for requesting the storage system to generate check data to the storage system, and a background process of the storage system receives the command sent by the client;

the background process divides all the data nodes into a plurality of local groups according to the network topology distance among the data nodes;

the background process reads all data blocks on the data nodes in each local group and generates a local check block by using the regenerated code;

the local check blocks are stored on local check nodes within the corresponding local group.
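The grouping and local check block steps can be illustrated with a deliberately simplified sketch. Two substitutions are made and should be read as assumptions: a rack label stands in for network topology distance, and a plain XOR parity stands in for the regenerating code, which is considerably more involved. The function names `group_by_rack` and `xor_parity` are hypothetical.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, List


def group_by_rack(node_racks: Dict[str, str]) -> Dict[str, List[str]]:
    """Form local groups from nodes that share a rack label.

    Assumption: sharing a rack is used as a proxy for small topology distance.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, rack in sorted(node_racks.items()):
        groups[rack].append(node)
    return dict(groups)


def xor_parity(blocks: List[bytes]) -> bytes:
    """Compute one check block over equally sized blocks.

    XOR is a stand-in for the regenerating code, not the patented encoding.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks in a stripe must have the same size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))
```

With XOR, a single lost block in a group is rebuilt by applying `xor_parity` to the check block and the surviving blocks; a real regenerating code would instead download a small amount of data from each helper node.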

Further, the process of performing the code regeneration coding on all the data blocks to generate the global check block includes:

and the background process of the storage system reads all data blocks on all data nodes, generates a global check block by using the regeneration code, and stores the global check block on a global check node in the storage system.

Further, the process of the client executing the Get command to read the object file, decoding and merging the object file and then sending the decoded object file to the client includes:

the client sends a Get command to request to read the object file, and the storage system receives the Get command and inquires metadata to obtain a data node where the data block of the object is located;

when all the data nodes are normal, reading the data blocks of the object from each data node in parallel;

in the parallel process of reading the data blocks of the object from each data node, calculating and checking the integrity of the data blocks, merging and sequencing all the data blocks to obtain a complete object file, and sending the complete object file to the client;

further, when the data node fails, the storage system queries a local group where the failed data node is located, and rebuilds a failed data block on the failed data node by using data in the check node and data in other healthy nodes in the local group;

while the reconstruction is in progress, the object's data blocks stored on the other healthy data nodes are read in parallel;

and merging and sequencing the reconstructed data blocks and the data blocks on the healthy data nodes to obtain a complete object file and sending the complete object file to the client.

Further, if the local group cannot reconstruct the data block, reconstructing the failed data block by using the global check block;

reading the reconstructed data block, merging the reconstructed data block with the data block read on the normal node, and returning to the object file;

and executing reconstruction operation on a plurality of failed data blocks on the failed data node in parallel.

Further, a manual fault node data recovery flow is set in the client, and the working process comprises the following steps:

the user manually opens a recovery flow, and the storage system inquires the number of normal nodes in a local group where the fault node is located;

if the number of the normal nodes in the local group meets the recovery requirement, reading the data blocks of the normal nodes and recovering the data by using the local check blocks;

if the number of normal nodes in the local group does not meet the recovery requirement, the storage system uses the global check block to recover the data.
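The choice between local and global recovery can be sketched as a small decision function. This is an illustration under assumptions: the "recovery requirement" is modeled as a maximum number of failed nodes per local group (`local_tolerance`, a hypothetical parameter), and global check nodes are not modeled separately.

```python
from typing import List, Set, Tuple


def plan_recovery(
    failed_node: str,
    local_groups: List[List[str]],
    down_nodes: Set[str],
    local_tolerance: int = 1,
) -> Tuple[str, List[str]]:
    """Return ("local" or "global", helper nodes) for rebuilding failed_node."""
    group = next((g for g in local_groups if failed_node in g), None)
    if group is None:
        raise KeyError(f"{failed_node} is not in any local group")
    down_in_group = [n for n in group if n in down_nodes or n == failed_node]
    if len(down_in_group) <= local_tolerance:
        # Enough normal nodes remain: repair inside the local group.
        return "local", [n for n in group if n not in down_in_group]
    # The local group has lost its fault tolerance: fall back to global repair.
    healthy = [
        n for g in local_groups for n in g
        if n not in down_nodes and n != failed_node
    ]
    return "global", healthy
```

In the local case only the members of one group are contacted, which is where the reduced repair traffic described in the text comes from.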

Compared with the prior art, the data layout and encoding method of the large-scale object storage system has the following beneficial effects:

1) The object is divided into a series of data blocks whose sizes follow a geometric sequence, and read amplification is relieved through the smallest block in the sequence. An even-sized object is divided directly according to the geometric partitioning algorithm; an object that is not even-sized is first logically padded to an even size, so that it too can be divided exactly into a series of geometric-sequence data blocks. This avoids the additional overhead caused by hybrid coding and fully exploits the high recovery performance of the regenerating code.

2) All data blocks of the object are discretely placed on disks of different nodes, so the failure of a single node damages only part of the object's data and the repair cost of a degraded read is lower.

3) All data nodes in the system are divided into a plurality of local groups according to distance. Local check blocks are generated for the data nodes in each local group and stored within that local group; global check blocks are generated for the data blocks on all data nodes and stored on the global check nodes. The repair operation of a degraded read then only needs to be performed within the local group.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:

FIG. 1 is a schematic diagram of the overall structure of a large-scale object storage system based on a discrete geometry encoded data layout according to an embodiment of the present invention;

FIG. 2 is a flowchart of a client executing Put command to place an object according to an embodiment of the present invention;

FIG. 3 is a flow chart of a client executing the GenerateParity command to generate parity according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of a client executing the Get command to read an object according to an embodiment of the present invention;

FIG. 5 is a schematic flow chart of a client executing the Recovery command to recover data on a failed node according to an embodiment of the present invention.

Detailed Description

It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.

The invention will be described in detail below with reference to the drawings in connection with embodiments.

S1, a client executes a Put command to upload an object file, a storage system receives the object file, logically modifies the size of the object file into an even number, and cuts the object file into a plurality of data blocks with a series of geometric sequence sizes; distributing the data blocks to different data nodes, and recording the distribution condition of the data blocks;

S2, the storage system divides all the data nodes into a plurality of local groups, and performs regenerating code encoding on data blocks in the data nodes in the local groups to generate local check blocks; the storage system performs regenerating code encoding on all the data blocks to generate a global check block;

S3, the client executes the Get command, reads data on the corresponding data node, and merges the data into the original object file; if the node where the data block is located has a fault, the data block is recovered through the local check block and then sent to the client; and recovering through the global check block when the local check block cannot be recovered.

The process of logically modifying the object file size to an even number in step S1 includes:

The process of segmenting the object file into a plurality of data blocks of a series of geometric sequence sizes in step S1 includes:

filling the object file according to the logic size, and dividing the object file into a series of data blocks with the geometric sequence size according to a geometric dividing algorithm;

when the data blocks are distributed to different data nodes for storage, the data blocks do not contain filled contents.

The process of distributing the data blocks to different data nodes and recording the distribution condition of the data blocks comprises the following steps:

In step S1, the geometric partitioning algorithm cuts an even-sized object exactly into a series of data blocks whose sizes follow a geometric sequence. This matches the block-level coding granularity of the regenerating code, so its recovery performance can be fully used during recovery operations;

at the same time, because the data blocks of the object follow a geometric sequence, read amplification can be relieved through the smallest block in the sequence, which reduces degraded-read latency;

by contrast, for an object whose size is not even, splitting it into a series of geometric-sequence data blocks would first require cutting off from the head of the object a portion that is too small to form the coding granularity of the regenerating code.

By logically treating objects of non-even size as even-sized when dividing them, each object can be divided exactly into a series of geometric-sequence data blocks. Unlike other methods, no hybrid coding of erasure codes and regenerating codes is needed to eliminate the read amplification caused by the head data of the object, so the recovery throughput of the system is improved while read amplification is still eliminated;

the data blocks of the object are discretely placed on the magnetic disks of different nodes, so that the reliability and the parallelism of the object can be improved;

with this layout, when one node is down, an object being read loses only part of its data instead of the whole object, as it would under a continuous layout; during a degraded read, only some of the object's data blocks need to be repaired, while the data blocks on healthy nodes are read out in parallel.

The characteristic of discrete placement of the data blocks reduces the repair cost, improves the parallelism and reduces the degradation reading delay.

In step S2, the storage system performs a regenerative code encoding on the data block in the data node, and the process of generating the local check block includes:

In step S2, the process of performing the code regeneration encoding on all the data blocks to generate the global parity check block includes:

In step S2, the DG-LRCs encoding scheme is used to improve the locality of the regenerating code (the DG-LRCs encoding scheme is the whole process of generating the local check blocks and the global check blocks). For a recovery operation, when one node is down, only the data of the remaining nodes in the local group needs to be read and decoded to recover the data, which reduces the recovery cost and improves the recovery throughput of the system;

for degraded read operations, the discrete placement of data blocks reduces the amount of data that must be repaired and improves parallelism, but it also requires more I/O operations, which increases the overall degraded-read latency for small objects;

in the DG-LRCs coding scheme, the repair operation of a degraded read can be completed within a single local group, so reducing the number of I/O operations and the amount of data required for repair further reduces degraded-read latency and improves recovery efficiency.

In step S3, the client executes the Get command, reads data on the corresponding data node, and merges the data into the original object file, which includes:

the client sends a Get command to request to read the object file, and the storage system receives the Get command and inquires metadata to obtain a data node where a data block to which the object belongs is located;

and in the parallel process of reading the data blocks of the object from each data node, calculating and checking the integrity of the data blocks, merging and sequencing all the data blocks to obtain a complete object file, and sending the complete object file to the client.
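The normal read path described above (parallel reads, integrity check, ordered merge) can be sketched as follows. The index layout, the use of SHA-256 as the integrity check, and the `read_block` callback are assumptions for illustration; the patent does not specify them.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical index entry: (block number, node id, expected SHA-256 hex digest)
IndexEntry = Tuple[int, str, str]


def read_object(index: List[IndexEntry], read_block: Callable[[str, int], bytes]) -> bytes:
    """Read all blocks in parallel, verify each one, and merge them in order."""

    def fetch(entry: IndexEntry) -> Tuple[int, bytes]:
        block_no, node_id, digest = entry
        data = read_block(node_id, block_no)
        # Integrity is checked while the other reads are still in flight.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"block {block_no} on {node_id} failed its integrity check")
        return block_no, data

    with ThreadPoolExecutor(max_workers=max(1, len(index))) as pool:
        blocks = dict(pool.map(fetch, index))
    # Sort by block number so the merged bytes match the original object.
    return b"".join(blocks[no] for no in sorted(blocks))
```

Sorting by block number before joining is what allows the index entries to arrive in any order.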

When a data node fails, the storage system queries a local group where the failed data node is located, and rebuilds a failed data block on the failed data node by using data in check nodes and data in other healthy nodes in the local group;

In step S3, if the local group cannot reconstruct the data block, reconstructing the failed data block by using the global check block;

and executing data restoration and data transmission operations on a plurality of failed data blocks on the failed data nodes in parallel through a pipeline technology.
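One way to picture the pipelined repair is a generator that hands back repaired blocks smallest first, so that transmission of an early block can overlap with the repair of later ones. This is a sketch under assumptions: the `repair` callback, the two-worker pool, and the smallest-first ordering are illustrative choices, not details taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Tuple


def pipelined_repair(
    lost_blocks: List[Tuple[int, int]],
    repair: Callable[[int], bytes],
) -> Iterator[Tuple[int, bytes]]:
    """Yield (block number, repaired data), starting with the smallest block.

    lost_blocks holds (block number, size) pairs for the failed node.
    """
    ordered = sorted(lost_blocks, key=lambda block: block[1])
    with ThreadPoolExecutor(max_workers=2) as pool:
        # All repairs are queued up front; later repairs keep running while
        # the caller is busy sending a block that has already been yielded.
        futures = [(no, pool.submit(repair, no)) for no, _size in ordered]
        for no, future in futures:
            yield no, future.result()
```

Starting with the smallest block keeps the time to first byte short, which is the role the text assigns to the small first block of the geometric sequence.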

In step S3, under a discrete geometrical layout, there may be multiple data blocks belonging to the object on a single node;

when the node is down, the method repairs the data blocks in a pipelined, parallel manner and uses the geometric property to eliminate read amplification, thereby reducing degraded-read latency;

however, the discrete layout results in more I/O operations during degraded reads, which can increase the overall degraded-read latency for small objects; DG-LRCs reduces these extra I/O operations by repairing lost data blocks within the local group.

When multiple nodes are down and a local group loses its fault tolerance, DG-LRCs degrades into a mode in which the regenerating code is combined with the discrete geometric layout, but its degraded-read performance is still better than that of a regenerating code combined with the existing continuous or striped layouts.

S4, setting a manual fault node data recovery flow in the client, wherein the working process comprises the following steps:

In step S4, in the best case, the local group where the failed node is located has no other damaged nodes, and DG-LRCs can repair the lost data by reading only the data of the nodes in that local group, which greatly reduces the repair cost and improves recovery throughput;

in the worst case, the number of failed nodes exceeds the fault tolerance of the local group, so DG-LRCs degrades into a mode in which the regenerating code is combined with the discrete geometric layout; even then, its recovery performance is better than that of a regenerating code combined with the existing continuous or striped layouts.

Working process:

As shown in FIG. 1, when a client executes a Put command to place a new object, the first step is to divide the object into a series of data blocks whose sizes follow a geometric sequence, and the second step is to discretely place the data blocks in corresponding containers on different disks.

For example, each node has a container for data blocks of size 4M. Data blocks belonging to the same object are placed on different disks, and containers of the same size on each disk form a stripe.

The client then executes the GenerateParity command. Using the DG-LRCs coding scheme, a background process divides the data nodes into a plurality of local groups according to distance, generates local check blocks for the data nodes in each local group and stores them on the corresponding local check nodes, and generates global check blocks for the data blocks on all data nodes and stores them on the global check nodes.

When no node has failed, the object's data blocks are read from the disks of the corresponding data nodes, merged into the original object, and sent to the client.

When the node fails, a degraded read operation is triggered. If the damaged data can be repaired through the local group, the needed data is read in the nodes of the local group, the damaged data block of the object is decoded and repaired (if a plurality of data blocks of the object exist on the downtime node, the pipeline operation can be automatically started), and meanwhile, other data blocks of the object are read from the healthy nodes in parallel, and all the data blocks are combined into the original object and sent to the client.

If the corrupted data cannot be repaired by the local group, the corrupted data block is repaired by the global parity block.

When a node fails, the client restores the data node by executing the Recovery command and writes the restored data to a new node. The system first tries to read the data of the healthy nodes in the local group and decode it for recovery.

If a plurality of nodes in the local group are down and the fault tolerance of the local group is exceeded, damaged data blocks are repaired through the global check block.

As shown in fig. 2: the client executes Put command to place the object:

firstly, logically constructing an index according to the size of an object, and if the object is of even size, directly dividing the object into a series of data blocks of geometric sequence size according to a geometric division algorithm;

if the object is not even in size, logically filling the object into even numbers, and then dividing the object into a series of data blocks with geometric sequence size according to a geometric division algorithm;

then, distributing disk IDs of different nodes to the data blocks of the object through a load balancing algorithm, and completing index construction;

the data blocks of the object are discretely placed on the disks of the corresponding nodes.

As shown in fig. 3: the client executes the GenerateParity command to generate a parity check:

generating parity check by using DG-LRCs coding scheme to the object on the disk through background process to improve the reliability of the system;

dividing all data nodes in a system into a plurality of local groups according to distances, generating local check blocks for the data nodes in each local group, and storing the local check blocks into corresponding local check nodes;

and generating global check blocks for the data blocks on all the data nodes and storing the global check blocks into the global check nodes.

As shown in fig. 4: the client executes Get command to read the object:

when no node fails, reading the object data block from the disk of the corresponding data node, merging the object data block into an original object, and sending the original object to the client, and when the node fails, triggering a degradation reading operation;

if the damaged data can be repaired through the local group, reading the needed data in the nodes of the local group and decoding to repair the damaged data blocks of the object (if a plurality of data blocks of the object exist on the downtime node, automatically starting the pipeline operation);

at the same time, the other data blocks of the object are read from the healthy nodes in parallel, and all the data blocks are merged into the original object and sent to the client;

As shown in fig. 5: after the node fails, the client executes a Recovery command to recover the data on the failed node;

first, the system tries to read the data of the healthy nodes in the local group and decode it for recovery;

The geometric partitioning algorithm can be expressed by a formula with two predefined parameters, s₀ and q: s₀ is the initial value of the geometric sequence, i.e., the minimum bucket size, and q is the common ratio of the sequence. Both parameters are set after measurement in the actual production environment;

S is the size of the object and r = S mod s₀. Unlike the first-generation algorithm, a_i here refers to the number of data blocks of the same size, and n is the number of blocks into which the object is divided.
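A form of the decomposition that is consistent with these definitions, offered as an assumption rather than as the patent's own formula, writes the object size as a sum of geometric block sizes plus a remainder. The index bound m (the number of distinct block sizes) is a hypothetical symbol introduced here only for illustration:

```latex
S = \sum_{i=0}^{m-1} a_i \, s_0 \, q^{i} + r,
\qquad r = S \bmod s_0,
\qquad n = \sum_{i=0}^{m-1} a_i
```

Under this reading, r = 0 whenever S is a multiple of s₀, in which case the object splits exactly into geometric-sequence blocks with no leftover head data.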

A larger s₀ enlarges the block sizes and thus increases the average disk read bandwidth; however, a larger s₀ may also increase the pipelining overhead and therefore the degraded-read latency, because the pipeline needs a small first block to start;

a smaller q reduces the size ratio between different blocks and thereby improves pipeline efficiency, at the cost of smaller sub-blocks overall and lower recovery efficiency.

To improve encoding and decoding efficiency and eliminate read amplification, the size of a large object is logically treated as an even number before geometric partitioning is carried out;

to adapt the geometric partitioning algorithm to the discrete geometric layout, a threshold parameter is added as a boundary to distinguish large objects from small ones, since logical padding is applied only to objects larger than s₀;

the geometric partitioning algorithm then updates the size value of the object to its nearest even size.

A MaxBlockSize parameter is added to the geometric partitioning algorithm so that the largest block into which an object may be split can be specified flexibly according to actual production conditions. For scenarios with large objects, MaxBlockSize is set smaller so that more blocks of size MaxBlockSize are produced;

data blocks of similar sizes improve pipeline efficiency during degraded reads, and because the blocks are still relatively large, the high recovery efficiency of the regenerating code can be fully exploited, yielding low degraded-read latency.
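As an illustration of the partitioning described above, here is a minimal Python sketch. It is not the patent's own pseudocode: the greedy strategy (one block of each size from smallest to largest, then filling the rest from the largest size down), the function name `geometric_partition`, and the parameter name `max_block_size` (standing in for MaxBlockSize) are assumptions. The function returns the counts a_i for block sizes s₀·q^i together with the remainder r = S mod s₀.

```python
from typing import List, Optional, Tuple


def geometric_partition(
    size: int, s0: int, q: int, max_block_size: Optional[int] = None
) -> Tuple[List[int], int]:
    """Split `size` into counts[i] blocks of s0 * q**i plus a remainder r."""
    if size < 0 or s0 <= 0 or q < 2:
        raise ValueError("require size >= 0, s0 > 0 and an integer q >= 2")
    r = size % s0
    remaining = size - r

    # Candidate block sizes: s0, s0*q, s0*q^2, ... up to the optional cap.
    sizes: List[int] = []
    block = s0
    while block <= remaining and (max_block_size is None or block <= max_block_size):
        sizes.append(block)
        block *= q
    counts = [0] * len(sizes)

    # Pass 1 (assumed): one block of each size, smallest first, so that a
    # small block is available to start the repair pipeline.
    for i, block in enumerate(sizes):
        if block > remaining:
            break
        counts[i] += 1
        remaining -= block

    # Pass 2 (assumed): fill what is left greedily from the largest size down.
    for i in range(len(sizes) - 1, -1, -1):
        extra, remaining = divmod(remaining, sizes[i])
        counts[i] += extra
    return counts, r
```

For S = 100, s₀ = 4 and q = 2 the sketch returns counts [1, 2, 1, 2, 0] with r = 0, that is 4 + 2·8 + 16 + 2·32 = 100. Capping the block size at 16 gives [1, 2, 5] instead, which shows how a smaller MaxBlockSize produces more blocks of the maximum size.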

The pseudo code of the geometric partitioning algorithm is as follows:

those of ordinary skill in the art will appreciate that the elements and method steps of each example described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the elements and steps of each example have been described generally in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in this application, it should be understood that the disclosed methods and systems may be implemented in other ways. For example, the above-described division of units is merely a logical function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The units may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. A method for data placement and encoding for a large-scale object storage system, comprising:

recovering through the global check block when the local check block cannot be recovered;

the process of splitting an object file into a series of geometrically sized data blocks includes:

the storage system performs a regenerative code encoding on a data block in a data node, and the process of generating a local check block includes:

storing the local check blocks on local check nodes in the corresponding local groups;

geometric partitioning algorithm:

wherein s₀ is the initial value of the geometric sequence, q is the common ratio of the sequence, and the two parameters are set after measurement in the actual production environment;

S is the size of the object, r = S mod s₀, a_i is the number of data blocks of the same size, and n is the number of blocks into which the object is divided;

performing a regeneration code encoding on all the data blocks, the process of generating a global parity block comprising:

the background process of the storage system reads all data blocks on all data nodes, generates global check blocks by using the regeneration code codes, and stores the global check blocks on global check nodes in the storage system;

logically modifying the object file size to an even number includes:

2. The method for data placement and encoding of a large-scale object storage system as defined in claim 1, wherein:

3. The method for data placement and encoding of a large-scale object storage system as defined in claim 1, wherein:

the client executes the Get command, reads data on the corresponding data node, and merges the data into the original object file, wherein the process of merging the data into the original object file comprises the following steps:

4. A data layout and encoding method for a large-scale object storage system according to claim 3, wherein:

when a data node fails, the storage system queries a local group where the failed data node is located, and rebuilds a failed data block on the failed data node by using data in check nodes in the local group and data in other healthy nodes in the local group;

5. The method for data placement and encoding of a large-scale object storage system as defined in claim 4, wherein:

if the local group cannot reconstruct the data block, reconstructing the failed data block by using the global check block;

6. The method for data placement and encoding of a large-scale object storage system as defined in claim 1, wherein:

the method also comprises the step of setting a manual fault node data recovery flow in the client, wherein the flow is as follows: