CN111488238B - Block storage node data restoration method and storage medium - Google Patents

Block storage node data restoration method and storage medium

Info

Publication number
CN111488238B
CN111488238B (application CN202010588697.4A)
Authority
CN
China
Prior art keywords
repair
data
node
page
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010588697.4A
Other languages
Chinese (zh)
Other versions
CN111488238A (en)
Inventor
邱重阳
童颖睿
陈靓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Peng Yun Network Technology Co ltd
Original Assignee
Nanjing Peng Yun Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Peng Yun Network Technology Co ltd
Priority to CN202010588697.4A
Publication of CN111488238A
Application granted
Publication of CN111488238B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging

Abstract

The invention discloses a data repair method for a block storage node. A node under fault repair sends a request to start data repair to the master node, and the master node accepts the request and returns its latest log ID to the node under fault repair. The node under fault repair then synchronizes logs with the master node and, according to the state of the synchronized logs, marks whether each page needs to be repaired. For pages marked as needing repair, the node under fault repair registers with the QOS controller and applies for the number of pages it may repair, then sends data repair requests to the master node to repair the data. The invention also provides a storage medium. With the method, client read-write service is not interrupted while data is being repaired.

Description

Block storage node data restoration method and storage medium
Technical Field
The present invention belongs to the field of distributed storage, and more particularly relates to a block storage node data repair method and a storage medium.
Background
With the rapid development of the internet and the arrival of the big-data era, enterprises depend more and more on storage, while large numbers of high-end hosts and traditional storage arrays are very expensive; low-end blade servers and cheap disks, combined with distributed storage software, have therefore become the preferred storage architecture for more and more enterprises. At the scale of large data-storage clusters, host failures and disk failures of storage nodes are not sporadic events but the norm. How to provide a highly available and highly secure storage service under such normalized hardware failures is a problem every distributed storage provider must consider.
Current repair techniques for distributed data storage include copy-based repair, erasure-code-based repair, and router-acceleration-based repair.
Copy-based data repair: each storage node stores a copy of the source file. When a copy is lost or damaged, the system must build a new copy, so it selects a storage node as the newNode; the newNode receives data from at least one storage node, and a node that provides data to the newNode is called a provider. During repair the newNode can obtain the data from any single provider, or download from multiple providers in parallel to reduce transmission time.
The disadvantages of this technique are: each storage node must store a full file copy, so the nodes hold a large amount of data, the storage redundancy is high, and a large amount of storage resources is wasted; the repair time is long, because the whole file must be transmitted, which also occupies a large amount of network bandwidth.
Erasure-code-based data repair: the source file is encoded before being stored on the storage nodes. The whole file is divided into k blocks, which are encoded into n coded blocks; any k of the n coded blocks can restore the source file, and each storage node stores one coded block. During repair, the newNode must download coded blocks from at least k providers and re-encode the received blocks to obtain a new coded block.
However, erasure codes have a drawback when repairing a damaged data node: repairing a data block of size M1 requires downloading k blocks, k × M1 of data in total, from k different nodes over the network, which makes repair expensive in bandwidth.
Router-acceleration-based data repair improves repair efficiency, but because all repair management is still handled by the management nodes, the management nodes carry a heavy load and certain demands are placed on router performance and functionality.
The method currently used to repair damaged data nodes is mainly as follows: physically isolate the storage node whose data is damaged, identify the damaged portion within it, and overwrite the originally stored data in the damaged portion; if the write succeeds, the damaged portion is considered repaired.
While the damaged portion is being repaired, the storage node containing it must be isolated from the system and repaired on its own. During that time the distributed storage system cannot answer client read requests; only after the repair succeeds can client requests be served normally again. In other words, the distributed storage system suffers a service interruption during data repair, which degrades its service quality.
The present invention improves the copy-based data repair technique in order to overcome this service interruption during data repair in the prior art.
Disclosure of Invention
1. Problems to be solved
In the copy-based data repair technique of the prior art, a node being repaired must handle both data repair and external service at the same time, and the two easily interfere with each other, which causes the following three problems:
(1) the data writes performed for repair reduce the read-write performance of the external service;
(2) data writes for repair easily conflict with data writes of the external service, which interrupts the external service;
(3) data repair and the external service may operate on the disk simultaneously, so the disk's performance cannot be used effectively and the repair process is drawn out.
To address these problems, the invention provides a block storage node data repair method and a storage medium.
2. Technical solution
In order to solve the above problems, the technical solution adopted by the invention is as follows. A block storage node data repair method comprises the following steps:
S1, the node under fault repair sends a request to start data repair to the master node, and the master node accepts the request and returns its latest log ID to the node under fault repair;
S2, the node under fault repair performs log synchronization with the master node and, according to the state of the synchronized logs, marks whether the current page needs to be repaired:
if the log ID of Page_R is less than or equal to the ID of the latest log of the master node, the log of Page_R is directly discarded and Page_R is marked as needing repair;
if the log ID of Page_R is greater than the ID of the latest log of the master node, then:
a. if the log of Page_R can cover the data of the whole page, the data of the log of Page_R is written into the page and Page_R is marked as having completed repair;
b. if the log of Page_R cannot cover the whole page of data and Page_R has not completed repair, the log data of Page_R is directly discarded and the client is informed, through the data distribution module, that the repair is completed; Page_R is then set to resume log synchronization after the ID of the latest log of the master node reaches Max Log ID, where Page_R denotes a page of the node under fault repair and Max Log ID is the log ID of Page_R;
S3, data repair is performed on the pages marked as needing repair.
When repairing data, this technical solution repairs in units of the minimum IO read unit, the page, which guarantees the accuracy of the repaired data. If the content of a log cannot cover a whole page, i.e. that page cannot be repaired for the moment, the node under fault repair discards the log content and informs the client that the repair is complete, so that the client's read-write service is not interrupted during data repair; the actual data repair is postponed until the master node's latest log is at least as new as the discarded log, which guarantees the correctness of the repaired data. A minimal sketch of this per-page decision is given below.
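The following is a minimal sketch of the step S2 decision for a single Page_R, written in Python under simplifying assumptions; catch_up_log_id is the latest log ID returned by the master node in step S1, and all field and function names (max_log_id, logs_cover_whole_page, discard_logs, and so on) are illustrative rather than part of the invention.

    def sync_page(page_r, catch_up_log_id, notify_client_repair_done):
        # Step S2 decision for one page of the node under fault repair (sketch).
        if page_r.max_log_id <= catch_up_log_id:
            # The master already holds newer data: drop the local logs and
            # copy the page from the master later.
            page_r.discard_logs()
            page_r.state = "NEED_REPAIR"
        elif page_r.logs_cover_whole_page():
            # The local logs alone rebuild the whole page: apply them directly.
            page_r.apply_logs()
            page_r.state = "REPAIR_DONE"
        elif page_r.state != "REPAIR_DONE":
            # Logs are incomplete: drop them, acknowledge the client through the
            # data distribution module, and resume synchronization once the
            # master's latest log ID has reached this page's Max Log ID.
            max_log_id = page_r.max_log_id
            page_r.discard_logs()
            notify_client_repair_done(page_r)
            page_r.state = "NEED_REPAIR"
            page_r.resume_sync_after_log_id = max_log_id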
Further, step S3 comprises the following steps:
S31, the node under fault repair applies to the QOS controller for registration; if no other data segment unit is currently being repaired, the node under fault repair is allowed to register, otherwise registration is refused;
S32, the successfully registered node under fault repair applies to the QOS controller for a repair data amount, and the QOS controller determines the repair data amount allocated to the node according to the current disk idle rate and disk throughput;
S33, the QOS controller adjusts the repair data amount according to the disk idle rate and disk throughput.
Having the QOS controller control and adjust the repair data amount yields a good repair rate. A sketch of the registration rule follows.
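A minimal sketch of the registration rule in step S31 (Python; the class layout and member names are assumptions made for illustration, not the invention's actual implementation):

    class QOSController:
        """Allows at most one Segment Unit per disk to repair data at a time."""

        def __init__(self):
            self.repairing_unit_by_disk = {}  # disk id -> segment unit id

        def register(self, disk_id, segment_unit_id):
            # Refuse registration while another unit on this disk is repairing.
            if disk_id in self.repairing_unit_by_disk:
                return False
            self.repairing_unit_by_disk[disk_id] = segment_unit_id
            return True

        def unregister(self, disk_id, segment_unit_id):
            # Called when the Segment Unit finishes its repair.
            if self.repairing_unit_by_disk.get(disk_id) == segment_unit_id:
                del self.repairing_unit_by_disk[disk_id]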
Further, step S32 comprises the following steps:
a. if there is currently no client read-write service, the QOS controller sets the repair data amount to disk throughput / P, where P is the page size;
b. if client read-write service exists, the QOS controller further judges whether the disk currently has idle resources;
c. if there are no idle resources, the repair data amount is set to a preset value;
d. if there are idle resources, it is further judged whether this is the first repair application of the node under fault repair; if so, the repair data amount of the node under fault repair is set to the preset value;
if not, the QOS controller determines the repair data amount of the current application according to the amount of the previous application and the disk throughput.
Because the QOS controller determines the repair data amount allocated to the node under fault repair from the current disk idle rate and disk throughput, a better repair rate is obtained while the utilization of the disk is preserved.
Further, if the disk currently has idle resources and this is not the first repair application of the node under fault repair, the QOS controller determines the repair data amount according to the current disk utilization.
Further, if the disk utilization is less than or equal to 75%, the applied repair data amount is 1M × (1 + (1 - disk utilization)) / P; if the disk utilization is greater than 75%, the applied repair data amount = Max(1M/P, 1M × (1 - disk utilization) / P). These rules are summarized in the sketch below.
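The allocation rules of steps S32 a-d and the thresholds above can be gathered into the following sketch (Python). The function signature, parameter names and the integer rounding are assumptions made for illustration; the 1M base amount, the 75% threshold and the formulas follow the text above.

    ONE_MB = 1 << 20

    def repair_amount_pages(page_size, disk_throughput, has_client_io,
                            has_idle_resources, first_application,
                            disk_utilization, preset_pages):
        # Number of pages granted for one repair application (sketch).
        if not has_client_io:
            # No client read-write service: grant the whole disk throughput.
            return disk_throughput // page_size
        if not has_idle_resources or first_application:
            # No idle resources, or the first application: use the preset value.
            return preset_pages
        # Later applications: scale with the current disk utilization.
        if disk_utilization <= 0.75:
            return int(ONE_MB * (1 + (1 - disk_utilization))) // page_size
        return max(ONE_MB // page_size,
                   int(ONE_MB * (1 - disk_utilization)) // page_size)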
Further, step S33 comprises:
a. if the data is being repaired with the preset repair data amount and the disk idle rate is greater than or equal to 50%, the repair data amount is increased;
b. if client requests increase during the repair, the QOS controller limits the repair data amount of subsequent applications.
During data repair, the QOS controller thus adjusts the repair data amount, improving the repair rate while preserving the utilization of the disk.
The invention also provides a storage medium on which a computer program is stored, which when executed, implements the method described above.
3. Advantageous effects
Compared with the prior art, the invention has the following beneficial effects:
(1) during data repair, the client's external read-write service is not interrupted and the client's read-write speed is not affected, providing better availability;
(2) the invention uses the QOS controller to adjust the repair data amount, so that the applied amount better matches the repair rate sustainable at the current disk throughput, improving data repair efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of parallel access of a data repair flow and a client write flow in the present invention;
FIG. 3 is a flow chart of the QOS controller operation in the present invention.
Detailed Description
The invention is further described with reference to specific examples.
As is known, in order to improve data security, distributed block storage keeps multiple backups of each piece of data; common backup strategies include two backups, three backups, five backups, and so on. Data nodes are generally divided into master nodes and backup nodes, and a master node together with its backup nodes forms a node group that provides distributed storage service externally.
The invention is a solution proposed on the premise of multiple backups: any quorum of the backups (in general, a quorum means most of them, i.e. more than half) can form complete data, even if one backup is in the Joining state of fault repair. It should be noted that in the two-backup case, the Joining backup under fault repair does not hold the full data, so until that node is fully repaired the system's fault tolerance is reduced to the point that the master node must not fail, otherwise normal service cannot be provided. For clarity, this embodiment is described using the three-backup strategy as an example.
The whole block storage system is divided into a driver module, a data distribution module (Coordinator) and a data storage module (DataNode). The driver module is the external interface provided by the block storage; a user can access the block storage system through this interface, and standard block storage access services such as iSCSI (Internet Small Computer System Interface) can be provided on top of it. The Coordinator is mainly responsible for data distribution: it receives data requests from the driver module, distributes client data to all backup nodes, i.e. to the different Segment Units under a Segment, and returns the result to the driver module once the distributed-consistency requirements are met. The DataNode is responsible for data storage, i.e. writing data to disk, and also manages the disk and the position indexes of the data. A rough outline of the three modules is sketched below.
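A rough outline of the three modules' responsibilities (Python; these interface and method names are purely illustrative assumptions drawn from the description above):

    class DriverModule:
        """External block-storage interface; iSCSI can be layered on top."""
        def write(self, volume_id, offset, data): ...
        def read(self, volume_id, offset, length): ...

    class Coordinator:
        """Data distribution: fans each client write out to every Segment Unit
        of the target Segment and answers once consistency requirements hold."""
        def distribute_write(self, segment_id, offset, data): ...

    class DataNode:
        """Data storage: appends logs, flushes pages to disk, and maintains
        the disk/position indexes of the stored data."""
        def append_log(self, segment_unit_id, log): ...
        def flush_page(self, segment_unit_id, page_index): ...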
The invention divides the stored data into data Segments for management. According to the backup strategy, each Segment gives rise to several Segment Units, which are created under different DataNode nodes; the backups held by the Segment Units of a Segment guarantee the security of the data in the whole Segment.
A Segment is a data segment obtained by dividing the stored data; external writes are stored in Segments and must be backed up multiple times according to the backup strategy, and a Segment Unit is one such backup unit of a Segment. Under the three-backup strategy, one Segment comprises three Segment Units, one of which is the master node and the other two of which are backup nodes.
Data consistency among the Segment Units is guaranteed by the Paxos protocol (a consistency protocol based on message passing); other similar protocols could of course also be used. When all Segment Units of a Segment work normally, we call the Segment stable; when only a quorum of the Segment Units work normally, we call it available; otherwise it is unavailable.
In the available state the block storage service can still be provided normally, but the Segment no longer has highly available fault tolerance: if one more node fails, the data service is interrupted. It is therefore necessary to keep a data Segment in the stable state as far as possible, repairing failed nodes so that it returns to the stable state as soon as possible.
When a DataNode node returns to service after a failure, it rejoins a data Segment and becomes one of the Segment Units that make up that Segment. Because this Segment Unit has missed the data written during the failure, it must synchronize with the other members of the Segment to reach the latest state; its state at this time is called the Joining state of fault repair.
When a brand-new DataNode node is added, the block storage system balances the load of the nodes by creating a new Segment Unit on the new node to replace a Segment Unit on an old node. The newly created Segment Unit must synchronize data from the other members to reach the latest state, and its state at this time is also called the Joining state of fault repair.
The Joining state is thus the state assigned to a Segment Unit during repair; a Segment Unit undergoing data repair is called a Joining Segment Unit. Only a backup node can be in this state, so it is also called a backup node under fault repair. It should be noted that the state occurs only on backup nodes: this does not mean the master node cannot fail, but rather that after a failed master node is repaired it becomes a backup node, while a backup node that did not fail is promoted to master; hence the node performing fault repair is always a backup node.
The invention mainly solves the problem of keeping data highly available while a data Segment contains a Joining Segment Unit under fault repair, and, under the control of a QOS (Quality of Service) controller, restores the data of the failed or newly added backup node to the latest state as quickly as possible without affecting client service. Put simply, during data repair the block storage neither interrupts external service nor affects client IO throughput.
For ease of understanding, we introduce the concept of the Segment member list, Segment Membership. For a Segment containing a Segment Unit in the repair state, the Segment Membership consists of the master node, the backup nodes and the backup node under repair, together with their respective state information. Segment Membership is the data structure that stores the members of a Segment and their states. For example, under the three-backup strategy of the present invention a Segment normally has three members, one master node and two backup nodes; if one node is in the repair state, one member of the Segment becomes a node under fault repair, and all of this information is stored in the Segment Membership. In this case, any two of the nodes can still form complete data. A small data-structure sketch follows.
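The Segment Membership can be pictured as a small data structure such as the following (Python; the field and method names are illustrative assumptions based on the three-backup example):

    from dataclasses import dataclass, field

    @dataclass
    class SegmentMembership:
        """Members of a Segment and their states (three-backup example)."""
        primary: str                                             # master node unit
        secondaries: list = field(default_factory=list)          # normal backups
        joining_secondaries: list = field(default_factory=list)  # under repair

        def is_stable(self):
            # Stable: one master and two normal backups, none still joining.
            return len(self.secondaries) == 2 and not self.joining_secondaries

        def has_quorum(self):
            # More than half of the members hold complete, up-to-date data.
            total = 1 + len(self.secondaries) + len(self.joining_secondaries)
            normal = 1 + len(self.secondaries)
            return normal > total // 2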
A Segment Unit in the Joining state must accept new data writes and, at the same time, synchronize the data lost during the failure. Non-interruption of external service, quality-of-service control and similar issues must therefore be considered.
As shown in FIG. 1 and FIG. 2, the data repair process of the present invention is as follows:
1. and the node in fault repair initiates a data repair request to the main node according to the latest Log ID, namely the Log ID, owned by the node.
2. The main node returns to receive data restoration, and notifies the node in fault restoration of a latest Log IDLog ID, which is recorded as a Catch Up Log ID, wherein the Log Log is a method for writing recorded data in a storage system, and the Log Log is written first and then stores the data according to the content of the Log Log to avoid data abnormality, each Log has a Log ID, the Log ID is a natural number, the size of the Log ID is also the sequence of data writing, and the latest Log ID is also the largest, so the main node notifies the node in fault restoration of the latest Log ID of the current main node as a necessary step for data restoration, and the data restoration is performed according to the data of the Log.
3. The node under fault repair performs data Log synchronization with the master node, specifically as follows:
3.1 If the Log ID of the node under fault repair is less than or equal to the Catch Up Log ID, the Log data of the node under fault repair is discarded directly, and the Page where that Log resides is marked as needing to copy data from the master node. In this case the latest data stored on the master node is newer than the data recorded in the Log of the node under fault repair, so the Log data can safely be dropped and the Page copied from the master node afterwards.
3.2 If the Log ID of the node under fault repair is greater than the Catch Up Log ID, that Log ID is recorded as the Max Log ID. If the data held by the Log of the node under fault repair can cover the whole Page, the data is written into the Page (for convenience this Page of the node under fault repair is denoted Page_R) and the Page is marked as having completed repair, i.e. the Log data is written to disk. If the data in the Log is not enough to cover the whole Page and Page_R is marked as needing repair but not yet repaired, the Log data is discarded directly and the client is informed that Page_R has completed its repair; Page_R is then marked to wait until the Log ID of the master node reaches the Max Log ID, after which Log synchronization is performed again.
4. The node under fault repair repairs the Pages marked as needing repair one by one: it first applies to the QOS controller for registration and for a repair data amount, and then sends a data copy request to the master node to perform the repair.
5. When all Pages marked as needing to be copied have been repaired, data repair is marked as complete, the node under fault repair becomes a normal backup node, and the data Segment returns to the stable state.
During node repair, the node under fault repair must also accept writes of new data, so that it does not remain permanently in a state where it has to copy data from other nodes. A new write and the repair of data lost during the failure may involve the same Page; if the new write had to wait for that Page's repair to complete, it might take a long time to succeed, which would eventually interrupt the client's service.
The invention avoids interrupting client service during data repair by assigning each Page a repair state: repair completed, under repair, or to be repaired. A Page whose repair is completed accepts new writes normally. For a Page that is to be repaired or under repair, if a new write can cover the whole Page, the data lost during the earlier failure is considered overwritten and no longer needs repair, so the write is accepted directly and the Page is set to repair completed. If the new write cannot cover the whole Page, the node under fault repair discards the newly written content, and the DataNode indicates to the client, through the Coordinator, that the Page has received the new write, i.e. that the repair is complete, so that the client's read-write service is not interrupted; however, to keep the data consistent and correct, the Page must be set to the to-be-repaired state and must wait for the master node's latest Log to be written, i.e. the data repair is performed once the content can cover the whole Page. This write path is sketched below.
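A minimal sketch of this write path (Python; the page states, the write object and notify_client are illustrative assumptions, with notify_client standing for the acknowledgement returned through the Coordinator):

    def handle_new_write(page, write, notify_client):
        # New client write arriving at a page of the node under fault repair.
        if page.state == "REPAIR_DONE":
            page.apply(write)              # repaired pages accept writes normally
        elif write.covers_whole_page():
            # A full-page write supersedes whatever was lost during the failure.
            page.apply(write)
            page.state = "REPAIR_DONE"
        else:
            # Partial write on an unrepaired page: drop it locally, acknowledge
            # the client via the Coordinator, and repair later from the master
            # once its latest Log reaches this page's Max Log ID.
            notify_client(write, acknowledged=True)
            page.state = "TO_BE_REPAIRED"
            page.wait_for_master_log_id = write.log_id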
It should be noted that a Page is the minimum management unit of storage space in the storage system: a Segment Unit's storage space (for example 1G in this embodiment) is divided into N Pages, where N is a positive integer determined by the Page size (set to 8K here). Every Log belongs to exactly one Page, while one Page may have several Logs, so writing a Log means writing the Page, i.e. storing the data to disk. Before being written to disk, Logs are kept on a cache disk; the data repair of the present invention writes the data corresponding to the latest Logs from the cache disk to the data disk, as those skilled in the art will understand. If any Log of the node under fault repair needs repair, the whole Page must be repaired, i.e. all Logs belonging to that Page; the Max Log ID of the Page must therefore be submitted to the master node, meaning that data copying is performed only once the master node's Log is up to date.
During data repair, the DataNode performs effective flow control on the repair by monitoring the read-write throughput of the disk. The control flow of the QOS controller is as follows:
1. The Segment Unit of the data Segment that needs data repair applies to the QOS controller for registration.
2. The QOS controller checks whether the disk holding that Segment Unit already has a Segment Unit performing data repair; if so, the applying Segment Unit must wait for a period of time and register again, otherwise registration succeeds. This ensures that at any moment only one Segment Unit per disk performs data repair.
3. The successfully registered Segment Unit builds a list of the Pages to be copied, and before each Page repair request is sent to the master node it must apply to the QOS controller for a repair data amount; this determines how much repair data, i.e. how many Pages, may be sent to the master node at a time.
4. The QOS controller collects the disk throughput and the disk idle rate to set the repair data amount currently allowed, where disk throughput means the disk I/O traffic per second, i.e. the amount of data written to and read from the disk.
a) If external reads and writes exist, the repair data amount of a Segment Unit's first application is always a preset value; repairing at this value does not affect normal external reads and writes. In a concrete implementation the preset value of the first application can be set by the user as needed and is related to the disk's performance; in this embodiment it can be set to 1 MB/s.
b) If, while repairing at the preset repair data amount, the disk still has a high idle rate, the QOS controller can give more resources to the repair, i.e. increase the repair data amount.
c) If more client read-write requests arrive during this period, the QOS controller limits the repair data amount of later applications.
d) While external reads and writes are present, the QOS controller keeps the disk utilization caused by data repair within a certain range; in this embodiment the ratio can be set to 75% so as not to affect the client IO experience. Other values can of course be chosen as needed, but in general pushing disk utilization to 75% or more tends to affect the client IO experience.
e) If there is no external read-write request, the QOS controller can devote the disk resources entirely to data repair.
5. After a Segment Unit finishes its repair, it deregisters from the QOS controller, and another Segment Unit can then register and continue working.
For example, suppose the disk throughput is 200M/s and the minimum data amount granted for repair is 1M/8K pages. On the first application for a repair data amount, if no client is writing, 200M/8K pages are granted directly; otherwise 1M/8K pages are granted. On the second application, the grant depends on the disk utilization: if no client is writing, 200M/8K pages are again granted directly; if an external client is writing, the disk utilization is examined against 75%. If the utilization is less than or equal to 75%, the applied repair data amount is set to 1M × (1 + (1 - disk utilization)) / 8K; if it exceeds 75%, the applied amount is set to Max(1M/8K, 1M × (1 - disk utilization) / 8K), i.e. the greater of 1M/8K and 1M × (1 - disk utilization) / 8K. In this embodiment each Page is 8K for the purposes of explanation; in a concrete implementation the Page size can be set according to user requirements, for example 16K or another value.
Later applications repeat the same algorithm, so that over the course of the repeated applications to the QOS controller the repair data amount gradually converges to a value consistent with the current disk throughput, maintaining a good repair rate. A worked usage example with the figures above follows.
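As a usage example of the repair_amount_pages sketch given after step S32 (and under the same assumptions), the embodiment's figures, 200M/s throughput and 8K Pages with a 1M base amount, give the following:

    page = 8 * 1024                    # 8K pages
    throughput = 200 * (1 << 20)       # 200M/s disk throughput
    preset = (1 << 20) // page         # 1M/8K = 128 pages

    # No client writes: the whole throughput is granted, 200M/8K = 25600 pages.
    print(repair_amount_pages(page, throughput, has_client_io=False,
                              has_idle_resources=True, first_application=True,
                              disk_utilization=0.0, preset_pages=preset))

    # Client writes present, later application, 60% utilization (<= 75%):
    # 1M * (1 + 0.4) / 8K -> 179 pages.
    print(repair_amount_pages(page, throughput, has_client_io=True,
                              has_idle_resources=True, first_application=False,
                              disk_utilization=0.60, preset_pages=preset))

    # Client writes present, 90% utilization (> 75%):
    # Max(1M/8K, 1M * 0.1 / 8K) = 128 pages.
    print(repair_amount_pages(page, throughput, has_client_io=True,
                              has_idle_resources=True, first_application=False,
                              disk_utilization=0.90, preset_pages=preset))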
If the functions are implemented in the form of software functional units and used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention may be embodied as a software product stored in a storage medium and comprising instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (7)

1. A block storage node data repair method, characterized by comprising the following steps:
S1, the node under fault repair sends a request to start data repair to the master node, and the master node accepts the request and returns its latest log ID to the node under fault repair;
S2, the node under fault repair performs log synchronization with the master node and, according to the state of the synchronized logs, marks whether each page of the node under fault repair needs to be repaired:
if the log ID of Page_R is less than or equal to the ID of the latest log of the master node, the log of Page_R is directly discarded and Page_R is marked as needing repair;
if the log ID of Page_R is greater than the ID of the latest log of the master node, then:
a. if the log of Page_R can cover the whole Page, the data corresponding to the log of Page_R is written into the Page and Page_R is marked as having completed repair;
b. if the log of Page_R cannot cover the whole Page and the repair is not completed, the log data of Page_R is directly discarded and the client is informed that the repair is completed; Page_R is then set to perform data repair after the ID of the latest log of the master node reaches Max Log ID, where Page_R is a Page of the node under fault repair and Max Log ID is the log ID of Page_R;
S3, data repair is performed on the Pages marked as needing repair.
2. The block storage node data repair method of claim 1, characterized in that step S3 comprises the following steps:
S31, the node under fault repair applies to the QOS controller for registration; if no other data segment unit is currently being repaired, the node under fault repair is allowed to register, otherwise registration is refused;
S32, the successfully registered node under fault repair applies to the QOS controller for a repair data amount, and the QOS controller determines the repair data amount allocated to the node according to the current disk idle rate and disk throughput;
S33, the QOS controller adjusts the repair data amount according to the disk idle rate and disk throughput.
3. The block storage node data repair method of claim 2, characterized in that step S32 comprises the following steps:
a. if there is currently no client read-write service, the QOS controller sets the repair data amount to disk throughput / P, where P is the page size;
b. if client read-write service exists, the QOS controller further judges whether the disk currently has idle resources;
c. if there are no idle resources, the repair data amount is set to a preset value;
d. if there are idle resources, it is further judged whether this is the first repair application of the node under fault repair, and if so, the repair data amount of the node under fault repair is set to the preset value;
if not, the QOS controller determines the repair data amount of the current application according to the amount of the previous application and the disk throughput.
4. The block storage node data repair method of claim 3, characterized in that if the disk currently has idle resources and this is not the first repair application of the node under fault repair, the QOS controller determines the repair data amount according to the current disk utilization.
5. The block storage node data repair method of claim 4, characterized in that if the disk utilization is less than or equal to 75%, the applied repair data amount is 1M × (1 + (1 - disk utilization)) / P; if the disk utilization is greater than 75%, the applied repair data amount = Max(1M/P, 1M × (1 - disk utilization) / P).
6. The block storage node data repair method of claim 3, characterized in that step S33 comprises:
a. if the data is being repaired with the preset repair data amount and the disk idle rate is greater than or equal to 50%, the repair data amount is increased;
b. if client requests increase during the repair, the QOS controller limits the repair data amount of subsequent applications.
7. A storage medium, characterized in that a computer program is stored thereon which, when executed, implements the method of any one of claims 1 to 6.
CN202010588697.4A 2020-06-24 2020-06-24 Block storage node data restoration method and storage medium Active CN111488238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010588697.4A CN111488238B (en) 2020-06-24 2020-06-24 Block storage node data restoration method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010588697.4A CN111488238B (en) 2020-06-24 2020-06-24 Block storage node data restoration method and storage medium

Publications (2)

Publication Number Publication Date
CN111488238A (en) 2020-08-04
CN111488238B (en) 2020-09-18

Family

ID=71813531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010588697.4A Active CN111488238B (en) 2020-06-24 2020-06-24 Block storage node data restoration method and storage medium

Country Status (1)

Country Link
CN (1) CN111488238B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506710B (en) * 2020-12-16 2024-02-23 深信服科技股份有限公司 Distributed file system data restoration method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1196057A (en) * 1997-09-19 1999-04-09 Fujitsu Ltd Device and method for batch-changing names of tree structure
CN103546579B (en) * 2013-11-07 2017-01-04 陈靓 A kind of data logging improves the method for distributed memory system availability
CN103761161B (en) * 2013-12-31 2017-01-04 华为技术有限公司 Recover the method for data, server and system
CN106776130B (en) * 2016-11-30 2020-07-28 华为技术有限公司 Log recovery method, storage device and storage node

Also Published As

Publication number Publication date
CN111488238A (en) 2020-08-04

Similar Documents

Publication Publication Date Title
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
JP6404907B2 (en) Efficient read replica
US6823349B1 (en) Method and system for establishing, maintaining, and using a persistent fracture log
US7389312B2 (en) Mirroring network data to establish virtual storage area network
US8069218B1 (en) System, method and computer program product for process migration with planned minimized down-time
US8127174B1 (en) Method and apparatus for performing transparent in-memory checkpointing
US8495319B2 (en) Data processing system providing remote copy in a system having first, second and third storage systems and establishing first, second and third copy pairs
CN106776130B (en) Log recovery method, storage device and storage node
JP7050955B2 (en) Prioritize storage of shared blockchain data
US20050055523A1 (en) Data processing system
WO2001013235A9 (en) Remote mirroring system, device, and method
CN111131451A (en) Service processing system and service processing method
US20090292891A1 (en) Memory-mirroring control apparatus and memory-mirroring control method
WO2021139571A1 (en) Data storage method, apparatus, and system and data reading method, apparatus, and system in storage system
WO2021249335A1 (en) Input/output system applied to network security defense system
CN116680256B (en) Database node upgrading method and device and computer equipment
CN116107516B (en) Data writing method and device, solid state disk, electronic equipment and storage medium
US20090282204A1 (en) Method and apparatus for backing up storage system data
CN111488238B (en) Block storage node data restoration method and storage medium
CN110442601B (en) Openstack mirror image data parallel acceleration method and device
CN111240899B (en) State machine copying method, device, system and storage medium
CN113204424A (en) Method and device for optimizing Raft cluster and storage medium
US20020116659A1 (en) Fault tolerant storage system and method
CN113326006A (en) Distributed block storage system based on erasure codes
US20040160975A1 (en) Multicast communication protocols, systems and methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant