WO2019232993A1

WO2019232993A1 - Adaptive data recovery flow control method and apparatus, electronic device and storage medium

Info

Publication number: WO2019232993A1
Application number: PCT/CN2018/108128
Authority: WO
Inventors: 陈学伟
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-06-04
Filing date: 2018-09-27
Publication date: 2019-12-12
Also published as: CN108804039B; CN108804039A

Abstract

Provided is an adaptive data recovery flow control method, comprising: periodically synchronizing information of storage nodes in a distributed storage system (S11); when it is detected that a storage node fails (S12), acquiring a storage list of the failed storage node (S13); identifying an IO load category of a user application in a previous statistical period (S14); calculating, according to the IO load category in the previous statistical period, a flow control threshold value corresponding to the current statistical period (S15); according to the storage list and the flow control threshold value corresponding to the current statistical period, executing a recovery operation on data of the failed storage node in the current statistical period (S16); and determining whether the recovery operation is performed on the data of the failed storage node in all the statistical periods, and if so, ending the process (S17). Further provided are an adaptive data recovery flow control apparatus, an electronic device and a storage medium. According to the method, the obvious impact on a normal input and output service performance can be avoided, while the data recovery efficiency of the large-scale distributed storage system is improved and the data loss risk is reduced, and the method also has a good flow control effect.

Description

Adaptive data recovery flow control method, device, electronic equipment and storage medium

This application claims priority to Chinese patent application No. 201810565004.2, filed on June 4, 2018 and entitled "Adaptive Data Recovery Flow Control Method, Device, Electronic Equipment and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of computer technology, and in particular, to an adaptive data recovery flow control method, device, electronic device, and storage medium.

Background

With the advent of the era of big data and cloud computing, the data volume in various fields has shown a rapid growth trend. These growing amounts of data need to rely on large-scale distributed storage systems to achieve reliable storage and efficient access. However, the larger the storage system, the higher the probability of failure. In order to cope with possible failures at any time and to ensure the reliability of data storage, the distributed storage system needs data redundancy. A common data redundancy strategy is to store multiple copies of data on different physical nodes. When some copies are damaged, the damaged copies can be repaired based on the intact copies.

In addition, when the capacity of a distributed storage system is expanded, copies must be migrated on a certain scale to keep the data distribution balanced, and this data migration is also regarded as a special kind of data repair.

On the one hand, data repair must be efficient in order to reduce the risk of data loss. On the other hand, the storage system must keep access efficient for user applications, so that data repair does not degrade the quality of service of normal business. Balancing the allocation of work between data repair and normal data input/output services, so that repair efficiency improves without a significant impact on normal input/output performance and business systems can continuously and steadily obtain high input/output operations per second (IOPS) and throughput, is critical to improving the performance of distributed storage systems.

Summary of the Invention

In view of the above, it is necessary to propose an adaptive data recovery flow control method, device, electronic device and storage medium that improve the data recovery efficiency of a large-scale distributed storage system and reduce the risk of data loss, while ensuring that normal input/output service performance is not significantly impacted, and that provide a good flow control effect.

A first aspect of the present application provides an adaptive data recovery flow control method, where the method includes:

a) Periodically synchronize the information of each storage node in the distributed storage system;

b) Detect if any storage node has failed;

c) when a failure of a storage node is detected, obtaining a storage list of the failed storage node;

d) identify the IO load category of the user application in the previous statistical period;

e) Calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period;

f) performing a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period;

The foregoing steps d) to f) are performed repeatedly until a recovery operation has been performed on the data in all statistical periods of the failed storage node.

A second aspect of the present application provides an adaptive data recovery flow control device, where the device includes:

A synchronization module for regularly synchronizing information of each storage node in the distributed storage system;

A detection module for detecting whether a storage node has failed;

An obtaining module, configured to obtain a storage list of a failed storage node when the detection module detects a failure of the storage node;

Identification module, used to identify the IO load category of the user application in the previous statistical period;

A calculation module, configured to calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period;

The recovery module is configured to perform a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period.

A third aspect of the present application provides an electronic device. The electronic device includes a processor and a memory, where the memory is configured to store at least one instruction, and the processor is configured to execute the at least one instruction to implement the following steps:

b) Detect if any storage node has failed;

A fourth aspect of the present application provides a non-volatile readable storage medium. At least one instruction is stored on the non-volatile readable storage medium, and when the at least one instruction is executed by a processor, the following steps are implemented:

b) Detect if any storage node has failed;

The adaptive data recovery flow control method, device, electronic device and storage medium described in the present application divide a recovery period into multiple statistical periods. In each statistical period, the flow control threshold for the current statistical period is adjusted dynamically according to the IO load category of the user application in the previous statistical period, and the data in the current statistical period is recovered according to that threshold. When the IO load of the user application in the previous statistical period is high, the flow control threshold for fault recovery in the current statistical period is reduced, which lowers the intensity of fault recovery and protects the business IO load. When the IO load of the user application in the previous statistical period is low, the flow control threshold for fault recovery in the current statistical period is increased, which raises the intensity of fault recovery and returns the distributed storage system to a healthy state as soon as possible. That is, this application can improve the data recovery efficiency of a large-scale distributed storage system and reduce the risk of data loss, while avoiding a significant impact on normal IO business performance, and has a good flow control effect.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of an adaptive data recovery flow control method provided in Embodiment 1 of the present application.

FIG. 2 is a functional block diagram of an adaptive data recovery flow control device provided in Embodiment 2 of the present application.

FIG. 3 is a schematic diagram of an electronic device according to a third embodiment of the present application.

The following specific embodiments will further explain the present application in combination with the above drawings.

Detailed Description

In order to more clearly understand the foregoing objectives, features, and advantages of the present application, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments can be combined with each other.

In the following description, many specific details are set forth in order to fully understand the present application. The described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used herein in the specification of the present application are only for the purpose of describing specific embodiments, and are not intended to limit the present application.

The adaptive data recovery flow control method in the embodiment of the present application is applied to one or more electronic devices. The adaptive data recovery flow control method can also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network. The network includes, but is not limited to: a wide area network, a metropolitan area network, or a local area network. The adaptive data recovery flow control method in the embodiment of the present application may be executed by a server or an electronic device; it may also be executed jointly by the server and the electronic device.

For an electronic device that needs to perform the adaptive data recovery flow control method, the adaptive data recovery flow control function provided by the method of the present application can be integrated directly on the electronic device, or installed on it as a client. As another example, the method can run on a device such as a server in the form of a Software Development Kit (SDK) that exposes an interface to the adaptive data recovery flow control function; an electronic device or other device can then implement adaptive control of data recovery through that interface.

Embodiment 1

FIG. 1 is a flowchart of an adaptive data recovery flow control method provided in Embodiment 1 of the present application. According to different requirements, the execution order in this flowchart can be changed, and some steps can be omitted.

S11. Periodically synchronize information of each storage node in the distributed storage system.

In a preferred embodiment of the present application, the distributed storage system (hereinafter referred to as a storage system) adopts a cluster storage method for distributed data storage.

Distributed storage is a data storage technology that uses, over the network, the spare disk space on each storage system in the cluster and pools these scattered storage resources into a single virtual storage device, so that data is stored across the nodes of the cluster.

Therefore, each storage node described in this application is each sub storage system in the cluster. For example, the storage node may be a storage server, a computer, or a storage device.

In a preferred embodiment of the present application, synchronizing the information of each storage node in the distributed storage system may include: 1) a storage center in the storage system synchronizes the information of each storage node; or 2) in a decentralized approach, any storage node in the storage system initiates the synchronization of information among the storage nodes.

The synchronized information of each storage node may include, but is not limited to, the CPU, the memory, the free disk space, and the list of stored files.

In a preferred embodiment of the present application, the storage file list records information such as the name, size, and location of data stored in each storage node.

S12. Detect whether any storage node has failed.

In the preferred embodiment of the present application, a storage node failure may mean that any one or more storage nodes in the storage system cannot start, have lost power, or are disconnected from the network, or that a disk in any one or more storage nodes in the storage system has failed, and so on. Therefore, detecting whether a storage node has failed includes: detecting whether any one or more storage nodes in the storage system have failed to start, lost power, or been disconnected from the network, or whether a disk in any one or more storage nodes in the storage system has failed, and so on.

When any storage node in the storage system fails, for example by failing to start, losing power, or being disconnected from the network, the failed storage node loses its connection to the other storage nodes and/or the storage center. The other storage nodes and/or the storage center can therefore detect that a storage node has failed.

When a disk in any storage node in the storage system fails, the synchronization information sent by the failed storage node to other storage nodes and / or storage centers will include the failure information of the disk. Other storage nodes and / or storage centers can detect that a storage node has failed.

When it is detected that a storage node has failed, step S13 is performed; when it is not detected that a storage node has failed, step S12 is continued.
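The two detection paths described for S12 (a node that stops synchronizing, and a node whose synchronization message reports a failed disk) can be sketched as follows. This is only an illustration under stated assumptions: the `NodeStatus` record, the `find_failed_nodes` function, and the `sync_timeout` value are hypothetical names and parameters, not part of the application.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Hypothetical record of the last synchronization received from a node."""
    node_id: str
    last_sync: float  # time of the last successful sync, in seconds
    failed_disks: list = field(default_factory=list)  # disks reported as failed


def find_failed_nodes(statuses, now, sync_timeout=5.0):
    """Return the IDs of nodes that appear to have failed.

    A node counts as failed if it has been silent for longer than
    sync_timeout (startup failure, power loss, network disconnection),
    or if its latest sync message reports at least one failed disk.
    """
    failed = []
    for status in statuses:
        unreachable = (now - status.last_sync) > sync_timeout
        if unreachable or status.failed_disks:
            failed.append(status.node_id)
    return failed


statuses = [
    NodeStatus("node-a", last_sync=100.0),
    NodeStatus("node-b", last_sync=90.0),  # silent for too long
    NodeStatus("node-c", last_sync=100.0, failed_disks=["sdb"]),
]
print(find_failed_nodes(statuses, now=101.0))  # ['node-b', 'node-c']
```

A timeout on the last synchronization is one simple way to turn "lost its connection" into a testable condition; a real system would likely combine it with retries to avoid flagging a node after a single missed message.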

S13. Acquire a storage list of the storage node that has failed.

In the preferred embodiment of the present application, obtaining the storage list of the storage node that has failed includes obtaining information such as the name, size, and location of data stored in the storage node that has failed.
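As a rough illustration of the synchronized node information and of the storage list read in S13, the records might be modeled as below. All class names, field names, and units are assumptions made for this sketch; the application only states that the list records the name, size, and location of the stored data.

```python
from dataclasses import dataclass, field


@dataclass
class StoredFile:
    """One entry of a node's stored file list (hypothetical layout)."""
    name: str
    size_bytes: int
    location: str  # for example a disk or path identifier on the node


@dataclass
class NodeInfo:
    """Information synchronized for one storage node (hypothetical layout)."""
    node_id: str
    cpu_usage: float  # fraction between 0 and 1 (assumed unit)
    memory_free_bytes: int
    disk_free_bytes: int
    stored_files: list = field(default_factory=list)


def storage_list_of(node):
    """Return (name, size, location) for every file stored on the node."""
    return [(f.name, f.size_bytes, f.location) for f in node.stored_files]


failed_node = NodeInfo(
    node_id="node-b",
    cpu_usage=0.4,
    memory_free_bytes=8 * 1024**3,
    disk_free_bytes=200 * 1024**3,
    stored_files=[
        StoredFile("obj-001", 4096, "disk0"),
        StoredFile("obj-002", 8192, "disk1"),
    ],
)
print(storage_list_of(failed_node))
```

Because the list is part of the periodically synchronized information, the other nodes or the storage center already hold a copy of it when the node itself becomes unreachable.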

S14. Identify the IO load category of the user application in the previous statistical period.

The entire process of storage node data from failure to complete recovery is called a recovery cycle. A recovery period may include multiple statistical periods, and a statistical period may be a preset time period. For example, a statistical period is set to 1 second.

In a preferred embodiment of the present application, the IO load category includes: a high load category, a normal load category, and a low load category.

Specifically, the identifying the IO load category of the user application in the previous statistical period may include:

(1) Obtain the data block size of each IO applied by the user in the previous statistical period, and calculate the average data block size of the IO in the previous statistical period.

The average data block size of the IO in the last statistical period may be calculated by using an arithmetic average algorithm, a geometric mean algorithm, or a root mean square algorithm.

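The formula of the arithmetic mean algorithm is:

$$\bar{S}_{\mathrm{arith}} = \frac{1}{N}\sum_{i=1}^{N} S_i$$

where N is the number of IO data blocks and S_i is the data block size of the i-th IO.

The formula of the geometric mean algorithm is:

$$\bar{S}_{\mathrm{geo}} = \left(\prod_{i=1}^{N} S_i\right)^{1/N}$$

The formula of the root mean square algorithm is:

$$\bar{S}_{\mathrm{rms}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} S_i^{2}}$$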

For example, suppose that during the last statistical period the user application issued a total of ten IOs, with data block sizes of 2M, 1M, 3M, 0.5M, 10M, 4M, 0.1M, 1.2M, 5M, and 8M.

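Using the arithmetic mean algorithm, the average data block size of the IO in the previous statistical period is:

$$\frac{2 + 1 + 3 + 0.5 + 10 + 4 + 0.1 + 1.2 + 5 + 8}{10} = \frac{34.8}{10} = 3.48\ \mathrm{M}$$

Using the geometric mean algorithm, the average data block size of the IO in the previous statistical period is:

$$\left(2 \times 1 \times 3 \times 0.5 \times 10 \times 4 \times 0.1 \times 1.2 \times 5 \times 8\right)^{1/10} = 576^{1/10} \approx 1.89\ \mathrm{M}$$

Using the root mean square algorithm, the average data block size of the IO in the previous statistical period is:

$$\sqrt{\frac{2^2 + 1^2 + 3^2 + 0.5^2 + 10^2 + 4^2 + 0.1^2 + 1.2^2 + 5^2 + 8^2}{10}} = \sqrt{\frac{220.7}{10}} \approx 4.70\ \mathrm{M}$$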

(2) Obtain the transmission delay of each data block in the last statistical period, and calculate the average data block delay of the IO in the last statistical period.

The transmission delay (referred to as the delay) is the time a node needs to push a data block from the node onto the transmission medium when transmitting data, that is, the total time from when the sending station starts sending a data frame until the frame has been completely sent, or the time from when the receiving station starts receiving a data frame until it has finished receiving it.

In a preferred embodiment of the present application, the transmission delay of the data block may be obtained from a load measurement tool or a performance monitoring tool installed in each storage node.

As described above, the average data block delay of the IO in the last statistical period may also be calculated by using an arithmetic mean algorithm, a geometric mean algorithm, or a root mean square algorithm. Assuming that the transmission delays of the ten IOs in the previous statistical period are 1s, 0.8s, 1.5s, 0.4s, 5s, 2s, 0.02s, 0.6s, 3s, and 4.5s, the average data block delay of the IO in the previous statistical period calculated with the arithmetic mean algorithm is:

(1s + 0.8s + 1.5s + 0.4s + 5s + 2s + 0.02s + 0.6s + 3s + 4.5s) / 10 = 1.882s ≈ 1.88s.

It should be understood that the same algorithm is used for both quantities. If the average data block size of the IO in the previous statistical period is calculated using the arithmetic mean algorithm, the average data block delay of the IO in the previous statistical period is also calculated using the arithmetic mean algorithm; if the average data block size is calculated using the geometric mean algorithm, the average data block delay is also calculated using the geometric mean algorithm; and if the average data block size is calculated using the root mean square algorithm, the average data block delay is also calculated using the root mean square algorithm.
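As a minimal sketch of steps (1) and (2), the three averaging algorithms can be written with the Python standard library. The function names and the `MEAN_ALGORITHMS` lookup are illustrative assumptions rather than part of the application; the sample values are the ten block sizes and the ten delays listed in the examples above.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Computed through logarithms; requires every value to be positive.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def root_mean_square(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


MEAN_ALGORITHMS = {
    "arithmetic": arithmetic_mean,
    "geometric": geometric_mean,
    "rms": root_mean_square,
}


def average_size_and_delay(sizes, delays, algorithm="arithmetic"):
    """Average block size and block delay with the SAME algorithm."""
    mean = MEAN_ALGORITHMS[algorithm]
    return mean(sizes), mean(delays)


sizes_mb = [2, 1, 3, 0.5, 10, 4, 0.1, 1.2, 5, 8]
delays_s = [1, 0.8, 1.5, 0.4, 5, 2, 0.02, 0.6, 3, 4.5]

for name in MEAN_ALGORITHMS:
    size, delay = average_size_and_delay(sizes_mb, delays_s, name)
    print(f"{name}: size={size:.2f} M, delay={delay:.3f} s")
```

Selecting the algorithm once and applying it to both lists keeps the size and delay averages consistent with each other, which is the constraint stated in the preceding paragraph.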

(3) Obtain a preset reference value of the data block size of the IO and a reference value of the corresponding data block delay.

In a preferred embodiment of the present application, the reference value of the IO data block size and the reference value of the corresponding data block delay may be preset by an administrator of the storage system according to experience. For example, if experience shows that the delay is smallest when a 4K data block is transmitted and can reach 50ms in the ideal state, then the reference value of the IO data block size can be set to 4K and the reference value of the corresponding data block delay can be set to 50ms.

(4) Calculate the IO load intensity in the previous statistical period according to the average data block size of the IO in the previous statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay.

For example, assuming that the average data block size of the IO in the previous statistical period is X, the average data block delay is Y, the reference value of the data block size is M, and the reference value of the corresponding data block delay is N, the calculation formula of the IO load intensity in the previous statistical period is:

(5) According to the IO load intensity in the last statistical period, use a pre-trained load classification model to determine the IO load category in the last statistical period.

Preferably, the load classification model includes, but is not limited to, a Support Vector Machine (SVM) model. The average data block size of the IO in the last statistical period, the average data block delay of the IO in the last statistical period, and the IO load intensity in the last statistical period are used as the input of the load classification model, which computes and outputs the IO load category in the previous statistical period.

In a preferred embodiment of the present application, the training process of the load classification model includes:

1) Obtain the IO load data of the positive sample and the IO load data of the negative sample, and label the IO load data of the positive sample with the load category, so that the IO load data of the positive sample carries the IO load category label.

For example, select 500 IO load data items corresponding to the high load category, the normal load category, and the low load category, and label the category of each IO load data item. "1" can be used as the label for high load IO data, "2" as the label for normal load IO data, and "3" as the label for low load IO data.

2) Randomly divide the IO load data of the positive samples and the IO load data of the negative samples into a training set of a first preset ratio and a verification set of a second preset ratio, use the training set to train the load classification model, and use the verification set to verify the accuracy of the trained load classification model.

First, distribute the training samples of the different load categories to different folders. For example, training samples of the high load category are placed in a first folder, training samples of the normal load category in a second folder, and training samples of the low load category in a third folder. Then extract training samples of the first preset ratio (for example, 70%) from each folder as the total training samples to train the load classification model, and take the remaining samples of the second preset ratio (for example, 30%) from each folder as the total test samples to verify the accuracy of the trained load classification model.

3) If the accuracy is greater than or equal to a preset accuracy, end the training and use the trained load classification model as a classifier to identify the IO load category in the current statistical period; if the accuracy is less than the preset accuracy, increase the number of positive samples and the number of negative samples and retrain the load classification model until the accuracy is greater than or equal to the preset accuracy.
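The per-category split in step 2) and the accuracy check in step 3) can be sketched without any particular classifier. The code below is an assumption-laden illustration: `stratified_split` and `accuracy` are hypothetical helper names, the grouping by label stands in for the "folders", and the classifier itself (for example an SVM) is represented only by an arbitrary `predict` callable.

```python
import random


def stratified_split(samples, train_ratio=0.7, seed=0):
    """Split (features, label) samples per label into train and validation sets."""
    rng = random.Random(seed)
    by_label = {}  # plays the role of one "folder" per load category
    for features, label in samples:
        by_label.setdefault(label, []).append((features, label))
    train, validation = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        validation.extend(group[cut:])
    return train, validation


def accuracy(predict, validation):
    """Fraction of validation samples whose predicted label is correct."""
    correct = sum(1 for features, label in validation if predict(features) == label)
    return correct / len(validation)


# Labels follow the text: 1 = high load, 2 = normal load, 3 = low load.
samples = [((i,), label) for label in (1, 2, 3) for i in range(10)]
train, validation = stratified_split(samples)
print(len(train), len(validation))  # 21 9

# A dummy predictor that always answers "high load", only to exercise accuracy().
print(accuracy(lambda features: 1, validation))
```

Splitting each category separately keeps the 70/30 ratio inside every load category, so the verification set is not dominated by whichever category happens to have the most samples.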

S15. Calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period.

Flow control here means controlling the traffic rate. There are two methods for implementing flow control: one is to implement flow control based on source address, destination address, source port, destination port, and protocol type through the QoS module of routers and switches; the other is to use dedicated flow control equipment to implement application-based flow control.

Each statistical period in the recovery period corresponds to a flow control threshold, and the flow control threshold corresponding to each statistical period is adjusted dynamically. The flow control threshold corresponding to the current statistical period is calculated from the IO load category in the previous statistical period, and the flow control threshold corresponding to the next statistical period is calculated from the IO load category in the current statistical period.

It should be noted that the flow control threshold corresponding to the first statistical period in the recovery period of this application is a preset flow control threshold, which can be preset by the administrator of the storage system based on experience. That is, the preset flow control threshold is used as the flow control threshold of the first statistical period in the recovery period; the flow control threshold corresponding to the second statistical period is calculated according to the IO load category in the first statistical period; the flow control threshold corresponding to the third statistical period is calculated according to the IO load category in the second statistical period; and so on.

Specifically, calculating the flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period may include:

1) When the IO load category in the previous statistical cycle is a high load category, reduce the flow control threshold corresponding to the previous statistical cycle by a first preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle.

When the IO load in the previous statistical period is high, the flow control threshold is reduced by the first preset amplitude, so that the recovery operation on the data of the storage node in the current statistical period runs with a lower flow control threshold. This reduces the speed of data recovery and ensures efficient access for user applications.

In a preferred embodiment of the present application, the first preset amplitude may be 1/2 of a flow control threshold corresponding to a previous statistical period. That is, the flow control threshold corresponding to the current statistical period is 1/2 of the flow control threshold corresponding to the previous statistical period, and the flow control threshold corresponding to the next statistical period is 1/2 of the flow control threshold corresponding to the current statistical period.

2) When the IO load category in the previous statistical cycle is a low load category, increase the flow control threshold corresponding to the previous statistical cycle by a second preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle.

When the IO load in the previous statistical period is low, the flow control threshold is increased by the second preset amplitude, so that the recovery operation on the data of the storage node in the current statistical period runs with a higher flow control threshold. This increases the speed of data recovery while still ensuring the access quality of user applications.

In a preferred embodiment of the present application, the second preset amplitude may be 1.5 times a flow control threshold corresponding to a previous statistical period. That is, the flow control threshold corresponding to the current statistical period is 1.5 times the flow control threshold corresponding to the previous statistical period, and the flow control threshold corresponding to the next statistical period is 1.5 times the flow control threshold corresponding to the current statistical period.

3) When the IO load category in the previous statistical cycle is a normal load category, the flow control threshold corresponding to the previous statistical cycle is used as the flow control threshold corresponding to the current statistical cycle.
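The three cases above amount to a small update rule. The sketch below uses the example amplitudes from the text (1/2 for a high load, 1.5 times for a low load); the function name, the category strings, and the starting value of 100 are assumptions for illustration, and the threshold unit is left unspecified as in the source.

```python
HIGH, NORMAL, LOW = "high", "normal", "low"


def next_threshold(previous_threshold, load_category,
                   decrease_factor=0.5, increase_factor=1.5):
    """Flow control threshold for the current period from the previous one."""
    if load_category == HIGH:
        return previous_threshold * decrease_factor
    if load_category == LOW:
        return previous_threshold * increase_factor
    return previous_threshold  # normal load: keep the threshold unchanged


threshold = 100.0  # preset threshold of the first statistical period (assumed value)
history = [threshold]
for category in [HIGH, HIGH, NORMAL, LOW]:
    threshold = next_threshold(threshold, category)
    history.append(threshold)
print(history)  # [100.0, 50.0, 25.0, 25.0, 37.5]
```

Because a decrease halves the threshold while an increase multiplies it by only 1.5, recovery backs off quickly under load and ramps up more cautiously afterwards.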

S16. Perform a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period.

S17. Determine whether a recovery operation is performed on data in all statistical periods of the faulty storage node.

When it is determined that a recovery operation has been performed on the data in all statistical periods of the failed storage node, the process ends; when it is determined that a recovery operation has not yet been performed on the data in all statistical periods of the failed storage node, the process returns to step S14 described above.
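The loop formed by steps S14 to S17 can be simulated in a few lines. This is a sketch under several assumptions that are not in the source: the threshold is interpreted as the maximum amount of data recovered per statistical period, the observed load categories are supplied as a list instead of coming from a classifier, and a `min_threshold` floor is added so that the simulation always makes progress.

```python
def adjust_threshold(threshold, category, min_threshold=1):
    """Halve on high load, multiply by 1.5 on low load, keep on normal load."""
    if category == "high":
        threshold = threshold * 0.5
    elif category == "low":
        threshold = threshold * 1.5
    return max(min_threshold, threshold)  # floor is an assumption of this sketch


def run_recovery(storage_list, load_categories, initial_threshold):
    """Return (period, threshold, amount recovered) for each statistical period."""
    remaining = [[name, size] for name, size in storage_list]
    threshold = initial_threshold  # preset threshold of the first period
    schedule = []
    period = 0
    while remaining:  # S17: stop once everything is recovered
        if period > 0:
            # S14: load category observed in the previous period
            index = min(period - 1, len(load_categories) - 1)
            # S15: threshold for the current period
            threshold = adjust_threshold(threshold, load_categories[index])
        budget = threshold
        while remaining and budget > 0:  # S16: recover within the budget
            item = remaining[0]
            chunk = min(item[1], budget)
            item[1] -= chunk
            budget -= chunk
            if item[1] == 0:
                remaining.pop(0)
        schedule.append((period, threshold, threshold - budget))
        period += 1
    return schedule


storage_list = [("block-1", 60), ("block-2", 50), ("block-3", 40)]
schedule = run_recovery(storage_list, ["high", "low", "low", "normal"], 40)
for row in schedule:
    print(row)
```

With these inputs the simulated recovery takes five periods: the threshold drops from 40 to 20 after the high-load period, then climbs to 30 and 45 as the load falls, and the last period only needs 15 of its budget of 45.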

In summary, the adaptive data recovery flow control method described in this application periodically synchronizes the information of each storage node in a distributed storage system; when a failure of a storage node is detected, acquires the storage list of the failed storage node; identifies the IO load category of the user application in the previous statistical period; calculates the flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period; and, according to the storage list and the flow control threshold corresponding to the current statistical period, performs a recovery operation on the data of the failed storage node in the current statistical period, until a recovery operation has been performed on the data in all statistical periods of the failed storage node. This application divides a recovery period into multiple statistical periods. In each statistical period, the flow control threshold for the current statistical period is adjusted dynamically according to the IO load category of the user application in the previous statistical period, and the data in the current statistical period is recovered according to that threshold. When the IO load of the user application in the previous statistical period is high, the flow control threshold for fault recovery in the current statistical period is reduced, which lowers the intensity of fault recovery and protects the business IO load. When the IO load of the user application in the previous statistical period is low, the flow control threshold for fault recovery in the current statistical period is increased, which raises the intensity of fault recovery and returns the distributed storage system to a healthy state as soon as possible. That is, this application can improve the data recovery efficiency of a large-scale distributed storage system and reduce the risk of data loss, while avoiding a significant impact on the performance of normal input and output services, and has a good flow control effect.

Secondly, the flow control threshold for the current statistical period is adjusted automatically according to the IO load category of the user application in the previous statistical period, without manual adjustment by an administrator. This reduces the administrator's workload, avoids inaccurate adjustments caused by subjective factors, allows the threshold to follow changes in the distributed storage system and its hardware facilities, and provides high reliability.

The foregoing is only a specific implementation of this application, but the scope of protection of this application is not limited to it. Those of ordinary skill in the art can make improvements without departing from the inventive concept of this application, and all such improvements fall within the protection scope of this application.

In the following, the functional modules and hardware structures of the electronic devices that implement the above-mentioned adaptive data recovery flow control method are described with reference to Figures 2 to 3.

Embodiment 2

FIG. 2 is a functional module diagram of a preferred embodiment of an adaptive data recovery flow control device of the present application.

In some embodiments, the adaptive data recovery flow control device 20 (hereinafter referred to as "data recovery flow control device 20") runs in an electronic device. The data recovery flow control device 20 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the data recovery flow control device 20 may be stored in a memory and executed by at least one processor to perform the adaptive data recovery flow control method (see FIG. 1 and the related description for details).

In this embodiment, the data recovery flow control device 20 of the electronic device may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include a synchronization module 201, a detection module 202, an acquisition module 203, an identification module 204, a training module 205, a calculation module 206, a recovery module 207, and a judgment module 208. A module, as referred to in the present application, is a series of computer-readable instruction segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. The functions of each module are described in detail in the following embodiments.

The synchronization module 201 is configured to periodically synchronize information of each storage node in the distributed storage system.

In a preferred embodiment of the present application, the synchronization module 201 synchronizing information of each storage node in the distributed storage system may include: 1) a storage center in the storage system performs information synchronization of each storage node; or 2) Using a decentralized method, any one storage node in the storage system initiates information synchronization of each storage node.

The detection module 202 is configured to detect whether a storage node has failed.

In the preferred embodiment of the present application, a storage node failure may be that any one or more storage nodes in the storage system cannot start, are powered off, or are disconnected from the network, or that a disk in any one or more storage nodes in the storage system has failed. Accordingly, the detection module 202 detecting whether a storage node has failed includes: detecting whether any one or more storage nodes in the storage system cannot start, are powered off, or are disconnected from the network; or detecting whether a disk in any one or more storage nodes in the storage system has failed.
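The two failure conditions above (the node itself is down, or one of its disks has failed) can be sketched as a simple predicate over a per-node status record. The `NodeStatus` fields and the function names are hypothetical; the application does not specify how node or disk health is reported.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeStatus:
    """Hypothetical health snapshot of one storage node."""
    node_id: str
    started: bool = True
    powered: bool = True
    network_reachable: bool = True
    failed_disks: List[str] = field(default_factory=list)


def has_failed(status: NodeStatus) -> bool:
    """A node counts as failed if it is down or if any of its disks failed."""
    node_down = not (status.started and status.powered
                     and status.network_reachable)
    return node_down or bool(status.failed_disks)


def failed_nodes(statuses: List[NodeStatus]) -> List[str]:
    """Return the ids of all nodes that currently count as failed."""
    return [s.node_id for s in statuses if has_failed(s)]
```

In this sketch a healthy node with one failed disk is reported as failed in the same way as a node that has lost power, which mirrors the "or" between the two detection conditions.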

The acquisition module 203 is configured to obtain the storage list of the failed storage node when the detection module 202 detects that a storage node has failed.

The identification module 204 is configured to identify an IO load category of a user application in a previous statistical period.

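Specifically, the identification module 204 identifying the IO load category of the user application in the previous statistical period may include:

1) acquiring the data block size of each IO of the user application in the previous statistical period, and calculating the average data block size of the IO in the previous statistical period;

2) acquiring the transmission delay of each data block in the previous statistical period, and calculating the average data block delay of the IO in the previous statistical period;

3) obtaining a preset reference value of the data block size of the IO and a corresponding reference value of the data block delay;

4) calculating the IO load intensity in the previous statistical period according to the average data block size, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay;

5) determining the IO load category in the previous statistical period from the IO load intensity by using a pre-trained load classification model.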

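The formula of the arithmetic mean algorithm is:

$$\bar{x}_{A} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The formula of the geometric mean algorithm is:

$$\bar{x}_{G} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

The formula of the root mean square algorithm is:

$$\bar{x}_{Q} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}$$

Here, in standard notation, $x_1, \ldots, x_n$ are the $n$ values being averaged, that is, the data block sizes or the transmission delays in the statistical period.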

For example, suppose that during the last statistical period the user application issued a total of ten IOs, with data block sizes of 2M, 1M, 3M, 0.5M, 10M, 4M, 0.1M, 1.2M, 5M, and 8M.

The transmission delay (referred to as the delay) is the time a node needs to push a data block from the node onto the transmission medium when transmitting data. That is, it is the total time from when a sending station starts sending a data frame until the frame has been completely sent, or from when a receiving station starts receiving a data frame until the frame has been completely received.

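For example, for ten data blocks with the transmission delays listed below, the arithmetic mean of the data block delay is (1s + 0.8s + 1.5s + 0.4s + 5s + 2s + 0.1s + 0.6s + 3s + 4.4s) / 10 = 1.88s.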
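The averaging step can be checked with a few lines of Python. The sketch below applies the three averaging algorithms named above to the ten data block sizes and the ten delays from the example; the function and variable names are illustrative and do not come from the application.

```python
import math

# Values taken from the worked example: ten IOs in the last statistical period.
block_sizes_mb = [2, 1, 3, 0.5, 10, 4, 0.1, 1.2, 5, 8]
delays_s = [1, 0.8, 1.5, 0.4, 5, 2, 0.1, 0.6, 3, 4.4]


def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms (values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def root_mean_square(values):
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


avg_size = arithmetic_mean(block_sizes_mb)   # about 3.48 M
avg_delay = arithmetic_mean(delays_s)        # about 1.88 s, as in the example
```

For positive inputs the geometric mean never exceeds the arithmetic mean, and the arithmetic mean never exceeds the root mean square, so the choice of algorithm changes how strongly a few large IOs influence the average.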

The training module 205 is configured to train the load classification model.

The process of the training module 205 training the load classification model includes:

The calculation module 206 is configured to calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period.

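Specifically, the calculation module 206 calculating the flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period may include:

using the preset flow control threshold as the flow control threshold corresponding to the first statistical period;

when the IO load category in the previous statistical period is the high load category, reducing the flow control threshold corresponding to the previous statistical period by a first preset amplitude to obtain the flow control threshold corresponding to the current statistical period;

when the IO load category in the previous statistical period is the low load category, increasing the flow control threshold corresponding to the previous statistical period by a second preset amplitude to obtain the flow control threshold corresponding to the current statistical period;

when the IO load category in the previous statistical period is the normal load category, using the flow control threshold corresponding to the previous statistical period as the flow control threshold corresponding to the current statistical period.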

The recovery module 207 is configured to perform a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period.

The judgment module 208 is configured to determine whether the recovery operation has been performed on the data of the failed storage node in all statistical periods.

When the judgment module 208 determines that the recovery operation has not yet been performed on the data of the failed storage node in all statistical periods, control returns to the identification module 204.
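The loop formed by the identification, calculation, recovery, and judgment modules can be sketched as follows. This is only one possible reading: it assumes the flow control threshold acts as a per-period recovery budget measured in the same units as the object sizes, and the names `run_recovery`, `classify_previous`, and `adjust` are hypothetical.

```python
def run_recovery(storage_list, preset_threshold, classify_previous, adjust):
    """Recover every object in storage_list, one statistical period at a time.

    storage_list: list of (object_id, size) pairs for the failed node.
    preset_threshold: recovery budget used for the first statistical period.
    classify_previous: callable returning the previous period's load category.
    adjust: callable (threshold, category) -> threshold for the current period.
    Returns a list of (threshold, recovered_ids), one entry per period.
    """
    pending = list(storage_list)
    threshold = preset_threshold
    periods = []
    while pending:  # judgment step: stop once nothing is left to recover
        if periods:  # from the second period on, adapt to the previous load
            threshold = adjust(threshold, classify_previous())
        budget = threshold
        batch = []
        # Take at least one object per period so the loop always progresses.
        while pending and (not batch or pending[0][1] <= budget):
            object_id, size = pending.pop(0)
            budget -= size
            batch.append(object_id)
        periods.append((threshold, batch))
    return periods


# Hypothetical usage: four 40-unit objects, a preset budget of 80 per period.
steps = {"high": -20, "normal": 0, "low": 20}
observed = iter(["high", "low"])
log = run_recovery(
    [("a", 40), ("b", 40), ("c", 40), ("d", 40)],
    preset_threshold=80,
    classify_previous=lambda: next(observed),
    adjust=lambda threshold, category: threshold + steps[category],
)
```

In this run the first period recovers two objects under the preset budget, the high-load observation shrinks the budget so that the second period recovers only one, and the low-load observation restores the budget for the final period.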

In summary, in the adaptive data recovery flow control device described in this application, the synchronization module 201 periodically synchronizes information of each storage node in the distributed storage system; the acquisition module 203 obtains the storage list of the failed storage node when the detection module 202 detects that a storage node has failed; the identification module 204 identifies the IO load category of the user application in the previous statistical period; the calculation module 206 calculates the flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period; and the recovery module 207 performs a recovery operation on the data of the failed storage node in the current statistical period according to the storage list and the flow control threshold corresponding to the current statistical period, until the recovery operation has been performed on the data of the failed storage node in all statistical periods. This application divides a recovery period into multiple statistical periods. In each statistical period, the flow control threshold for the current statistical period is dynamically adjusted according to the IO load category of the user application in the previous statistical period, and the data is recovered in the current statistical period under that threshold. When the IO load of the user application in the previous statistical period is high, the flow control threshold for fault recovery in the current statistical period is reduced, which lowers the intensity of fault recovery and protects the business IO load. When the IO load of the user application in the previous statistical period is low, the flow control threshold for fault recovery in the current statistical period is increased, which raises the fault recovery intensity so that the distributed storage system returns to a healthy state as soon as possible.
That is, this application can improve the data recovery efficiency of a large-scale distributed storage system and reduce the risk of data loss, while avoiding a significant impact on normal I / O business performance, and has a good flow control effect.

The above integrated unit, when implemented in the form of a software functional module, may be stored in a non-volatile readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a dual-screen device, or a network device) or a processor to execute part of the method described in the embodiments of this application.

Example three

FIG. 3 is a schematic diagram of an electronic device provided in Embodiment 3 of the present application.

The electronic device 3 includes a memory 31, at least one processor 32, computer-readable instructions 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.

When the at least one processor 32 executes the computer-readable instructions 33, the steps in the foregoing embodiment of the adaptive data recovery flow control method are implemented.

Exemplarily, the computer-readable instructions 33 may be divided into one or more modules/units, which are stored in the memory 31 and executed by the at least one processor 32 to carry out this application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 33 in the electronic device 3.

The electronic device 3 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 3 is only an example of the electronic device 3 and does not constitute a limitation on it. The electronic device 3 may include more or fewer components than shown in the figure, may combine some components, or may use different components. For example, the electronic device 3 may further include an input/output device, a network access device, a bus, and the like.

The at least one processor 32 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 32 may be a microprocessor or any conventional processor. The processor 32 is the control center of the electronic device 3 and uses various interfaces and lines to connect the various parts of the entire electronic device 3.

The memory 31 may be configured to store the computer-readable instructions 33 and/or modules/units. The processor 32 implements the various functions of the electronic device 3 by running or executing the computer-readable instructions and/or modules/units stored in the memory 31 and by calling the data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function). The data storage area may store data (such as audio data or a phone book) created according to the use of the electronic device 3. In addition, the memory 31 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one disk storage device, a flash memory device, or other volatile solid-state storage device.

When the integrated module/unit of the electronic device 3 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by computer-readable instructions that instruct the related hardware. The computer-readable instructions can be stored in a non-volatile readable storage medium, and when they are executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer-readable instructions may be in source code form, object code form, an executable file, or some intermediate form. The non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.

In the several embodiments provided in this application, it should be understood that the disclosed electronic device and method may be implemented in other ways. For example, the embodiments of the electronic device described above are merely schematic. For example, the division of the units is only a logical function division, and there may be another division manner in actual implementation.

In addition, each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist separately physically, or two or more units may be integrated in the same unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

It is obvious to a person skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be implemented in other specific forms without departing from its spirit or basic features. Therefore, the embodiments are to be regarded as exemplary and non-limiting in every respect. The scope of the present application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in this application. Any reference signs in the claims should not be construed as limiting the claims involved. Furthermore, the word "comprising" does not exclude other units, and the singular does not exclude the plural. A plurality of units or devices stated in the system claims may also be implemented by one unit or device through software or hardware. Words such as first and second are used to indicate names and do not indicate any particular order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present application and are not limiting. Although the present application has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent replacements can be made to the technical solution of the present application without departing from its spirit and scope.

Claims

An adaptive data recovery flow control method is characterized in that the method includes:

a) Periodically synchronize the information of each storage node in the distributed storage system;

b) Detect if any storage node has failed;

c) when a failure of a storage node is detected, obtaining a storage list of the failed storage node;

d) identify the IO load category of the user application in the previous statistical period;

e) Calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period;

f) performing a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period;

The foregoing steps d) -f) are repeatedly performed until a recovery operation is performed on data in all statistical periods of the failed storage node.
The method according to claim 1, wherein calculating the flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period comprises:

The preset flow control threshold is used as the flow control threshold corresponding to the first statistical period.
The method according to claim 1, wherein the identifying an IO load category of a user application in a previous statistical period comprises:

Acquiring the data block size of each IO of the user application in the last statistical period, and calculating the average data block size of the IO in the last statistical period;

Acquiring the transmission delay of each data block in the last statistical period, and calculating the average data block delay of the IO in the last statistical period;

Obtaining a preset reference value of the data block size of the IO and a corresponding reference value of the data block delay;

Calculating the IO load intensity in the last statistical period according to the average data block size of the IO in the last statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay;

According to the IO load intensity in the last statistical period, a pre-trained load classification model is used to determine the IO load category in the last statistical period.
The method according to claim 1, wherein the IO load category comprises a high load category, a normal load category, and a low load category, and calculating the flow control threshold corresponding to the current statistical cycle according to the IO load category in the previous statistical cycle comprises:

When the IO load category in the previous statistical cycle is a high load category, reducing the flow control threshold corresponding to the previous statistical cycle by a first preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle;

When the IO load category in the previous statistical cycle is a low load category, increasing the flow control threshold corresponding to the previous statistical cycle by a second preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle;

When the IO load category in the previous statistical cycle is a normal load category, the flow control threshold corresponding to the previous statistical cycle is used as the flow control threshold corresponding to the current statistical cycle.
The method according to claim 3, wherein the calculation formula for calculating the IO load intensity in the previous statistical period according to the average data block size of the IO in the previous statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay is:
Where X is the average data block size of the IO in the previous statistical period, Y is the average data block delay, M is the reference value of the data block size, and N is the reference value of the corresponding data block delay.
The method of claim 1, wherein the detecting whether a storage node fails includes:

Detecting whether any one or more storage nodes in the distributed storage system cannot be started, powered off, or disconnected from the network; or

Detect whether a disk in any one or more storage nodes in the distributed storage system has failed.
The method according to any one of claims 1 to 6, wherein periodically synchronizing the information of each storage node in the distributed storage system comprises:

A storage center in the distributed storage system performs information synchronization of each storage node; or

Adopting a decentralized method, any one storage node in the distributed storage system initiates information synchronization of each storage node.
An adaptive data recovery flow control device is characterized in that the device includes:

A synchronization module for regularly synchronizing information of each storage node in the distributed storage system;

A detection module for detecting whether a storage node has failed;

An obtaining module, configured to obtain a storage list of a failed storage node when the detection module detects a failure of the storage node;

An identification module, configured to identify the IO load category of the user application in the previous statistical period;

A calculation module, configured to calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period;

A recovery module, configured to perform a recovery operation on the data of the failed storage node in the current statistical period according to the storage list and the flow control threshold corresponding to the current statistical period.
An electronic device is characterized in that the electronic device includes a processor and a memory, where the memory is configured to store at least one instruction, and the processor is configured to execute the at least one instruction to implement the following steps:

a) Periodically synchronize the information of each storage node in the distributed storage system;

b) Detect if any storage node has failed;

c) when a failure of a storage node is detected, obtaining a storage list of the failed storage node;

d) identify the IO load category of the user application in the previous statistical period;

e) Calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period;

f) performing a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period;

Repeat the above steps d) -f) until the recovery operation is performed on the data in all the statistical periods of the failed storage node.
The electronic device according to claim 9, wherein the calculating a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period comprises:

The preset flow control threshold is used as the flow control threshold corresponding to the first statistical period.
The electronic device according to claim 9, wherein the identifying the IO load category of the user application in the previous statistical period comprises:

Acquiring the data block size of each IO of the user application in the last statistical period, and calculating the average data block size of the IO in the last statistical period;

Acquiring the transmission delay of each data block in the last statistical period, and calculating the average data block delay of the IO in the last statistical period;

Obtaining a preset reference value of the data block size of the IO and a corresponding reference value of the data block delay;

Calculating the IO load intensity in the last statistical period according to the average data block size of the IO in the last statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay;

According to the IO load intensity in the last statistical period, a pre-trained load classification model is used to determine the IO load category in the last statistical period.
The electronic device according to claim 9, wherein the IO load category comprises a high load category, a normal load category, and a low load category, and calculating the flow control threshold corresponding to the current statistical cycle according to the IO load category in the previous statistical cycle comprises:

When the IO load category in the previous statistical cycle is a high load category, reducing the flow control threshold corresponding to the previous statistical cycle by a first preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle;

When the IO load category in the previous statistical cycle is a low load category, increasing the flow control threshold corresponding to the previous statistical cycle by a second preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle;

When the IO load category in the previous statistical cycle is a normal load category, the flow control threshold corresponding to the previous statistical cycle is used as the flow control threshold corresponding to the current statistical cycle.
The electronic device according to claim 11, wherein the calculation formula for calculating the IO load intensity in the previous statistical period according to the average data block size of the IO in the previous statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay is:
Where X is the average data block size of the IO in the previous statistical period, Y is the average data block delay, M is the reference value of the data block size, and N is the reference value of the corresponding data block delay.
The electronic device according to claim 9, wherein the detecting whether a storage node fails includes:

Detecting whether any one or more storage nodes in the distributed storage system cannot be started, powered off, or disconnected from the network; or

Detect whether a disk in any one or more storage nodes in the distributed storage system has failed.
A non-volatile readable storage medium storing at least one instruction, characterized in that, when the at least one instruction is executed by a processor, the following steps are implemented:

a) Periodically synchronize the information of each storage node in the distributed storage system;

b) Detect if any storage node has failed;

c) when a failure of a storage node is detected, obtaining a storage list of the failed storage node;

d) identify the IO load category of the user application in the previous statistical period;

e) Calculate a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period;

f) performing a recovery operation on the data in the current statistical period of the storage node that has failed according to the storage list and the flow control threshold corresponding to the current statistical period;

The foregoing steps d) -f) are repeatedly performed until a recovery operation is performed on data in all statistical periods of the failed storage node.
The storage medium according to claim 15, wherein the calculating a flow control threshold corresponding to the current statistical period according to the IO load category in the previous statistical period comprises:

The preset flow control threshold is used as the flow control threshold corresponding to the first statistical period.
The storage medium according to claim 15, wherein the identifying the IO load category of the user application in the previous statistical period comprises:

Acquiring the data block size of each IO of the user application in the last statistical period, and calculating the average data block size of the IO in the last statistical period;

Acquiring the transmission delay of each data block in the last statistical period, and calculating the average data block delay of the IO in the last statistical period;

Obtaining a preset reference value of the data block size of the IO and a corresponding reference value of the data block delay;

Calculating the IO load intensity in the last statistical period according to the average data block size of the IO in the last statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay;

According to the IO load intensity in the last statistical period, a pre-trained load classification model is used to determine the IO load category in the last statistical period.
The storage medium according to claim 15, wherein the IO load category comprises a high load category, a normal load category, and a low load category, and calculating the flow control threshold corresponding to the current statistical cycle according to the IO load category in the previous statistical cycle comprises:

When the IO load category in the previous statistical cycle is a high load category, reducing the flow control threshold corresponding to the previous statistical cycle by a first preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle;

When the IO load category in the previous statistical cycle is a low load category, increasing the flow control threshold corresponding to the previous statistical cycle by a second preset amplitude to obtain the flow control threshold corresponding to the current statistical cycle;

When the IO load category in the previous statistical cycle is a normal load category, the flow control threshold corresponding to the previous statistical cycle is used as the flow control threshold corresponding to the current statistical cycle.
The storage medium according to claim 17, wherein the calculation formula for calculating the IO load intensity in the previous statistical period according to the average data block size of the IO in the previous statistical period, the average data block delay, the reference value of the data block size, and the reference value of the corresponding data block delay is:
Where X is the average data block size of the IO in the previous statistical period, Y is the average data block delay, M is the reference value of the data block size, and N is the reference value of the corresponding data block delay.
The storage medium of claim 15, wherein the detecting whether a storage node fails includes:

Detecting whether any one or more storage nodes in the distributed storage system cannot be started, powered off, or disconnected from the network; or

Detect whether a disk in any one or more storage nodes in the distributed storage system has failed.