CN114896098B - Data fault tolerance method and distributed storage system

Info

Publication number
CN114896098B
CN114896098B (application CN202210464602.7A)
Authority
CN
China
Prior art keywords
potential, copy, data, copies, node
Prior art date
Legal status
Active
Application number
CN202210464602.7A
Other languages
Chinese (zh)
Other versions
CN114896098A (en)
Inventor
谭玉娟
魏鑫蕾
刘铎
伍代涛
吴宇
陈咸彰
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Application filed by Chongqing University
Priority to CN202210464602.7A
Publication of CN114896098A
Application granted
Publication of CN114896098B
Status: Active

Classifications

    • G06F 11/085 - Error detection or correction by redundancy in data representation using codes with inherent redundancy, e.g. n-out-of-m codes
    • G06F 11/0709 - Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F 11/0793 - Remedial or corrective actions
    • G06F 3/064 - Management of blocks
    • G06F 3/0643 - Management of files
    • G06F 3/0647 - Migration mechanisms
    • G06F 3/0649 - Lifecycle management
    • G06F 3/067 - Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data fault tolerance method and a distributed storage system. A plurality of original data blocks, obtained by applying an encoding fault-tolerance technique to a file, are distributed to corresponding data nodes. Compared with the existing approach of immediately generating three copies of each original data block under a traditional redundancy mechanism, this reduces network transmission pressure during the initial upload of the file and markedly mitigates the data write amplification problem. When a data node generates potential copies in the course of file operations, it queries the management node as to whether each generated potential copy should be retained, and the file life-cycle management policy set at the management node destroys the corresponding potential copies. Because potential copies arise from routine system operation and only a query to the management node about whether to retain the copy resource is needed, almost no additional network bandwidth is consumed, so the storage overhead of the system can be reduced while data repair performance is guaranteed.

Description

Data fault tolerance method and distributed storage system
Technical Field
The present invention relates to the field of data storage technologies, and in particular, to a data fault tolerance method and a distributed storage system.
Background
Fault tolerance is one of the important research topics in distributed storage systems. Existing distributed storage systems improve fault tolerance mainly through copies, erasure codes, and the traditional hybrid fault-tolerance technique, thereby ensuring reliable data storage.
A copy is formed by duplicating data and storing it on different nodes; each copy is identical to the original data. When a node fails, service can be switched to another surviving node that stores the data, or the data can be copied directly from such a node, ensuring the reliability and availability of both services and data. Most distributed storage systems, such as GFS (Google File System), HDFS (Hadoop Distributed File System), and Ceph, default to the three-copy fault-tolerance technique.
Erasure codes can provide higher reliability and lower storage overhead than copy techniques. Most large-scale distributed storage clusters, such as GFS, HDFS, and WAS, have successively introduced erasure codes to reduce storage cost. The most common and widely used is the Reed-Solomon code (RS), constructed from two configurable parameters k and m and denoted RS(k, m). RS divides a data block D into k equal-size original data blocks and generates m check blocks by linear-combination encoding; the k + m blocks form a stripe. In the absence of node failure, the corresponding data blocks can be read to obtain the original data without additional decoding. If one or more blocks are unavailable, however, the stripe is degraded; the loss of at most m blocks can be tolerated, and the missing data can be decoded and reconstructed from any k blocks on surviving nodes.
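For illustration only, the following minimal sketch models the stripe bookkeeping implied by this property (any k of the k + m blocks suffice to reconstruct the stripe); the Galois-field encoding itself is elided, and all names are assumptions rather than anything taken from the patent.

```python
# A minimal sketch of RS(k, m) stripe bookkeeping: a stripe is degraded as
# soon as one block is lost, and recoverable while at most m blocks are lost.

from dataclasses import dataclass, field

@dataclass
class Stripe:
    k: int                                   # original data blocks
    m: int                                   # check (parity) blocks
    lost: set = field(default_factory=set)   # indices of unavailable blocks

    def mark_lost(self, block_index: int) -> None:
        self.lost.add(block_index)

    @property
    def degraded(self) -> bool:
        return len(self.lost) > 0

    @property
    def recoverable(self) -> bool:
        # Reconstruction needs any k surviving blocks, so at most m losses.
        return len(self.lost) <= self.m

stripe = Stripe(k=3, m=2)                    # RS(3, 2): 5 blocks per stripe
stripe.mark_lost(1)
stripe.mark_lost(4)
print(stripe.degraded, stripe.recoverable)   # True True: 2 <= m losses
```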
The traditional hybrid fault-tolerance technique is a scheme mixing erasure codes with multiple copies: after erasure coding yields the corresponding data blocks and check blocks, multiple copies are generated immediately and the data is distributed. Taking the mix of RS(3, 2) and three copies as an example, after erasure coding yields 5 (3 + 2) blocks, three copies are constructed for each block, producing 15 copies of data. Compared with a pure erasure-code scheme, the hybrid scheme can use the copies to speed up data repair to some extent; but compared with a pure copy scheme, the much larger amount of data to be stored causes data write amplification even more severe than the three-copy technique, making it hard to deploy in practice in a distributed storage system under a weak network environment.
In summary, the copy technique and the traditional hybrid fault-tolerance technique incur very high storage overhead and cause a serious data write amplification problem, while the erasure-code approach has a long data recovery time and repair performance inferior to the copy and traditional hybrid techniques.
Disclosure of Invention
The invention provides a distributed storage system and a data fault tolerance method to solve the above problems of the copy technique, the erasure-code technique, and the traditional hybrid fault-tolerance method, so that a distributed storage cluster can achieve better fault tolerance together with better read-write and repair performance.
A first aspect of the present invention provides a data fault tolerance method applied to a distributed storage system comprising a management node and a plurality of data nodes.
The method comprises the following steps:
after receiving a file write request sent by a user terminal, the management node selects a plurality of data nodes to store a plurality of original data blocks obtained from the file by an encoding fault-tolerance technique;
after a data node generates a potential copy in the course of system operations on the file, it queries the management node as to whether the generated potential copy should be retained; a potential copy is an original data block obtained from the file by the encoding fault-tolerance technique, or a copy of an original data block generated in the course of system operations on the file;
the management node determines, according to a preset file life-cycle management policy, whether to retain the potential copy generated by the querying data node, and returns a query response result to the querying data node; the file life-cycle management policy manages the life cycle of each potential copy according to the potential copy's initial generation time and hot/cold degree.
A second aspect of the present invention provides a distributed storage system comprising a management node and a plurality of data nodes:
the management node is configured to, after receiving a file write request sent by a user terminal, select a plurality of data nodes to store a plurality of original data blocks obtained from the file by an encoding fault-tolerance technique;
each data node is configured to, after generating a potential copy in the course of system operations on the file, query the management node as to whether the generated potential copy should be retained; a potential copy is an original data block obtained from the file by the encoding fault-tolerance technique, or a copy of an original data block generated in the course of system operations on the file;
the management node is further configured to determine, according to a preset file life-cycle management policy, whether to retain the potential copy generated by the querying data node, and to return a query response result to the querying data node; the file life-cycle management policy manages the life cycle of each potential copy according to the potential copy's initial generation time and hot/cold degree.
Compared with the prior art, the data fault tolerance method and distributed storage system provided by the invention have the following beneficial effects:
The method distributes the plurality of original data blocks obtained from a file by the encoding fault-tolerance technique to the corresponding data nodes. Compared with the existing approach of immediately generating three copies of each original data block under a traditional redundancy mechanism, this reduces network transmission pressure during the initial upload of the file and markedly mitigates the data write amplification problem. When a data node generates potential copies in the course of file operations, it queries the management node as to whether each generated potential copy should be retained, and the file life-cycle management policy set at the management node destroys the corresponding potential copies. Because potential copies arise from routine system operation and only a query to the management node about whether to retain the copy resource is needed, almost no additional network bandwidth is consumed, so the storage overhead of the system can be reduced while data repair performance is guaranteed.
Drawings
FIG. 1 is a graph of the write performance of copies and RS erasure codes under different egress bandwidth upper limits;
FIG. 2 is a graph of the repair performance of three copies and RS codes with different coding configurations;
FIG. 3 is a flow chart of one embodiment of the data fault tolerance method provided by the present invention;
FIG. 4 is a diagram of the hierarchical file management architecture involved in the data fault tolerance method provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the aging algorithm managing potential-copy access in the data fault tolerance method according to an embodiment of the present invention;
FIG. 6 is a block diagram of another embodiment of the data fault tolerance method of the present invention;
FIG. 7 is a comparison graph of the storage overhead of different redundancy mechanisms provided by the present invention;
FIG. 8 is a comparison graph of the actual network traffic of writes under different redundancy mechanisms and different egress bandwidth upper limits;
FIG. 9 is a comparison graph of the degraded-read probability of RS(3, 2) and of HFPR with different tier ratios under different numbers of node failures;
FIG. 10(a) shows the time required for copies, erasure codes, and HFPR to repair a block according to the present invention;
FIG. 10(b) shows the copy hit rate of HFPR with different tier ratios under different hot/cold data read ratios according to the present invention;
FIG. 11 is a diagram comparing the data transmission success rate of the conventional method and of HFPR under different numbers of transmitted data blocks;
FIG. 12 is a comparison chart of the permanent data failure probability of files stored with different fault-tolerance schemes when 10 to 30 cluster nodes are disconnected.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Before implementing the application, in order to analyze the problems of existing fault-tolerance techniques, the applicant conducted experiments in a weak network environment with two techniques of equal fault-tolerance capability, three copies and RS(3, 2). The experimental results reveal the following problems.
(1) Severe write amplification of the copy fault-tolerance technique
In a weak network environment, a copy write operation transmits a large amount of data in a short time, producing a serious data write amplification problem. The results in FIG. 1 show that writing files of the same size suffers write amplification under both techniques: 1.44-2.6 times with the erasure-code technique and 1.66-3.7 times with the three-copy technique. As the node egress bandwidth drops, the write amplification of erasure codes grows gently, whereas that of the three-copy technique rises rapidly.
The severe data write amplification stems from the heavy load that bulk data distribution places on network links: in a weak network environment, bandwidth is low, signal fluctuation is large, links are frequently congested, and disconnection and reconnection are common, so the actual network traffic generated by a write is far higher than the theoretical data volume. To store the same amount of data, a write under the three-copy technique must transmit 80% more data than under the erasure-code technique, and network performance degrades as the traffic load grows, making the copy technique's write amplification still worse. Data write amplification not only hurts system performance but may also cause distributed data storage to fail; the system then struggles to reach its redundancy target, and data storage reliability is low.
(2) Long repair time of the erasure-code fault-tolerance technique
Even in theory, data repair with the erasure-code technique takes long and its repair performance is unsatisfactory, and the problem worsens in a weak network environment. The results in FIG. 2 show that repair proceeds in units of blocks and that repair overhead grows linearly with the erasure-code parameter k. The copy technique only needs to read a single block from one node; across repeated experiments its repair time stayed stable at about 1.8 s, giving good repair performance. Repairing data with the erasure-code technique requires fetching several data blocks for decoding and reconstruction; in theory, erasure-code repair takes k times as long as copy repair, but in actual transmission the experiments show 1.5k-1.9k times, because in a weak network environment erasure-code repair is affected by the low-bandwidth network, node connectivity is poor, and failed reconnections occur frequently. Although the erasure-code redundancy mechanism theoretically needs only k data blocks for repair, failed reconnections drive the number of nodes actually contacted and the amount of data transmitted far above the theoretical k, so data transmission and repair take long and cannot meet the fast-repair requirement of a weak-network distributed storage cluster.
(3) Large storage overhead and severe write amplification of the traditional hybrid scheme
The traditional hybrid fault-tolerance scheme mixes erasure codes with multiple copies: after erasure coding yields the corresponding data blocks and check blocks, multiple copies are generated immediately and the data is distributed. Compared with a pure erasure-code scheme, the hybrid scheme can use copies to speed up data repair to some extent, but under a mobile environment the nodes storing the copies may fail to connect normally because network signals are unstable; compared with a pure copy scheme, the much larger amount of data to be stored causes write amplification even more severe than the three-copy technique, making the scheme hard to deploy in practice in a distributed storage system under a weak network environment.
These results show that in a weak-network mobile environment, no existing fault-tolerance technique can deliver efficient distributed data storage and efficient data repair at the same time; each has a serious shortcoming in either data storage reliability or performance, so none can be applied directly to such an environment.
In view of this, the present application proposes a data fault tolerance method that balances system storage performance and data storage reliability under a variety of network conditions, with particularly notable benefits for data storage in a weak-network mobile environment. The method first reduces the network pressure of data writes by using an erasure-code fault-tolerance technique, and then adds data redundancy incrementally through potential copies to improve the fault tolerance of the system; after data is lost, the potential copies are used for fast reads and repair, optimizing repair performance.
Specifically, the data fault tolerance method provided by the embodiment of the present invention can be applied to a distributed storage system comprising a management node and a plurality of data nodes.
As shown in FIG. 3, the method includes steps S11 to S13:
s11, after receiving a file writing request sent by a user terminal, the management node selects a plurality of data nodes to store a plurality of original data blocks obtained by a file through an encoding fault tolerance technology; the coding fault tolerance technique may be erasure code coding, for example. Typically, for erasure coding techniques, (k+m) data nodes are selected for distributed storage.
S12: after a data node generates a potential copy in the course of system operations on the file, it queries the management node as to whether the generated potential copy should be retained. A potential copy is an original data block obtained from the file by the encoding fault-tolerance technique, or a copy of an original data block generated in the course of system operations on the file.
S13: the management node determines, according to a preset file life-cycle management policy, whether to retain the potential copy generated by the querying data node, and returns a query response result to the querying data node. The file life-cycle management policy manages the life cycle of each potential copy according to its initial generation time and hot/cold degree.
Specifically, during data collection, reading, and repair by the data nodes, the data generated on each data node can be accessed by any other data node as a shared cluster resource. In this application, such a resource, which would normally be destroyed in memory, is called a potential copy; an original data block is itself a potential copy. For example, original data block 1, obtained by erasure coding the file distributed to data node 1 by the user terminal, is a potential copy; the copy of data block 1 that data node 2 obtains by reading data block 1 from data node 1 is also a potential copy.
It should be noted that the embodiment of the present invention adopts an incremental redundancy technique. Incremental redundancy means that after a file is stored in a distributed manner with the erasure-code fault-tolerance technique, multiple potential copies arise on the data nodes along with the data collection, reading, and repair operations of the data nodes, and the system incrementally retains these potential copies on the data nodes to raise data redundancy step by step. System writes generally take a file as the unit. Under the traditional copy fault-tolerance technique and the traditional hybrid method, suppose a file is encoded with the RS(3, 2) erasure code: after encoding, a system with a traditional redundancy mechanism immediately generates three copies of all 5 data and check blocks and distributes 15 blocks at once within a short time. With incremental redundancy, no extra copies are added right after erasure coding; only the 5 original blocks produced by the encoding are distributed, and during subsequent system operation the data obtained in file reads and repairs is retained on the data nodes as potential copies, raising data redundancy incrementally.
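As a toy illustration of the RS(3, 2) arithmetic above (the helper name is an assumption, not from the patent):

```python
# Blocks that must be distributed at initial write time under each mechanism.

def blocks_at_write(k: int, m: int, copies_per_block: int) -> int:
    return (k + m) * copies_per_block

print(blocks_at_write(3, 2, 3))   # traditional hybrid: 15 blocks up front
print(blocks_at_write(3, 2, 1))   # incremental redundancy: only 5 blocks
```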
In the embodiment of the present invention, the incremental-redundancy method greatly reduces the network transmission pressure during the initial upload of a file and substantially mitigates the data write amplification problem. Because the potential copies arise from routine system operation and only a query to the management node about whether to retain the copy resource is needed, almost no additional network bandwidth is consumed, which is very friendly to a weak-network mobile distributed storage cluster.
In the embodiment of the present invention, setting a file life-cycle management policy at the management node controls the number and life cycle of the potential copies, solving the problem of enormous storage-space consumption that would arise if a cluster only generated potential copies and never cleared them.
In addition, the potential copy and the hybrid fault-tolerance method based on it form an independent module not tied to any particular system, so they are readily portable, and the encoding fault-tolerance technique within the hybrid scheme can be chosen as needed, giving very flexible configuration.
In an alternative embodiment, the file life-cycle management policy is based on a file hierarchy comprising a new layer, a next-new layer, and an old layer; each level is preset with a corresponding level-copy storage capacity threshold.
The management node managing the life cycle of each potential copy according to its initial generation time and hot/cold degree specifically comprises:
the management node dynamically adjusting the level at which a potential copy resides, according to the potential copy's initial generation time and hot/cold degree and the level-copy storage capacity corresponding to each level, thereby managing the potential copy's life cycle.
It should be noted that in the embodiment of the present invention the level-copy storage capacity thresholds apply per individual data block. That is, if file 1 is erasure-coded into 5 data blocks (blocks 1-5) and the thresholds of the new, next-new, and old layers are 3, 2, and 1 in turn, then the new layer can hold up to 3 potential copies of block 1, and likewise for blocks 2-5.
In one embodiment, the management node dynamically adjusts the level at which a potential copy resides according to its initial generation time and hot/cold degree and the level-copy storage capacity corresponding to each level, managing the potential copy's life cycle as follows:
For a newly generated potential copy, or one undergoing level migration, the management node checks whether the number of potential copies at the target level has reached the corresponding level-copy storage threshold; if so, the newly generated or migrating potential copy is destroyed; otherwise it is placed at the corresponding level. A newly generated potential copy is initially placed in the new layer.
The management node uses an activity detector to trigger promotion: the detector counts how many times each potential copy was accessed over a recent historical period, migrates potential copies in the old layer whose access count reaches a preset threshold to the next-new layer, and migrates potential copies in the next-new layer whose access count reaches the threshold to the new layer.
When the management node detects that the number of potential copies in the new layer has reached the new-layer copy storage threshold, it demotes the potential copies whose timers have expired to the next-new layer; meanwhile, if a copy about to be demoted has been accessed a threshold number of times within the recent period, its timer is reset to extend its stay in the new layer. The timer of each potential copy is created when that potential copy is generated.
The management node migrates potential copies in the next-new layer that were not accessed during the recent period to the old layer; alternatively, when it detects that the number of potential copies in the next-new layer has reached the next-new-layer copy storage threshold, it migrates the least-accessed potential copy of the recent period from the next-new layer to the old layer.
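The following sketch shows one way the placement and migration rules above could be organized; the quota and threshold values and all names are illustrative assumptions, not numbers from the patent.

```python
# A sketch of the three-tier placement rules: per-block quotas, demotion on
# timer expiry or coldness, promotion on access count.

from collections import defaultdict

NEW, NEXT_NEW, OLD = "new", "next_new", "old"
QUOTA = {NEW: 3, NEXT_NEW: 2, OLD: 1}     # per-data-block copy thresholds
PROMOTE_THRESHOLD = 5                     # accesses in a period to promote

class TierManager:
    def __init__(self) -> None:
        # tier -> data block id -> potential-copy ids living in that tier
        self.tiers = {t: defaultdict(list) for t in QUOTA}

    def place_new_copy(self, block_id: str, copy_id: str) -> bool:
        """Newly generated potential copies start in the new layer; if the
        block's new-layer quota is full, the copy is destroyed instead."""
        return self._place(NEW, block_id, copy_id)

    def _place(self, tier: str, block_id: str, copy_id: str) -> bool:
        copies = self.tiers[tier][block_id]
        if len(copies) >= QUOTA[tier]:
            return False                  # destroyed (quota reached)
        copies.append(copy_id)
        return True

    def migrate(self, block_id: str, copy_id: str,
                src: str, dst: str) -> bool:
        """Shared mechanics for demotion (timer expiry, coldness) and
        promotion (access count >= PROMOTE_THRESHOLD)."""
        self.tiers[src][block_id].remove(copy_id)
        return self._place(dst, block_id, copy_id)

mgr = TierManager()
for i in range(4):                        # the 4th copy exceeds the quota
    kept = mgr.place_new_copy("blk-1", f"copy-{i}")
    print(i, kept)                        # 0..2 kept, 3 destroyed
mgr.migrate("blk-1", "copy-0", NEW, NEXT_NEW)   # e.g. timer expired
```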
Further, the management node detecting through the activity detector the number of times each potential copy was accessed during a recent historical period, migrating old-layer potential copies whose access count reaches the preset threshold to the next-new layer, and migrating next-new-layer potential copies whose access count reaches the threshold to the new layer, comprises:
the management node monitors the aging variable of each potential copy in the next-new layer and organizes the potential copies into a minimum heap according to their aging variables; each node of the minimum heap represents the aging variable of one potential copy, and each bit of an aging variable marks whether that potential copy was accessed within one period of the recent history;
the management node uses the activity detector to check the access counts, over the recent period, of all leaf nodes of the minimum heap and migrates the potential copies of those leaf nodes whose access count exceeds the threshold to the new layer;
the management node uses the activity detector to check the access counts, over the recent period, of all potential copies in the old layer and migrates those whose access count reaches the threshold to the next-new layer.
For ease of understanding, FIG. 4 shows the schematic diagram of hierarchical file management provided by an embodiment of the present invention, illustrating the overall layered architecture, the key technology modules of each layer, and the conditions for potential-copy migration between tiers.
(1) Overall layered architecture
In the implementation of the invention, when the management node manages potential copies hierarchically, the hierarchy is divided into a new layer, a next-new layer, and an old layer, chiefly to control the number of potential copies of each data block. For example, the copy storage capacity threshold of the new layer may be set to 3, that of the next-new layer to 2, and that of the old layer to 1, i.e., just the original data blocks and check blocks obtained by erasure coding the original file. Each layer is also given an upper limit on storage capacity, e.g., the data in the new layer may not exceed 20% of total storage and the data in the next-new layer may not exceed 60%, with the remaining data all in the old layer, so as to control the storage-space utilization of the whole system.
(2) Hierarchical management process
Newly generated potential copies all start in the new layer. Two demotion steps lead from the new layer to the old layer. The shared demotion condition is that the number of potential copies at the current level exceeds the configured copy storage capacity threshold; the steps differ as follows. 1) Demotion from the new layer to the next-new layer is judged by the potential copy's initial generation time. The timer is usually set short to avoid quickly hitting the level's storage threshold; at the same time, to prevent a frequently accessed copy from being demoted merely because it reached the time threshold, which would hurt system performance, a filter examines each potential copy during demotion, and if the copy is accessed frequently its timer is reset, extending its stay in the new layer. 2) The deciding factor from the next-new layer to the old layer is the coldness or hotness of the potential copies, i.e., their access count: copies not accessed within a certain period are demoted to the old layer, where only one potential copy is retained.
Promotion is triggered by an activity detector, which records the number of times a potential copy is accessed over a period and promotes the copy if a threshold is reached. The objects the module examines differ slightly per level: in the next-new layer, the activity detector only needs to check all leaf nodes of the minimum heap; in the old layer, all potential copies must be monitored.
(3) File aging management
File aging management consists of two steps. 1) The aging algorithm monitors accesses to the potential copies: an N-bit aging variable is maintained for each potential copy, each bit indicating whether the copy was accessed during one past period (1 = accessed, 0 = not accessed); at the start of each new period, the variable is shifted right by one bit. 2) Classifying files: files are divided into cold and hot data by whether the aging variable equals 0; a value of 0 marks cold data and a nonzero value marks hot data.
FIG. 5 illustrates how the aging algorithm records potential-copy accesses. Assume 4 periods have elapsed, during which the potential copies listed in the figure are under aging management. The minimum heap is traversed periodically; when the aging variable of potential copy 2 is found to be 0, the system judges it to be cold data, demotes it, and updates the heap. If several potential copies have an aging variable of 0, the demote-and-update operation is repeated until the potential copy at the top of the heap has been accessed within the past N periods.
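A minimal sketch of this aging bookkeeping follows; N = 8, the copy names, and the choice of recording the new access bit in the most significant position are illustrative assumptions.

```python
# An N-bit aging variable per potential copy, right-shifted each period;
# copies whose variable decays to 0 are cold.

import heapq

N_BITS = 8
TOP_BIT = 1 << (N_BITS - 1)

class AgedCopy:
    def __init__(self, copy_id: str):
        self.copy_id = copy_id
        self.aging = 0              # N-bit access history, newest bit highest
        self.accessed = False       # accessed during the current period?

    def end_period(self) -> None:
        # Shift history right; record this period's access in the top bit.
        self.aging >>= 1
        if self.accessed:
            self.aging |= TOP_BIT
        self.accessed = False

    def is_cold(self) -> bool:
        return self.aging == 0      # no access in the last N periods

copies = [AgedCopy(f"copy-{i}") for i in range(3)]
copies[0].accessed = True           # only copy-0 is touched this period
for c in copies:
    c.end_period()

# Min-heap keyed by aging variable: the coldest copy sits at the top,
# mirroring the periodic heap traversal that finds demotion candidates.
heap = [(c.aging, c.copy_id) for c in copies]
heapq.heapify(heap)
print(heap[0])                      # (0, 'copy-1'): a cold copy at the top
```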
With the above technical scheme, the embodiment of the present invention manages potential copies hierarchically according to their generation time and hot/cold degree through the designed file life-cycle management policy. Copy storage overhead is strictly controlled while good system reliability and runtime performance are preserved, and the tier storage-capacity ratios of the hierarchical management can be adjusted dynamically, so users can configure them flexibly to their own needs.
Further, the influence of the data transmission rate between data nodes on data distribution and on the utilization of potential copies must be considered, especially in a weak network environment. On the one hand, in a weak network environment the data transmission success rate between data nodes seriously affects the system's data distribution and repair, and a low success rate easily leads to low data storage reliability. On the other hand, operations such as data collection, reading, and repair are performed per data node, and the resulting potential copies are stored on individual data nodes, so the state of the node holding a potential copy strongly correlates with copy utilization. If a data node sits in a poor-bandwidth environment for a long time, disconnects frequently, or even faces a high risk of failure, the probability that the copies on it can provide service is small; storage resources are wasted and the repair process gains nothing. Therefore, data nodes in better condition should be selected as servers to store copies, so that service can be provided as stably as possible and copy resources are not wasted.
Thus, in one embodiment, step S11, in which the management node, after receiving a file write request sent by a user terminal, selects a plurality of data nodes to store the plurality of original data blocks obtained from the file by erasure coding, specifically comprises:
after receiving the file write request sent by the user terminal, the management node selects, according to the state prediction result of each data node, a plurality of data nodes whose state prediction results satisfy a preset state condition to store the plurality of original data blocks obtained from the file by erasure coding.
Specifically, for an erasure-coding scheme in which the management node stores the original data blocks on (k + m) data nodes, the (k + m) data nodes with the best state prediction results are selected for storage.
In one embodiment, the state prediction result of each data node is obtained by jointly considering the node's probability of normal operation and its load.
That is, in the embodiment of the present invention, data layout jointly considers the failure probability and the load of each data node. The state prediction result of data node i is denoted H_i and given by H_i = P[n_i] × K_i, where P[n_i] is the probability that the data node operates normally and K_i is the availability of the data node; the lower the load, the higher the availability.
In one embodiment, the probability of normal operation of a data node is calculated as follows. Two causes of failure are considered: temporary disconnection from the network, and damage under the given environment. Assuming these events occur independently, the normal-operation probability of the ith data node is
P[n_i] = (1 - P_net[n_i]) × (1 - P_env[n_i]),
where P_net[n_i] and P_env[n_i] denote the probabilities that the ith data node fails due to network factors and due to environmental factors, respectively.
For failures caused by network disconnection: in a mobile distributed application scenario, the probability of a data node disconnecting from the network is related to its movement behavior. The data nodes are divided into several groups; each group has a central node, and the other data nodes follow the central node along a certain movement track. Network connectivity has been studied extensively and is mostly modeled with a Poisson distribution:
f(k; λ) = λ^k × e^(-λ) / k!,
where k is a parameter and λ is the expected number of times the data node is inaccessible between (t - T_last_off) and (t + T). T_off is the expected duration of a temporary network disconnection, which can be estimated from the record of historical disconnections; T_last_off is the time elapsed since the data node's last connection failure; T is the prediction horizon measured from the current time t. λ can then be computed as
λ = (T_last_off + T) / T_off,
and the probability that the node fails because its network connection drops as
P_net[n_i] = 1 - f(0; λ) = 1 - e^(-λ).
For failures caused by environmental factors: work tasks differ in their demands, and the degree of danger a data node faces differs accordingly; in an earthquake disaster, for example, the risk factor is lower when the data node is on open, level ground and higher near buildings. Such an environment-specific failure probability is known in advance and requires no additional calculation, so P_env[n_i] can be treated as a known constant.
In one embodiment, the load of a data node is represented by its availability, calculated as
K_i = Σ_{e∈E} ω_e × β_e, where E = {cpu, io, memory, storage},
K_i is the availability of data node i, E is the set of attributes affecting the node's availability, ω_e is the weight of attribute e, and β_e is the score of attribute e on data node i; cpu denotes the node's CPU, io its disk throughput, memory its memory, and storage its storage space.
That is, in the embodiment of the present invention, the availability of a data node is affected by its CPU, disk throughput, memory, and storage-space attributes. The influence of network factors has already been estimated in P_net[n_i] and is therefore not counted again in the availability. Each attribute has a corresponding weight, and the weighted scores of the indices are summed to obtain the overall availability score of data node i.
The weight of each factor can be adjusted dynamically in different scenarios to adapt to scene changes and achieve a more accurate prediction. In a mobile scenario, for example, the network clearly has a greater influence on system operation, and network signal fluctuation together with an overall low-bandwidth environment strongly affects data transmission, so the weight of network factors can be amplified in the calculation.
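A sketch combining the normal-operation probability and the availability into the node score H_i = P[n_i] × K_i is given below; the weight values, attribute scores, and time parameters are illustrative assumptions, and the network term follows the Poisson-based reconstruction above.

```python
# Node-state prediction H_i = P[n_i] * K_i, then pick the k + m best nodes.

import math

def p_network_failure(T_last_off: float, T: float, T_off: float) -> float:
    """P_net: probability of >= 1 disconnection in (t - T_last_off, t + T),
    with lam the expected number of disconnections in that window."""
    lam = (T_last_off + T) / T_off
    return 1.0 - math.exp(-lam)          # 1 - f(0; lam) under Poisson

def p_normal(T_last_off: float, T: float, T_off: float,
             p_env: float) -> float:
    """P[n_i]: independent network and environmental failure causes."""
    return (1.0 - p_network_failure(T_last_off, T, T_off)) * (1.0 - p_env)

WEIGHTS = {"cpu": 0.3, "io": 0.3, "memory": 0.2, "storage": 0.2}

def availability(scores: dict) -> float:
    """K_i: weighted sum of per-attribute scores in [0, 1]."""
    return sum(WEIGHTS[e] * scores[e] for e in WEIGHTS)

def node_score(T_last_off, T, T_off, p_env, scores) -> float:
    return p_normal(T_last_off, T, T_off, p_env) * availability(scores)

# Rank candidate nodes and pick the k + m best for distributing a stripe.
nodes = {
    "n1": node_score(2.0, 1.0, 10.0, 0.01,
                     {"cpu": 0.9, "io": 0.8, "memory": 0.7, "storage": 0.9}),
    "n2": node_score(8.0, 1.0, 10.0, 0.05,
                     {"cpu": 0.5, "io": 0.6, "memory": 0.6, "storage": 0.4}),
}
k, m = 3, 2
best = sorted(nodes, key=nodes.get, reverse=True)[: k + m]
print(best)                              # ['n1', 'n2']
```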
Through the prediction of data-node states, the system can choose data nodes in better condition to place data blocks, reduce disconnection and reconnection or outright transmission failures during data writes, improve the data transmission success rate, and let the potential copies stored on those nodes provide fast read and repair service stably for as long as possible.
In summary, the embodiments of the present invention provide a hybrid fault-tolerance scheme based on potential copies; the overall architecture is shown in FIG. 6. The method guides data distribution and layout according to the results of a node-state prediction mechanism, greatly improving the data transmission success rate and the utilization of potential copies, and sets a file life-cycle management policy at the management node to manage data in tiers according to its generation time and hot/cold degree, strictly controlling copy storage overhead while preserving good system reliability and runtime performance.
Correspondingly, an embodiment of the present invention further provides a distributed storage system comprising a management node and a plurality of data nodes:
the management node is configured to, after receiving a file write request sent by a user terminal, select a plurality of data nodes to store a plurality of original data blocks obtained from the file by an encoding fault-tolerance technique;
each data node is configured to, after generating a potential copy in the course of system operations on the file, query the management node as to whether the generated potential copy should be retained; a potential copy is an original data block obtained from the file by the encoding fault-tolerance technique, or a copy of an original data block generated in the course of system operations on the file;
the management node is further configured to determine, according to a preset file life-cycle management policy, whether to retain the potential copy generated by the querying data node, and to return a query response result to the querying data node; the file life-cycle management policy manages the life cycle of each potential copy according to its initial generation time and hot/cold degree.
Preferably, the file life-cycle management policy set at the management node is based on a file hierarchy comprising a new layer, a next-new layer, and an old layer, each level being preset with a corresponding level-copy storage capacity threshold;
the management node managing the life cycle of each potential copy according to its initial generation time and hot/cold degree specifically comprises:
the management node dynamically adjusting the level at which a potential copy resides, according to the potential copy's initial generation time and hot/cold degree and the level-copy storage capacity corresponding to each level, thereby managing the potential copy's life cycle.
It should be noted that the distributed storage system provided by the embodiment of the present invention performs all the steps and processes of the data fault tolerance method; the working principles and beneficial effects of the two correspond one to one and are not repeated here.
To better demonstrate the beneficial effects of the technical scheme, the applicant ran experiments on the storage overhead, system reliability, and data storage reliability of different redundancy mechanisms. The experiments control the total storage overhead of the cluster by setting the storage-capacity ratio of the different levels (new : next-new : old). The comparison of storage overhead in FIG. 7 shows that the higher the share of files in the old layer, the smaller the upper limit of overall storage overhead (HFPR in the figure is the fault-tolerance technique proposed in this application, and the (1:5:4) under HFPR is the storage-capacity ratio of the three levels of potential copies); among the three HFPR configurations tested, storage overhead grows by at most 11.11% compared with three copies. In theory, higher storage overhead means better system performance. The storage-overhead values in the figure are theoretical upper limits computed from the configured per-layer ratios; during actual operation, a file that is never read has only one potential copy, at the uploading node, and even that copy is reclaimed after a period without access. Hence the storage share actually used by the new and next-new layers is likely far below the configured values, the actual storage overhead is smaller than the theoretical value, and users obtain fast reads and repair at lower cost.
FIG. 8 shows the actual network traffic of three copies, RS(3, 2), and HFPR when transmitting a 6 MB file over node bandwidths of 1 MB/s to 3 MB/s, to evaluate data write amplification under the different strategies. HFPR's write amplification is only 1.09-2 times, a reduction of 0.35-0.6 times relative to RS(3, 2) and 0.57-1.7 times relative to three copies. Because HFPR adds a node-state prediction mechanism, it selects better-conditioned nodes for distributed storage while satisfying the cross-group and nearby data-layout principles, reducing unnecessary reconnections and retransmissions between nodes, improving the transmission success rate, and making better use of network bandwidth.
FIG. 9 shows the degraded-read probability of RS(3, 2) and of HFPR with different tier ratios under different numbers of node failures. The results show that the degraded-read probability is proportional to the number of failed nodes and inversely proportional to storage overhead; compared with the RS(3, 2) erasure-code technique, the HFPR scheme greatly reduces the system's degraded reads. This is because a cluster using HFPR holds a large number of potential copies, which raises data redundancy and tolerates multi-node failures better, thereby avoiding many degraded reads.
FIG. 10(a) shows the time required for copies, erasure codes, and HFPR to repair one block. The results show that HFPR's repair time is far below that of erasure codes, is of the same order of magnitude as copy repair, and even beats three-copy repair, cutting repair delay by more than 21%. HFPR adds node-state prediction: during repair it can pick, from the node-state information, the nodes most likely to deliver the data successfully, reducing failed reads that would require re-requesting other nodes holding copies.
FIG. 10(b) shows the copy hit rate of HFPR with different tier ratios under different hot/cold data read ratios. The figure shows that the HFPR scheme can repair 92.73% of the data quickly through potential copies, and that the higher the share of the new and next-new layers, the more frequent accesses to hot data raise the copy hit probability. The reason is that a higher share of the new and next-new layers means more files may have potential copies in the system. The hot/cold access ratio affects the copy hit rate as follows: frequently accessed hot data is less likely to be demoted, so hot data stays longer in the new and next-new layers and keeps more potential copies; if cold data takes a larger share of accesses, potential copies migrate frequently between layers, which amounts to constantly refreshing a cache and lowers the copy hit rate.
FIG. 11 shows the transmission success rate p_m of the conventional method and of HFPR under different numbers of transmitted data blocks. The conventional method here means randomly selecting nodes for transmission when writing data, and reading all data blocks and check blocks sequentially when reading or repairing. The results show that p_m of the conventional method stabilizes at 0.65, while after adding HFPR's data-node state prediction p_m stays above 0.9. This again is the benefit of data-node state prediction: selecting data nodes in better condition yields better connectivity among the data nodes.
The data permanent failure probability is the probability of completely losing a file when multiple cluster nodes fail, and it is an important index of data storage reliability. Fig. 12 shows the permanent failure probability when reading a file stored under the different fault tolerance schemes while 10 to 30 cluster nodes are offline. The results show that HFPR, which carries more redundancy than RS(3,2), reduces the data permanent failure probability by at least half, and the larger the proportion of the new and next new layers, the lower that probability. A cluster running HFPR holds a large number of potential copies, so its data redundancy is high, its fault tolerance is better, and data storage reliability improves.
The high data transmission success rate lets the system store and repair data in a timely and efficient manner, which preserves the system's redundancy, tolerates multi-node failures, and keeps the permanent failure probability low.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.

Claims (7)

1. A data fault tolerance method, wherein the data fault tolerance method is applied to a distributed storage system, and the distributed storage system comprises a management node and a plurality of data nodes;
the method comprises the following steps:
after receiving a file writing request sent by a user terminal, the management node selects a plurality of data nodes to store a plurality of original data blocks obtained from the file through an encoding fault tolerance technology;
after a data node generates a potential copy in the course of system operations on the file, the data node queries the management node on whether the generated potential copy should be retained; a potential copy is an original data block obtained from the file through the encoding fault tolerance technology, or a copy of an original data block generated in the course of system operations on the file;
the management node determines, according to a preset file life cycle management strategy, whether to retain the potential copy generated by the querying data node, and returns a query response to that data node; the file life cycle management strategy manages the life cycle of each potential copy according to the potential copy's initial generation time and degree of hotness;
the file life cycle management strategy is based on a file layered architecture comprising a new layer, a next new layer and an old layer; each layer is preset with a corresponding layer copy storage capacity threshold;
the management node manages the life cycle of each potential copy according to the initial generation time and the degree of hotness of the potential copy, which specifically comprises:

the management node dynamically adjusts the layer of each potential copy according to its initial generation time, its degree of hotness, and the layer copy storage capacity corresponding to each layer, thereby managing the life cycle of the potential copy;

the management node dynamically adjusting the layer of each potential copy according to its initial generation time, its degree of hotness, and the layer copy storage capacity corresponding to each layer specifically comprises the following steps:

for a newly generated potential copy or a potential copy undergoing layer migration, the management node detects whether the number of potential copies in the target layer has reached the corresponding layer copy storage threshold; if so, the newly generated or migrating potential copy is destroyed; otherwise, the newly generated or migrating potential copy is placed in the corresponding layer, a newly generated potential copy being initially placed in the new layer;

the management node sets an activity detector to trigger layer promotion of potential copies; within the activity detector, it detects the number of times each potential copy was accessed over a historical period, migrates potential copies in the old layer whose access count reaches a preset count threshold to the next new layer, and migrates potential copies in the next new layer whose access count reaches the count threshold to the new layer;

when the management node detects that the number of potential copies in the new layer has reached the new layer copy storage threshold, it demotes the potential copies whose timers have expired to the next new layer; meanwhile, if the access count of a copy pending demotion reaches the count threshold within a historical period, the management node resets that copy's timer to extend the time the copy remains in the new layer, the timer of each potential copy being created when the corresponding potential copy is generated;

and the management node migrates the potential copies in the next new layer that were not accessed during the historical period to the old layer, or, upon detecting that the number of potential copies in the next new layer has reached the next new layer copy storage threshold, migrates the potential copy in the next new layer with the fewest accesses in the historical period to the old layer.
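For illustration only, the three-layer lifecycle policy of claim 1 can be sketched in Python. Everything below is a hypothetical model: the class and method names, the capacity dictionary, the one-hour timer, and the promotion threshold are assumptions, not the patented implementation.

```python
import time

# Hedged sketch of claim 1's three-layer potential-copy lifecycle.
# Layer names mirror the claim; thresholds and timer values are assumed.

NEW, NEXT_NEW, OLD = "new", "next_new", "old"

class PotentialCopy:
    def __init__(self, block_id, ttl_seconds=3600):
        self.block_id = block_id
        self.created = time.time()
        self.deadline = self.created + ttl_seconds  # timer created with the copy
        self.recent_accesses = 0                    # accesses in the current history period

class TierManager:
    def __init__(self, capacity, promote_threshold=3):
        self.capacity = capacity                    # e.g. {"new": 100, "next_new": 200, "old": 400}
        self.layers = {NEW: [], NEXT_NEW: [], OLD: []}
        self.promote_threshold = promote_threshold

    def admit(self, copy, layer=NEW):
        """Place a new or migrating copy; destroy it if the layer is already full."""
        if len(self.layers[layer]) >= self.capacity[layer]:
            return False                            # layer copy storage threshold reached
        self.layers[layer].append(copy)
        return True

    def demote_expired(self):
        """New layer full: push expired copies down, but keep hot ones by resetting timers."""
        if len(self.layers[NEW]) < self.capacity[NEW]:
            return
        now = time.time()
        for copy in list(self.layers[NEW]):
            if now >= copy.deadline:
                if copy.recent_accesses >= self.promote_threshold:
                    copy.deadline = now + 3600      # still hot: delay its stay in the new layer
                else:
                    self.layers[NEW].remove(copy)
                    self.admit(copy, NEXT_NEW)      # destroyed if the next new layer is full

    def promote_active(self):
        """Activity detector: raise copies whose access count passed the threshold."""
        for src, dst in ((NEXT_NEW, NEW), (OLD, NEXT_NEW)):  # order avoids double promotion
            for copy in list(self.layers[src]):
                if copy.recent_accesses >= self.promote_threshold:
                    self.layers[src].remove(copy)
                    self.admit(copy, dst)
```

In this sketch, `admit` returning False stands in for destroying a copy whose target layer is full; a real system would also update the management node's metadata.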
2. The data fault tolerance method of claim 1, wherein the management node detecting, in the activity detector, the number of times each potential copy was accessed over a historical period, migrating potential copies in the old layer whose access count reaches the preset count threshold to the next new layer, and migrating potential copies in the next new layer whose access count reaches the count threshold to the new layer, specifically comprises:

the management node monitors the aging variable of each potential copy in the next new layer and organizes the potential copies into a minimum heap according to their aging variables, each node in the minimum heap representing the aging variable of one potential copy, and each bit of an aging variable marking whether the potential copy was accessed within one period of the history;

the management node detects, through the activity detector, the access counts of all leaf nodes of the minimum heap over a historical period, and migrates the potential copies corresponding to the leaf nodes whose access count exceeds the count threshold to the new layer;

and the management node detects, through the activity detector, the access counts of all potential copies in the old layer over a historical period, and migrates the potential copies in the old layer whose access count reaches the count threshold to the next new layer.
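The aging-variable bookkeeping of claim 2 can likewise be sketched. In the hedged example below, each potential copy carries one bit per history period; the window length of 8 periods is an assumption, and Python's array-backed `heapq` is used so that the heap's leaf nodes occupy the back half of the list:

```python
import heapq

# Hedged sketch of claim 2: a bitmask aging variable per potential copy,
# organized as a min-heap so the coldest copies sit near the root.

WINDOW_BITS = 8  # assumed number of history periods tracked

class AgedCopy:
    def __init__(self, block_id):
        self.block_id = block_id
        self.aging = 0  # bit i set => the copy was accessed during period i

    def record_period(self, accessed):
        """At the end of each history period, shift in one access bit."""
        self.aging = ((self.aging << 1) | int(accessed)) & ((1 << WINDOW_BITS) - 1)

    def access_count(self):
        return bin(self.aging).count("1")  # periods with at least one access

def build_min_heap(copies):
    heap = [(c.aging, c.block_id, c) for c in copies]
    heapq.heapify(heap)  # smallest aging variable (coldest copy) first
    return heap

def promote_from_leaves(heap, threshold):
    """Leaves of an array-backed binary heap occupy indices len//2 .. len-1;
    copies accessed in more than `threshold` periods are promotion candidates."""
    return [c for _, _, c in heap[len(heap) // 2:] if c.access_count() > threshold]
```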
3. The data fault tolerance method of claim 1, wherein the management node, after receiving the file writing request sent by the user terminal, selecting a plurality of data nodes to store the plurality of original data blocks obtained from the file through the encoding fault tolerance technology specifically comprises:

after receiving the file writing request sent by the user terminal, the management node, according to the state prediction result of each data node, selects a plurality of data nodes whose state prediction results meet a preset state condition to store the plurality of original data blocks obtained from the file through the encoding fault tolerance technology.
4. The data fault tolerance method according to claim 3, wherein the state prediction result of each data node is obtained by jointly considering the normal operation probability of the data node and the load of the data node.
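A minimal sketch of the write-time selection in claims 3 and 4 follows. Combining the normal operation probability and the availability by multiplication, and the 0.5 cutoff, are assumptions; the patent only requires that the prediction result meet a preset state condition.

```python
# Hedged sketch of claims 3-4: choose data nodes whose predicted state
# passes a preset condition, preferring the best-scoring nodes.

def node_score(p_normal, availability):
    # ASSUMPTION: the two factors are combined multiplicatively.
    return p_normal * availability

def select_nodes(nodes, n_blocks, min_score=0.5):
    """nodes: list of (node_id, p_normal, availability); returns n_blocks node ids."""
    scored = [(node_score(p, a), nid) for nid, p, a in nodes]
    qualified = sorted(s for s in scored if s[0] >= min_score)[::-1]
    if len(qualified) < n_blocks:
        raise RuntimeError("not enough data nodes meet the preset state condition")
    return [nid for _, nid in qualified[:n_blocks]]

# Example: place 3 encoded blocks on the best of 5 candidate nodes.
print(select_nodes(
    [("n1", 0.95, 0.80), ("n2", 0.60, 0.90), ("n3", 0.99, 0.70),
     ("n4", 0.40, 0.95), ("n5", 0.90, 0.85)],
    n_blocks=3,
))  # ['n5', 'n1', 'n3']
```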
5. The data fault tolerance method of claim 4, wherein the normal operation probability of the data node is calculated according to the following formulas:
[The three formulas are rendered only as images in the source document.]

wherein P[n_i] is the normal operation probability of the i-th data node; the two image-rendered symbols denote, respectively, the failure probability of the i-th data node caused by network factors and the failure probability caused by environmental factors; T_off represents the expected time for which a data node is temporarily disconnected from the network; T_last_off is the time elapsed since the data node's last connection failure; k is a parameter; λ represents the expected number of times the data node is inaccessible from (t - T_last_off) to (t + T); f(k; λ) is the Poisson distribution formula; and t represents the running time elapsed so far, with T the time still to run from t.
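The formulas in claim 5 survive only as images, so they cannot be reproduced exactly; what can be grounded is the Poisson component f(k; λ) that the claim names. The sketch below computes the standard Poisson probability mass and combines the two failure causes under a loudly flagged independence assumption that is not taken from the patent:

```python
import math

# f(k; lambda) is the standard Poisson pmf, as named in claim 5.
def poisson_pmf(k, lam):
    """f(k; lambda) = lambda**k * exp(-lambda) / k!"""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# ASSUMPTION: network and environment failures are independent, so the node
# runs normally only if neither occurs. The patented formula is an image
# in the source and may differ.
def p_normal(p_fail_network, p_fail_environment):
    return (1.0 - p_fail_network) * (1.0 - p_fail_environment)

print(poisson_pmf(0, 0.5))   # ~0.6065: chance of zero inaccessible events when lambda = 0.5
print(p_normal(0.05, 0.02))  # 0.931 under the independence assumption
```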
6. The data fault tolerance method of claim 4, wherein the load of the data node is represented by the availability of the data node, and the availability of the data node is calculated by the following formula:
[The formula is rendered only as an image in the source document.]

wherein K is the availability of the data node; E is the set of attributes affecting the availability of the data node; ω_e is the weight of attribute e of the data node; β_e is the value of attribute e of the data node; cpu is the CPU of the data node, io is the disk throughput of the data node, memory is the memory of the data node, and storage is the storage space of the data node.
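Claim 6's availability formula is likewise an image in the source. A natural reading of the surrounding text is a weighted sum over the attribute set E; the sketch below assumes exactly that form, K = Σ ω_e · β_e, with normalized attribute values, both of which are assumptions:

```python
# Hedged sketch of claim 6: availability K as an assumed weighted sum
# over the attribute set E = {cpu, io, memory, storage}.

def availability(attrs, weights):
    """attrs, weights: dicts keyed by attribute name; values assumed normalized to [0, 1]."""
    assert set(attrs) == set(weights), "every attribute needs a weight"
    return sum(weights[e] * attrs[e] for e in attrs)

K = availability(
    attrs={"cpu": 0.7, "io": 0.9, "memory": 0.6, "storage": 0.8},
    weights={"cpu": 0.4, "io": 0.3, "memory": 0.2, "storage": 0.1},  # illustrative, sum to 1
)
print(K)  # 0.75
```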
7. A distributed storage system, comprising a management node and a plurality of data nodes, wherein:
the management node is configured to, after receiving a file writing request sent by a user terminal, select a plurality of data nodes to store a plurality of original data blocks obtained from the file through an encoding fault tolerance technology;
the data node is configured to, after generating a potential copy in the course of system operations on the file, query the management node on whether the generated potential copy should be retained; a potential copy is an original data block obtained from the file through the encoding fault tolerance technology, or a copy of an original data block generated in the course of system operations on the file;
the management node is further configured to determine, according to a preset file life cycle management strategy, whether to retain the potential copy generated by the querying data node, and to return a query response to that data node; the file life cycle management strategy manages the life cycle of each potential copy according to the potential copy's initial generation time and degree of hotness;
the file life cycle management strategy set by the management node is based on a file layered architecture comprising a new layer, a next new layer and an old layer; each layer is preset with a corresponding layer copy storage capacity threshold;
the management node manages the life cycle of each potential copy according to the initial generation time and the degree of hotness of the potential copy, which specifically comprises:

the management node dynamically adjusts the layer of each potential copy according to its initial generation time, its degree of hotness, and the layer copy storage capacity corresponding to each layer, thereby managing the life cycle of the potential copy;

the management node dynamically adjusting the layer of each potential copy according to its initial generation time, its degree of hotness, and the layer copy storage capacity corresponding to each layer specifically comprises the following steps:

for a newly generated potential copy or a potential copy undergoing layer migration, the management node detects whether the number of potential copies in the target layer has reached the corresponding layer copy storage threshold; if so, the newly generated or migrating potential copy is destroyed; otherwise, the newly generated or migrating potential copy is placed in the corresponding layer, a newly generated potential copy being initially placed in the new layer;

the management node sets an activity detector to trigger layer promotion of potential copies; within the activity detector, it detects the number of times each potential copy was accessed over a historical period, migrates potential copies in the old layer whose access count reaches a preset count threshold to the next new layer, and migrates potential copies in the next new layer whose access count reaches the count threshold to the new layer;

when the management node detects that the number of potential copies in the new layer has reached the new layer copy storage threshold, it demotes the potential copies whose timers have expired to the next new layer; meanwhile, if the access count of a copy pending demotion reaches the count threshold within a historical period, the management node resets that copy's timer to extend the time the copy remains in the new layer, the timer of each potential copy being created when the corresponding potential copy is generated;

and the management node migrates the potential copies in the next new layer that were not accessed during the historical period to the old layer, or, upon detecting that the number of potential copies in the next new layer has reached the next new layer copy storage threshold, migrates the potential copy in the next new layer with the fewest accesses in the historical period to the old layer.
CN202210464602.7A 2022-04-29 2022-04-29 Data fault tolerance method and distributed storage system Active CN114896098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210464602.7A CN114896098B (en) 2022-04-29 2022-04-29 Data fault tolerance method and distributed storage system


Publications (2)

Publication Number Publication Date
CN114896098A CN114896098A (en) 2022-08-12
CN114896098B true CN114896098B (en) 2023-05-05

Family

ID=82720103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210464602.7A Active CN114896098B (en) 2022-04-29 2022-04-29 Data fault tolerance method and distributed storage system

Country Status (1)

Country Link
CN (1) CN114896098B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103118133A (en) * 2013-02-28 2013-05-22 浙江大学 Mixed cloud storage method based on file access frequency
CN103838860A (en) * 2014-03-19 2014-06-04 华存数据信息技术有限公司 File storing system based on dynamic transcript strategy and storage method of file storing system
CN105956128A (en) * 2016-05-09 2016-09-21 南京大学 Self-adaptive encoding storage fault-tolerant method based on simple regenerating code
CN108512908A (en) * 2018-03-13 2018-09-07 山东超越数控电子股份有限公司 A kind of cloud storage fault tolerant mechanism based on Ceph and the web-based management platform based on Ceph
CN113127462A (en) * 2020-01-10 2021-07-16 联洋国融(北京)科技有限公司 Integrated big data management platform based on life cycle management

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552038B2 (en) * 2016-05-13 2020-02-04 International Business Machines Corporation Object storage architecture based on file_heat


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程振东; 栾钟治; 孟由; 李亮淑; 和荣; 杨婷婷; 钱德沛; 管刚; 陈伟. Research and Implementation of Erasure Code Technology in Cloud File Systems [云文件系统中纠删码技术的研究与实现]. 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2013, (04), full text. *

Also Published As

Publication number Publication date
CN114896098A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
US20220222157A1 (en) Policy-based hierarchical data protection in distributed storage
CN110169040B (en) Distributed data storage method and system based on multilayer consistent hash
US9785498B2 (en) Archival storage and retrieval system
Weil et al. Rados: a scalable, reliable storage service for petabyte-scale storage clusters
Weatherspoon et al. Erasure coding vs. replication: A quantitative comparison
US9703504B2 (en) Storage system, recording medium storing data rebalancing program, and data rebalancing method
US20090094318A1 (en) Smart access to a dispersed data storage network
US8707098B2 (en) Recovery procedure for a data storage system
CN110442535B (en) Method and system for improving reliability of distributed solid-state disk key value cache system
US20080151724A1 (en) Systems and methods for managing unavailable storage devices
EP3537687B1 (en) Access method for distributed storage system, related device and related system
CN106708653B (en) Mixed tax big data security protection method based on erasure code and multiple copies
US9411682B2 (en) Scrubbing procedure for a data storage system
CN102681791B (en) Energy saving storage method of cluster storage system
CN111400083A (en) Data storage method and system and storage medium
CN105956128A (en) Self-adaptive encoding storage fault-tolerant method based on simple regenerating code
Venkatesan et al. Effect of replica placement on the reliability of large-scale data storage systems
CN112000278B (en) Self-adaptive local reconstruction code design method for thermal data storage and cloud storage system
CN114896098B (en) Data fault tolerance method and distributed storage system
Giroire et al. Peer-to-peer storage systems: a practical guideline to be lazy
CN116339644B (en) Method, device, equipment and medium for creating redundant array of independent disk
CN110032338B (en) Erasure code oriented data copy placement method and system
Akash et al. Rapid: A fast data update protocol in erasure coded storage systems for big data
Caron et al. P2P storage systems: Study of different placement policies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant