CN110737924A - method and equipment for data protection - Google Patents

Method and equipment for data protection

Info

Publication number
CN110737924A
Authority
CN
China
Prior art keywords
fault
domains
distributed storage
storage cluster
domain
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810803843.3A
Other languages
Chinese (zh)
Other versions
CN110737924B (en)
Inventor
李宏杰
张绍文
郭占东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongchang (suzhou) Software Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Zhongchang (suzhou) Software Technology Co Ltd
China Mobile Communications Group Co Ltd
Application filed by Zhongchang (suzhou) Software Technology Co Ltd and China Mobile Communications Group Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/70 Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
    • G06F21/78 Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data
    • G06F21/79 Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data in semiconductor storage media, e.g. directly-addressable memories
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data protection method and device, which are used to solve the problem in the prior art that, after a fault domain fails, the performance of a distributed storage cluster degrades and service operation is affected. A distributed storage cluster can repartition the hard disks in the fault domains that have not failed to obtain at least one new fault domain, reselect the hard disks for data fragment distribution in the new fault domain according to an initial data fragment distribution rule, and perform recovery and rebalancing on the reselected hard disks.

Description

Method and equipment for data protection
Technical Field
The invention relates to the technical field of computers, and in particular to a data protection method and device.
Background
With the explosive growth of data, traditional data storage can no longer meet demand, and distributed storage has become the first choice. Distributed storage falls into several classes: distributed block storage, distributed file storage, and distributed object storage.
Existing distributed storage clusters usually protect data through data redundancy. Typical redundancy modes are the copy strategy and the erasure code strategy, both of which generate data fragments (such as copies and erasure code check blocks). An existing distributed storage cluster is divided into fault domains and, by establishing corresponding distribution rules, places the data fragments in different fault domains. Fault domains can be divided manually based on the physical topology, for example with one machine room as one fault domain, so that data is not lost when one fault domain fails and becomes unavailable.
In the prior art, once fault domain division and a rule for distributing data fragments across fault domains have been adopted for a distributed storage cluster, any change in the physical topology of the cluster (whether a fault, a capacity expansion, or a capacity reduction) means that the original fault domain division and cross-fault-domain distribution rule can no longer meet the requirements, so that the performance of the distributed storage cluster degrades and service operation is affected.
Disclosure of Invention
The invention provides a data protection method and device, which are used to solve the problem in the prior art that, after a fault domain fails, the performance of a distributed storage cluster degrades and service operation is affected.
In a first aspect, a method for data protection comprises: repartitioning the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain; reselecting, in the new fault domain, the hard disks over which data fragments are distributed according to the initial data fragment distribution rule; and recovering and rebalancing the data fragments of the failed fault domain on the reselected hard disks.
In the embodiment of the invention, after a fault domain fails, the distributed storage cluster can promptly repartition the hard disks in the fault domains that have not failed to obtain a new fault domain, and then recover the lost data based on the repartitioned fault domains. This reduces the probability that performance degrades and services cannot run normally because a fault domain failure is not handled in time.
In a specific implementation, before the hard disks in the fault domains that have not failed are repartitioned to obtain at least one new fault domain, it is necessary to determine that the hard disk parameter values in at least one fault domain in the distributed storage cluster fall outside the preset range of hard disk parameter values, or to determine that the failure rate of at least one fault domain in the distributed storage cluster reaches the first preset value, where the failure rate may be determined in the following manner:
obtaining operation indexes of at least one fault domain of the distributed storage cluster in the running state, and determining the failure rate of the at least one fault domain according to the operation indexes.
In the embodiment of the invention, it is determined whether the failure rate of at least one fault domain in the distributed storage cluster reaches the first preset value. When the failure rate of a fault domain reaches the first preset value, that fault domain is determined to have failed and the distributed storage cluster is repartitioned, which reduces the probability that performance degrades and services cannot run normally because a fault domain failure is not handled in time.
In a specific implementation, when the hard disks in the fault domains that have not failed are repartitioned to obtain at least one new fault domain, the operation parameters of at least one hard disk in those fault domains are obtained first, and the hard disks in those fault domains are then repartitioned according to the initial fault domain division rule and the operation parameters to obtain at least one new fault domain.
In the embodiment of the invention, after a fault domain fails, the hard disks in the fault domains that have not failed are repartitioned based on the initial fault domain division rule and the operation parameters to obtain a new fault domain, so that the new fault domain division of the distributed storage cluster is more reasonable.
In a specific implementation, before the data fragments of the failed fault domain are recovered and rebalanced on the reselected hard disks, it is further necessary to determine that the traffic volume running in the distributed storage cluster is smaller than a second preset value.
In the embodiment of the invention, recovery and rebalancing are carried out only when the traffic volume running in the distributed storage cluster is smaller than the second preset value. Recovery therefore does not compete with heavy service traffic, which reduces stalling of the distributed storage cluster during operation and improves the user experience.
In a second aspect, an apparatus for data protection comprises at least one processing unit and at least one storage unit, wherein the storage unit stores program code that, when executed by the processing unit, causes the processing unit to:
repartition the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain, reselect, in the new fault domain, the hard disks over which data fragments are distributed according to an initial data fragment distribution rule, and recover and rebalance the data fragments of the failed fault domain on the reselected hard disks.
In a specific implementation, the processing unit is further configured to:
before the hard disks in the fault domains that have not failed are repartitioned to obtain at least one new fault domain, determine that the hard disk parameter values in at least one fault domain in the distributed storage cluster fall outside the preset range of hard disk parameter values, or
determine that the failure rate of at least one fault domain in the distributed storage cluster reaches a first preset value.
In a specific implementation, the processing unit is specifically configured to:
obtain operation indexes of at least one fault domain of the distributed storage cluster in the running state, and determine the failure rate of the at least one fault domain according to the operation indexes.
In a specific implementation, the processing unit is further configured to:
obtain the operation parameters of at least one hard disk in the fault domains that have not failed, and repartition the hard disks in those fault domains according to the initial fault domain division rule and the operation parameters to obtain at least one new fault domain.
In a specific implementation, the processing unit is further configured to:
before the data fragments of the failed fault domain are recovered and rebalanced on the reselected hard disks, determine that the traffic volume running in the distributed storage cluster is smaller than a second preset value.
The technical effects of any implementation of the second aspect can be found in the technical effects of the corresponding implementation of the first aspect, and are not described here again.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram illustrating a method for data protection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of initially partitioning fault domains of a distributed storage cluster according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of repartitioning a fault domain of a distributed storage cluster according to an embodiment of the present invention;
FIG. 4 is a flowchart of a complete method for data protection according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an apparatus for data protection according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of another data protection device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data protection method and device, which can be used in application scenarios where data is stored in a distributed storage cluster. A distributed storage cluster refers to a large number of PC servers interconnected through a network that provide services to the outside as a whole, and it has the following characteristics:
(1) Scalability: a distributed storage cluster system can scale to hundreds or even thousands of nodes, and the overall performance of the system grows linearly with cluster size.
(2) Low cost: the automatic fault tolerance and automatic load balancing of a distributed storage cluster allow it to be built on low-cost servers, and linear scalability also allows servers to be added or removed to control cost.
(3) High performance: the distributed storage cluster is required to deliver high performance, both on a single server and across the entire cluster.
(4) Ease of use: the distributed storage cluster needs to provide a convenient, easy-to-use external interface, needs complete monitoring and operation and maintenance tools, and must integrate conveniently with other systems.
However, once the physical topology in the distributed storage cluster changes, the original fault domain division and the rule for distributing data fragments across fault domains can no longer meet the requirements, which degrades the performance of the distributed storage cluster and affects service operation.
In the embodiment of the invention, once a fault domain fails, the distributed storage cluster can promptly repartition the hard disks in the fault domains that have not failed to obtain a new fault domain, and then recover the lost data based on the repartitioned fault domains, thereby avoiding performance degradation of the distributed storage cluster and ensuring normal service operation.
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a method for data protection, where the method includes:
step 100, repartitioning the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain;
step 101, reselecting, in the new fault domain, the hard disks over which data fragments are distributed according to an initial data fragment distribution rule;
step 102, recovering and rebalancing the data fragments of the failed fault domain on the reselected hard disks.
In the embodiment of the invention, the distributed storage cluster can repartition the hard disks in the fault domains that have not failed to obtain at least one new fault domain, reselect the hard disks for data fragment distribution in the new fault domain according to the initial data fragment distribution rule, and perform recovery and rebalancing on the reselected hard disks.
First, the fault domains of the distributed storage cluster are divided with a reasonable initial fault domain division rule chosen according to the physical topology of the cluster, and hard disks are designated in the divided fault domains according to the initial data fragment distribution rule. The distributed storage cluster then operates on the basis of the divided fault domains.
Here, a fault domain refers to a logically isolated area in the distributed storage cluster. An internal fault in one area does not affect the other isolated areas (the other fault domains), so a fault domain may also be called an isolation domain.
The initial fault domain rule specifies how many fault domains the distributed storage cluster is to be divided into and which areas serve as fault domains.
For example, the initial fault domain rule may be that two servers in the distributed storage cluster form the fault domain area, and each of the two servers is one fault domain.
For another example, the initial fault domain rule may be to treat all servers in one machine room of the distributed storage cluster as one fault domain.
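The two example rules above can be sketched as a grouping of hard disks by a topology level. This is only an illustrative sketch: the inventory layout, the field names `server` and `room`, and the function `partition_fault_domains` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def partition_fault_domains(disks, level):
    """Group disk ids into fault domains keyed by the chosen topology level."""
    domains = defaultdict(list)
    for disk in disks:
        domains[disk[level]].append(disk["id"])
    return dict(domains)


# Hypothetical inventory: two rooms, three servers, two disks per server.
inventory = [
    {"id": 1, "server": "server-1", "room": "room-A"},
    {"id": 2, "server": "server-1", "room": "room-A"},
    {"id": 3, "server": "server-2", "room": "room-A"},
    {"id": 4, "server": "server-2", "room": "room-A"},
    {"id": 5, "server": "server-3", "room": "room-B"},
    {"id": 6, "server": "server-3", "room": "room-B"},
]

by_server = partition_fault_domains(inventory, "server")  # one fault domain per server
by_room = partition_fault_domains(inventory, "room")      # one fault domain per machine room
print(by_server)
print(by_room)
```

Choosing a coarser level (room instead of server) yields fewer, larger fault domains, which tolerates larger failures but leaves fewer places to put independent data fragments.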
During operation, operation indexes of at least one fault domain of the distributed storage cluster in the running state are obtained based on the usage of the distributed storage cluster, and the failure rate of the at least one fault domain is then determined from the obtained operation indexes.
The failure rate of a fault domain refers to the probability that the fault domain fails.
The operation index of a fault domain is a logical concept: it refers specifically to the operation indexes of the hard disks in that fault domain, and may be some or all of the following parameters:
the utilization rate of the hard disk, the read-write speed, the annual failure probability of the hard disk, the temperature, the average erase count, the read error count, the rotational speed of the hard disk, the power-on time of the hard disk, and the like.
In one optional embodiment of determining the failure rate of at least one fault domain from the obtained operation indexes, the operation indexes are used to obtain the annual failure probability of the hard disks in the fault domain, and that annual failure probability is used as the failure rate.
For example, the annual failure probability of a hard disk is calculated according to the following formula:
In the above formula, AFR is the annual failure probability of the hard disk, and MTBF is the mean time between failures, which can be provided by the hard disk manufacturer.
It should be noted that the AFR of other hardware (such as a CPU or a fan) can also be obtained by applying the above formula. To simplify the description, the embodiment of the present invention considers only hard disk damage when determining the failure rate of a fault domain.
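The formula itself is not reproduced in this text. For orientation only, a commonly used relation between AFR and MTBF is given below; it assumes exponentially distributed lifetimes and an MTBF expressed in hours (8760 hours per year), and it may differ from the formula in the patent.

```latex
\mathrm{AFR} \;=\; 1 - \exp\!\left(-\frac{8760}{\mathrm{MTBF}}\right)
\;\approx\; \frac{8760}{\mathrm{MTBF}}
\qquad (\mathrm{MTBF} \gg 8760\ \text{hours})
```

Under this assumption, a hard disk with an MTBF of 1,000,000 hours has an AFR of roughly 0.87%.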
Correspondingly, the probability of hard disk failures within a fixed period of time is obtained according to the Poisson distribution, specifically as follows:
Figure BDA0001737734850000072
where n represents the number of failed hard disks within time t, λ represents the expected number of hard disk failures per unit time, and t represents the preset time. λ can be obtained from the annual failure probability AFR, for example:
Figure BDA0001737734850000073
where N represents the number of hard disks in the entire storage cluster.
Therefore, the probability that more than a given number of hard disks are damaged within a given period can be obtained by the following formula:
Figure BDA0001737734850000074
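The formula images are not reproduced in this text. As a reference point, the textbook Poisson form that matches the symbol definitions above is sketched below. The threshold k and the suggested choice of λ are assumptions made here for illustration and are not taken from the patent.

```latex
% Probability that exactly n hard disks fail within time t
P(n, t) \;=\; \frac{(\lambda t)^{n}}{n!}\, e^{-\lambda t}

% Probability that more than k hard disks fail within time t
P(n > k, t) \;=\; 1 - \sum_{i=0}^{k} \frac{(\lambda t)^{i}}{i!}\, e^{-\lambda t}
```

One plausible choice, stated purely as an assumption, is λ = N · AFR failures per year with t measured in years, so that λt is the expected number of failed hard disks in the cluster over the period.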
It should be noted that using the annual failure probability of the hard disk as the failure rate is only an example. Using other parameters as the failure rate in a specific implementation, for example the hard disk utilization rate, also falls within the protection scope of the embodiment of the present invention.
After the failure rate of at least one fault domain in the distributed storage cluster is obtained, it is judged whether that failure rate reaches the first preset value. If it does, the fault domain is considered to have failed.
For example, the first preset value is set to 10, and a first fault domain, a second fault domain, and a third fault domain exist in the distributed storage cluster. During operation, operation indexes of the three fault domains in the running state are obtained based on the usage of the distributed storage cluster, and from them the failure rate of the first fault domain is determined to be 5, that of the second fault domain 5, and that of the third fault domain 11. Since the failure rate of the third fault domain exceeds the first preset value of 10, the third fault domain is determined to have failed.
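The threshold check in this example can be sketched in a few lines. The function name `failed_fault_domains` and the domain labels are hypothetical; the comparison uses "greater than or equal to" on the assumption that "reaches" includes equality.

```python
def failed_fault_domains(failure_rates, first_preset_value):
    """Return the fault domains whose failure rate reaches the first preset value."""
    return [name for name, rate in failure_rates.items() if rate >= first_preset_value]


# Failure rates from the example above (units as in the text).
failure_rates = {"fault-domain-1": 5, "fault-domain-2": 5, "fault-domain-3": 11}
failed = failed_fault_domains(failure_rates, first_preset_value=10)
print(failed)  # ['fault-domain-3']
```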
Optionally, during operation, hard disk parameter values in at least one fault domain in the distributed storage cluster may also be obtained based on the usage of the distributed storage cluster, and it is judged whether the hard disk parameter values fall within the preset range of hard disk parameter values. When the hard disk parameter values are not within the preset range, it is determined that at least one fault domain in the distributed storage cluster has failed.
For example, for some specific hard disk parameters, a non-zero value indicates a problem, and the larger the value, the greater the possibility of failure; when the obtained value exceeds the preset range, the fault domain containing that hard disk is considered to have failed. Other parameters describe the state of the hard disk: if the obtained value indicates that the hard disk is offline, the hard disk is considered damaged, and when the proportion of damaged hard disks in one fault domain exceeds a preset range, such as 50%, that fault domain is considered to have failed.
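The offline-ratio criterion can be sketched as follows. The state strings `"online"` and `"offline"`, the function name, and the strict "greater than" comparison are assumptions made for illustration; the 50% default mirrors the example in the text.

```python
def domain_has_failed(disk_states, max_offline_ratio=0.5):
    """Treat a fault domain as failed when its share of offline disks exceeds the limit."""
    if not disk_states:
        raise ValueError("a fault domain must contain at least one hard disk")
    offline = sum(1 for state in disk_states if state == "offline")
    return offline / len(disk_states) > max_offline_ratio


# Three of four disks offline: 75% exceeds the 50% limit.
print(domain_has_failed(["online", "offline", "offline", "offline"]))  # True
# Two of four disks offline: exactly 50% does not exceed the limit.
print(domain_has_failed(["online", "online", "offline", "offline"]))   # False
```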
It should be noted that a failure can also be recognized from its symptoms in the distributed storage cluster. For example, when a hard disk is damaged, the processes built on that hard disk cannot run normally, the cluster state becomes abnormal, and the number of normal processes decreases; in this case the fault domain containing the damaged hard disk is determined to have failed.
After it is determined that at least one fault domain in the distributed storage cluster has failed, the operation parameters of at least one hard disk in the fault domains that have not failed are obtained, and the hard disks in those fault domains are then repartitioned according to the initial fault domain division rule and the operation parameters to obtain at least one new fault domain.
The operation parameter of a hard disk here refers to a parameter value of the hard disk in the running state, and may be some or all of the following parameters:
the utilization rate of the hard disk, the read-write speed, the annual failure probability of the hard disk, the temperature, the average erase count, the read error count, the rotational speed of the hard disk, the power-on time of the hard disk, and the like.
For example, the distributed storage cluster is divided into a first fault domain, a second fault domain, and a third fault domain according to the initial fault domain division rule. During operation the third fault domain fails, and the utilization rate of at least one hard disk in each of the first and second fault domains is obtained: the utilization rate of the hard disks in the first fault domain is 50%, and that of the hard disks in the second fault domain is 80%.
Since the initial fault domain division rule divides the cluster into 3 fault domains, the hard disks of the first fault domain and the second fault domain together need to be divided into 3 fault domains.
For example, one new fault domain is obtained by combining part of the hard disks in the first fault domain with part of the hard disks in the second fault domain, and the remaining hard disks in the first fault domain and the remaining hard disks in the second fault domain each serve as one of the other two new fault domains.
For another example, the distributed storage cluster is divided into a first fault domain and a second fault domain according to the initial fault domain division rule. During operation the second fault domain is found to have failed, and the annual failure probability of the hard disks in the first fault domain is obtained: two hard disks in the first fault domain have an annual failure probability of 50%, and the remaining 4 hard disks have an annual failure probability of 80%.
Since the initial fault domain division rule divides the cluster into 2 fault domains, the hard disks in the first fault domain need to be repartitioned into 2 fault domains. One hard disk with a 50% annual failure probability can be combined with 2 hard disks with an 80% annual failure probability to obtain one new fault domain, and the remaining hard disk with a 50% annual failure probability can be combined with the remaining 2 hard disks with an 80% annual failure probability to obtain the other new fault domain.
It should be noted that the above manner of repartitioning the hard disks in the fault domains that have not failed according to the initial fault domain division rule and the operation parameters is only an example, and other manners of dividing and combining are also within the protection scope of the embodiment of the present invention.
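One way to arrive at the balanced split in the second example is to sort the surviving hard disks by annual failure probability and deal them out round-robin. This is a minimal sketch of one possible strategy, not the method prescribed by the patent; the function name `repartition_by_afr` and the disk labels are hypothetical.

```python
def repartition_by_afr(disk_afr, num_domains):
    """Spread disks over num_domains so each domain gets a similar mix of AFR values."""
    # Sort by AFR (ties broken by disk id) and deal the disks out round-robin.
    ordered = sorted(disk_afr, key=lambda disk: (disk_afr[disk], disk))
    domains = [[] for _ in range(num_domains)]
    for index, disk in enumerate(ordered):
        domains[index % num_domains].append(disk)
    return domains


# Six surviving disks: two with 50% AFR and four with 80% AFR, as in the example.
disk_afr = {"d1": 0.5, "d2": 0.5, "d3": 0.8, "d4": 0.8, "d5": 0.8, "d6": 0.8}
new_domains = repartition_by_afr(disk_afr, num_domains=2)
print(new_domains)  # [['d1', 'd3', 'd5'], ['d2', 'd4', 'd6']]
```

Each resulting fault domain holds one 50% disk and two 80% disks, which matches the combination described in the text.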
After at least one new fault domain is obtained, the hard disks over which data fragments are distributed are reselected in the new fault domain according to the initial data fragment distribution rule.
The initial data fragment distribution rule refers to the number of hard disks selected in each divided fault domain after the distributed storage cluster is created.
For example, if the initial data fragment distribution rule is to select one hard disk in each fault domain, then after repartitioning one hard disk is still selected in each new fault domain.
Correspondingly, after it is determined that the traffic volume running in the distributed storage cluster is smaller than the second preset value, the data fragments of the failed fault domain are recovered and rebalanced on the reselected hard disks until the distributed storage cluster returns to normal.
The running traffic volume may be the volume of service traffic handled by the distributed storage cluster at run time. The specific values of the first preset value and the second preset value may be set according to actual needs, which is not limited in the embodiment of the present invention.
Embodiments of the invention are described in detail below with reference to the accompanying drawings.
Suppose one distributed storage cluster has 3 servers, each with 4 hard disks. Each server includes hardware modules such as a CPU (Central Processing Unit), memory, a network card, a fan, and a power lamp. The corresponding hardware indexes may be the read error count of the hard disk, the power-on time of the hard disk, the throughput of the hard disk (bandwidth/IOPS), the network bandwidth, and the like; other indexes may be response time, request error rate, cluster operating state, process operating state, and the like. The indexes that can be obtained from the hardware can be obtained by prior-art methods, which are not described again in the embodiments of the present invention.
When the distributed storage cluster is created, the fault domains and the data fragment distribution rule are specified. In this embodiment each server is one fault domain, giving 3 fault domains in total, and hard disks are then selected in each fault domain based on the data fragment distribution rule, with the specific effect shown in FIG. 2.
In the embodiment of the invention, the distributed storage cluster adopts a three-copy data redundancy strategy, so 3 copies of the data need to be written into the distributed storage cluster. The 3 copies are placed on the 3 servers, and each server stores 1 copy.
During operation, based on the usage of the distributed storage cluster, the operation indexes of the 3 fault domains in the running state are obtained, and the failure rates of the 3 fault domains are then determined from the obtained operation indexes. In one optional embodiment, the operation indexes are used to obtain the annual failure probability of the hard disks in each fault domain, and that annual failure probability is used as the failure rate.
Correspondingly, suppose the annual failure probability of the hard disks of server 2, obtained through the above formula, reaches the first preset value. Server 2 is then considered to have failed, and only 2 fault domains remain in the distributed storage cluster. The operation parameters of the hard disks in the fault domains that have not failed are obtained in order to repartition the fault domains. For example, two hard disks in server 1 serve as new fault domain 1, the two remaining hard disks in server 1 are combined with two hard disks in server 3 to serve as new fault domain 2, and the two remaining hard disks in server 3 serve as new fault domain 3, with the specific effect shown in FIG. 3.
It should be noted that the above manner of combining hard disks is only an example, and other manners of dividing and combining are also within the protection scope of the embodiments of the present invention.
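The regrouping of the two surviving servers into three new fault domains can be sketched as below. The disk labels such as `s1-d1` and the function `regroup_two_survivors` are hypothetical, and splitting each server's disks in half is just one reading of the example.

```python
def regroup_two_survivors(disks_a, disks_b):
    """Build three fault domains from the disks of the two surviving servers."""
    half_a = len(disks_a) // 2
    half_b = len(disks_b) // 2
    return {
        "new-fault-domain-1": disks_a[:half_a],
        # The middle domain mixes the remaining disks of one server with disks of the other.
        "new-fault-domain-2": disks_a[half_a:] + disks_b[:half_b],
        "new-fault-domain-3": disks_b[half_b:],
    }


server_1 = ["s1-d1", "s1-d2", "s1-d3", "s1-d4"]
server_3 = ["s3-d1", "s3-d2", "s3-d3", "s3-d4"]
regrouped = regroup_two_survivors(server_1, server_3)
for name, disks in regrouped.items():
    print(name, disks)
```

Every surviving disk lands in exactly one new fault domain, so the three-copy rule can again place one copy per fault domain.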
Correspondingly, hard disks are selected in the 3 new fault domains based on the initial data fragment distribution rule, and one data copy is placed in each new fault domain. When the traffic volume running in the distributed storage cluster is smaller than the second preset value, the data fragments of the failed fault domain are recovered and rebalanced on the reselected hard disks.
For example, if traffic in the early morning is smaller than the configured second preset value, the recovery and rebalancing of the data fragments of the failed fault domain on the reselected hard disks is scheduled for the early morning.
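Gating recovery on the traffic volume can be sketched as a polling loop. The names `wait_for_low_traffic` and `read_traffic`, the polling interval, and the sample values are hypothetical; the sleep function is injected so the example runs without real delays.

```python
import time


def wait_for_low_traffic(read_traffic, second_preset_value, poll_seconds=60.0, sleep=time.sleep):
    """Block until the measured traffic drops below the second preset value.

    Returns the number of polls that found the traffic still too high.
    """
    polls = 0
    while read_traffic() >= second_preset_value:
        polls += 1
        sleep(poll_seconds)
    return polls


# Hypothetical traffic samples: busy, busy, then quiet enough to start recovery.
samples = iter([900, 700, 300])
waited = wait_for_low_traffic(
    lambda: next(samples),
    second_preset_value=500,
    sleep=lambda _seconds: None,  # skip real sleeping in this demonstration
)
print(waited)  # 2
```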
One optional way of recovering and rebalancing the data fragments of the failed fault domain onto the reselected hard disks is as follows:
For example, three data copies are stored in the hard disk group {1, 3, 5}: the copies are placed on hard disk 1, hard disk 3, and hard disk 5, which belong to three different fault domains. Hard disk 5 then becomes unusable. After the fault domains are re-divided, hard disk 7 is selected in a new fault domain to replace hard disk 5, so the hard disk group storing the data copies becomes {1, 3, 7}. Hard disk 1 and hard disk 3 each hold complete data, while hard disk 7 holds none. Complete data therefore has to be written to hard disk 7 from the data on hard disk 1 and hard disk 3 so that three data copies exist again.
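The {1, 3, 5} to {1, 3, 7} example can be sketched in a few lines. The in-memory `disks` dictionary, the function name `rebuild_replica`, and the consistency check between the healthy copies are assumptions for illustration only.

```python
def rebuild_replica(disks, replica_group, failed_disk, replacement_disk):
    """Replace `failed_disk` in the group and copy complete data to the new disk.

    `disks` maps a disk id to a dict of object name -> bytes (hypothetical model).
    """
    healthy = [d for d in replica_group if d != failed_disk]
    if not healthy:
        raise ValueError("no healthy copy left to rebuild from")

    source = healthy[0]
    # Assumed sanity check: all healthy copies must agree before rebuilding.
    if any(disks[d] != disks[source] for d in healthy):
        raise ValueError("healthy copies disagree")

    disks[replacement_disk] = dict(disks[source])
    return [replacement_disk if d == failed_disk else d for d in replica_group]


disks = {
    1: {"obj": b"payload"},
    3: {"obj": b"payload"},
    5: {"obj": b"payload"},  # becomes unusable
    7: {},                   # newly selected disk, initially empty
}
new_group = rebuild_replica(disks, [1, 3, 5], failed_disk=5, replacement_disk=7)
```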
Fig. 4 is a flowchart of a complete data protection method according to an embodiment of the present invention:
step 400, dividing the distributed storage cluster into initial fault domains, and specifying a data fragment distribution rule;
step 401, judging whether at least one fault domain in the distributed storage cluster has failed; if so, executing step 402; otherwise, the distributed storage cluster is normal;
step 402, re-dividing the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain;
step 403, reselecting, in the new fault domain, the hard disks for data fragment distribution according to the initial data fragment distribution rule;
step 404, judging whether the traffic volume running in the distributed storage cluster is smaller than a second preset value; if so, executing step 405; otherwise, executing step 406;
step 405, recovering and rebalancing the data fragments of the failed fault domain onto the reselected hard disks;
step 406, continuing to wait until the traffic volume is smaller than the second preset value, and then executing step 405;
step 407, the distributed storage cluster returns to normal.
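Steps 401 to 407 can be tied together in one small sketch. Every name here (`protect`, the domain and disk labels, the round-robin re-division, and the "first disk of each new domain" selection rule) is a hypothetical stand-in; the embodiment leaves the concrete division and distribution rules open.

```python
def protect(domains, failed, traffic_samples, second_preset_value):
    """Walk through steps 401-407 and return (new_domains, log)."""
    log = []
    if not failed:  # step 401: nothing has failed
        return list(domains.values()), ["cluster normal"]

    # Step 402: pool the disks of the surviving domains and re-divide them
    # round-robin into the original number of fault domains (assumed rule).
    surviving = [disk for name, disks in domains.items()
                 if name not in failed for disk in disks]
    count = len(domains)
    if len(surviving) < count:
        raise ValueError("not enough surviving disks to rebuild all domains")
    new_domains = [surviving[i::count] for i in range(count)]

    # Step 403: assumed placement rule, one disk (the first) per new domain.
    selected = [domain[0] for domain in new_domains]

    for traffic in traffic_samples:  # steps 404 and 406
        if traffic < second_preset_value:
            log.append(f"recover onto {selected}")  # step 405
            log.append("cluster normal")            # step 407
            return new_domains, log
        log.append("wait")
    log.append("still waiting")
    return new_domains, log


domains = {
    "fd1": ["s1d1", "s1d2", "s1d3", "s1d4"],
    "fd2": ["s2d1", "s2d2", "s2d3", "s2d4"],
    "fd3": ["s3d1", "s3d2", "s3d3", "s3d4"],
}
new_domains, log = protect(domains, {"fd2"}, [500, 300, 50], 100)
```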
As shown in fig. 5, a data protection device according to an embodiment of the present invention includes at least one processing unit 500 and at least one storage unit 501, wherein the storage unit stores program code which, when executed by the processing unit 500, causes the processing unit 500 to perform the following process:
re-dividing the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain; reselecting, in the new fault domain, the hard disks for data fragment distribution according to the initial data fragment distribution rule; and recovering and rebalancing the data fragments of the failed fault domain onto the reselected hard disks.
Optionally, the processing unit 500 is further configured to:
before re-dividing the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain, determining that the hard disk parameter values in at least one fault domain in the distributed storage cluster fall outside the preset range of hard disk parameter values, or
determining that the failure rate of at least one fault domain in the distributed storage cluster reaches a first preset value.
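The two alternative trigger conditions can be expressed as a single predicate. The following is a rough sketch only: the function name `domain_has_failed`, the parameter name `temperature_c`, and the numeric ranges are hypothetical, and the failure rate is assumed to have been computed elsewhere from the operation indexes.

```python
def domain_has_failed(disk_params, param_ranges, failure_rate, first_preset_value):
    """Return True if either trigger condition holds for a fault domain.

    disk_params: one dict of parameter name -> value per hard disk.
    param_ranges: parameter name -> (low, high) preset range.
    """
    out_of_range = any(
        not (param_ranges[name][0] <= value <= param_ranges[name][1])
        for params in disk_params
        for name, value in params.items()
        if name in param_ranges
    )
    # "Reaches" the first preset value is read here as greater than or equal.
    return out_of_range or failure_rate >= first_preset_value


ranges = {"temperature_c": (0, 60)}
overheated = domain_has_failed(
    [{"temperature_c": 41}, {"temperature_c": 72}], ranges, 0.01, 0.05)
healthy = domain_has_failed(
    [{"temperature_c": 41}, {"temperature_c": 45}], ranges, 0.02, 0.05)
```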
Optionally, the processing unit 500 is specifically configured to:
obtain the operation indexes of at least one fault domain of the distributed storage cluster in the running state, and determine the failure rate of the at least one fault domain according to the operation indexes.
Optionally, the processing unit 500 is further specifically configured to:
re-divide, according to the initial fault domain division rule and the operation parameters, the hard disks in the fault domains that have not failed to obtain at least one new fault domain.
Optionally, the processing unit 500 is further configured to:
before the data fragments of the failed fault domain are recovered and rebalanced onto the reselected hard disks, determine that the traffic volume running in the distributed storage cluster is smaller than a second preset value.
As shown in fig. 6, a data protection device according to an embodiment of the present invention includes:
a dividing module 600, configured to re-divide the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain;
a selecting module 601, configured to reselect, in the new fault domain, the hard disks for data fragment distribution according to an initial data fragment distribution rule;
and a recovery module 602, configured to recover and rebalance the data fragments of the failed fault domain onto the reselected hard disks.
Optionally, the dividing module 600 is further configured to:
determine that the hard disk parameter values in at least one fault domain in the distributed storage cluster fall outside the preset range of hard disk parameter values, or
determine that the failure rate of at least one fault domain in the distributed storage cluster reaches a first preset value.
Optionally, the dividing module 600 is further configured to:
acquire the operation indexes of at least one fault domain of the distributed storage cluster in the running state;
determine the failure rate of the at least one fault domain based on the operation indexes.
Optionally, the dividing module 600 is specifically configured to:
acquire the operation parameters of at least one hard disk in the fault domains that have not failed;
and re-divide, according to the initial fault domain division rule and the operation parameters, the hard disks in the fault domains that have not failed to obtain at least one new fault domain.
Optionally, the recovery module 602 is further configured to:
determine that the traffic volume running in the distributed storage cluster is smaller than a second preset value.
In some possible embodiments, aspects of the data protection provided by the embodiments of the invention may also be implemented in the form of a program product comprising program code which, when run on a computer device, causes the computer device to perform the steps of the data protection methods according to the various exemplary embodiments of the invention described in this specification.
More specific examples (a non-exhaustive list) of the readable storage medium include an electrical connection having one or more wires, a portable diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A program product for data forwarding control according to an embodiment of the present invention may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a server device. However, the program product of the present invention is not limited thereto, and in this document, the readable storage medium may be any tangible medium containing or storing the program, which can be used by or in connection with an information transmission, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, or the like, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
An embodiment of the present invention further provides a computer-readable storage medium for the data protection method, that is, a storage medium whose contents are not lost after power is turned off. The storage medium stores a software program comprising program code which, when read and executed by one or more processors on a computing device, implements a data protection scheme according to an embodiment of the present invention.
It will be appreciated that blocks of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart illustrations.
This application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

  1. A method of data protection, the method comprising:
    re-dividing the hard disks in the fault domains that have not failed in a distributed storage cluster to obtain at least one new fault domain;
    reselecting, in the new fault domain, the hard disks for data fragment distribution according to an initial data fragment distribution rule;
    and recovering and rebalancing the data fragments of the failed fault domain onto the reselected hard disks.
  2. The method of claim 1, wherein before re-dividing the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain, the method further comprises:
    determining that the hard disk parameter values in at least one fault domain in the distributed storage cluster fall outside the preset range of hard disk parameter values, or
    determining that the failure rate of at least one fault domain in the distributed storage cluster reaches a first preset value.
  3. The method of claim 2, wherein the failure rate is determined by:
    acquiring the operation indexes of at least one fault domain of the distributed storage cluster in the running state;
    determining the failure rate of the at least one fault domain based on the operation indexes.
  4. The method of claim 1, wherein re-dividing the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain comprises:
    acquiring the operation parameters of at least one hard disk in the fault domains that have not failed;
    and re-dividing, according to the initial fault domain division rule and the operation parameters, the hard disks in the fault domains that have not failed to obtain at least one new fault domain.
  5. The method of claim 1, wherein before recovering and rebalancing the data fragments of the failed fault domain onto the reselected hard disks, the method further comprises:
    determining that the traffic volume running in the distributed storage cluster is smaller than a second preset value.
  6. An apparatus for data protection, comprising at least one processing unit and at least one storage unit, wherein the storage unit stores program code that, when executed by the processing unit, causes the processing unit to perform the following:
    re-dividing the hard disks in the fault domains that have not failed in a distributed storage cluster to obtain at least one new fault domain; reselecting, in the new fault domain, the hard disks for data fragment distribution according to an initial data fragment distribution rule; and recovering and rebalancing the data fragments of the failed fault domain onto the reselected hard disks.
  7. The device of claim 6, wherein the processing unit is further configured to:
    before re-dividing the hard disks in the fault domains that have not failed in the distributed storage cluster to obtain at least one new fault domain, determine that the hard disk parameter values in at least one fault domain in the distributed storage cluster fall outside the preset range of hard disk parameter values, or
    determine that the failure rate of at least one fault domain in the distributed storage cluster reaches a first preset value.
  8. The device of claim 7, wherein the processing unit is specifically configured to:
    obtain the operation indexes of at least one fault domain of the distributed storage cluster in the running state, and determine the failure rate of the at least one fault domain according to the operation indexes.
  9. The device of claim 6, wherein the processing unit is further configured to:
    re-divide, according to the initial fault domain division rule and the operation parameters, the hard disks in the fault domains that have not failed to obtain at least one new fault domain.
  10. The device of claim 6, wherein the processing unit is further configured to:
    before the data fragments of the failed fault domain are recovered and rebalanced onto the reselected hard disks, determine that the traffic volume running in the distributed storage cluster is smaller than a second preset value.
CN201810803843.3A 2018-07-20 2018-07-20 Data protection method and equipment Active CN110737924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810803843.3A CN110737924B (en) 2018-07-20 2018-07-20 Data protection method and equipment

Publications (2)

Publication Number Publication Date
CN110737924A (en) 2020-01-31
CN110737924B (en) 2021-07-27

Family

ID=69234265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810803843.3A Active CN110737924B (en) 2018-07-20 2018-07-20 Data protection method and equipment

Country Status (1)

Country Link
CN (1) CN110737924B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111880748A (en) * 2020-07-30 2020-11-03 北京计算机技术及应用研究所 Wear balancing method for solid state disk of distributed storage system
CN112000286A (en) * 2020-08-13 2020-11-27 北京浪潮数据技术有限公司 Four-control full-flash-memory storage system and fault processing method and device thereof
CN114692229A (en) * 2022-03-30 2022-07-01 中国电信股份有限公司 Hard disk unauthorized access detection method and device, computer equipment and storage medium
CN115250225A (en) * 2022-07-25 2022-10-28 济南浪潮数据技术有限公司 Network health monitoring method, device and medium based on fault domain detection
WO2023169185A1 (en) * 2022-03-10 2023-09-14 华为技术有限公司 Memory management method and device
WO2024213013A1 (en) * 2023-04-10 2024-10-17 阿里云计算有限公司 Device scheduling method and system, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103095804A (en) * 2011-12-13 2013-05-08 微软公司 Load Balancing In Cluster Storage Systems
CN103984607A (en) * 2013-02-08 2014-08-13 华为技术有限公司 Distributed storage method, device and system
CN104583930A (en) * 2014-08-15 2015-04-29 华为技术有限公司 Method of data migration, controller and data migration apparatus
CN107943421A (en) * 2017-11-30 2018-04-20 成都华为技术有限公司 A kind of subregion partitioning method and device based on distributed memory system
CN108153622A (en) * 2016-12-06 2018-06-12 华为技术有限公司 The method, apparatus and equipment of a kind of troubleshooting
CN108153491A (en) * 2017-12-22 2018-06-12 深圳市瑞驰信息技术有限公司 A kind of storage method and framework for closing part server

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111880748A (en) * 2020-07-30 2020-11-03 北京计算机技术及应用研究所 Wear balancing method for solid state disk of distributed storage system
CN111880748B (en) * 2020-07-30 2023-10-31 北京计算机技术及应用研究所 Solid state disk wear balancing method for distributed storage system
CN112000286A (en) * 2020-08-13 2020-11-27 北京浪潮数据技术有限公司 Four-control full-flash-memory storage system and fault processing method and device thereof
WO2023169185A1 (en) * 2022-03-10 2023-09-14 华为技术有限公司 Memory management method and device
CN114692229A (en) * 2022-03-30 2022-07-01 中国电信股份有限公司 Hard disk unauthorized access detection method and device, computer equipment and storage medium
CN114692229B (en) * 2022-03-30 2023-11-10 中国电信股份有限公司 Hard disk unauthorized access detection method, device, computer equipment and storage medium
CN115250225A (en) * 2022-07-25 2022-10-28 济南浪潮数据技术有限公司 Network health monitoring method, device and medium based on fault domain detection
WO2024213013A1 (en) * 2023-04-10 2024-10-17 阿里云计算有限公司 Device scheduling method and system, electronic device, and storage medium

Also Published As

Publication number Publication date
CN110737924B (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN110737924B (en) Data protection method and equipment
US8954545B2 (en) Fast determination of compatibility of virtual machines and hosts
US10175973B2 (en) Microcode upgrade in a storage system
CN108153849B (en) Database table segmentation method, device, system and medium
US10229023B2 (en) Recovery of storage device in a redundant array of independent disk (RAID) or RAID-like array
US10255124B1 (en) Determining abnormal conditions of host state from log files through Markov modeling
CN111104051B (en) Method, apparatus and computer program product for managing a storage system
CN113051104B (en) Method and related device for recovering data between disks based on erasure codes
CN108540315A (en) Distributed memory system, method and apparatus
US20160342358A1 (en) Storage control apparatus and storage system
CN112748862B (en) Method, electronic device and computer program product for managing a disk
US10977130B2 (en) Method, apparatus and computer program product for managing raid storage in data storage systems
CN111045853A (en) Method and device for improving erasure code recovery speed and background server
AU2021269916B2 (en) Write sort management in data storage system
CN113391937B (en) Method, electronic device and computer program product for storage management
US10915405B2 (en) Methods for handling storage element failures to reduce storage device failure rates and devices thereof
CN116360680A (en) Method and system for performing copy recovery operations in a storage system
US20180295195A1 (en) Method and apparatus for performing storage space management for multiple virtual machines
US11580022B2 (en) Write sort management in a multiple storage controller data storage system
CN112748860B (en) Method, electronic device and computer program product for storage management
US10725879B2 (en) Resource management apparatus, resource management method, and nonvolatile recording medium
JP7532882B2 (en) Fault determination device, fault determination method, and fault determination program
CN110781484B (en) Computing device, method of computing device, and article comprising storage medium
JP2023134170A (en) Storage medium management device, method for managing storage medium, and storage medium management program
CN118885337A (en) Fault reporting method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant