CN108647118B - Storage cluster-based copy exception recovery method and device and computer equipment - Google Patents


Info

Publication number
CN108647118B
Authority
CN
China
Prior art keywords
copy
storage cluster
copies
data
exception
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810460225.3A
Other languages
Chinese (zh)
Other versions
CN108647118A (en)
Inventor
刘浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Technologies Co Ltd Chengdu Branch
Original Assignee
New H3C Technologies Co Ltd Chengdu Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Technologies Co Ltd Chengdu Branch
Priority to CN201810460225.3A
Publication of CN108647118A
Application granted
Publication of CN108647118B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14: Error detection or correction of the data by redundancy in operation
    • G06F11/1402: Saving, restoring, recovering or retrying
    • G06F11/1446: Point-in-time backing up or restoration of persistent data
    • G06F11/1448: Management of the data involved in backup or backup restore

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a storage cluster-based copy exception recovery method and device and computer equipment. The method comprises the following steps: acquiring each copy in the abnormal storage cluster; selecting a copy from the acquired copies according to a preset copy exception recovery strategy, and repairing the copy in the exception storage cluster according to the selected copy, wherein the copy exception recovery strategy comprises the following steps: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy. Therefore, the abnormal storage cluster can quickly recover and provide services, the time for user service interruption is shortened, and the availability of the storage cluster is improved.

Description

Storage cluster-based copy exception recovery method and device and computer equipment
Technical Field
The present application relates to the technical field of communication exception recovery, and in particular, to a method and an apparatus for recovering a copy exception based on a storage cluster, and a computer device.
Background
With the development of technologies such as big data and high-definition and 4K imaging, and with the rollout of video surveillance projects such as "Safe China", enterprise demand for storage system capacity and reliability is increasing rapidly. Because the storage system holds massive amounts of user data that is critical to the enterprise, ensuring high reliability of that data and high availability of the storage system is a focus of data storage research.
At present, cluster multi-copy technology is generally used to ensure high reliability of user data and high availability of the storage system. User data is stored on the host of a storage cluster to form a copy, and the host then synchronizes the copy containing the user data to each standby, so that the data stored on each host or standby is one copy. While the storage cluster is running, data to be synchronized is first written into the disk cache of the host by a data synchronization technique; the disk cache then writes the data into the host's copy to update it, and the host synchronizes the updated copy to the physical disk of each standby. This keeps the data in the copies of the storage cluster synchronized, so that when the host becomes abnormal, a standby can take over from it in time and the storage cluster service continues to run normally.
The existing host/standby mode of a storage cluster addresses the situation in which the host becomes abnormal and a standby takes over to maintain normal operation. If the whole storage cluster becomes abnormal, however, the data stored on the physical disks may be inconsistent, which leaves the storage cluster in an abnormal, unavailable state and lowers its availability. For example, while the disk cache is writing data to the physical disks, the write speeds of the individual physical disks differ. If the machine room suddenly loses power, the data in the disk cache is lost and the data already written to each physical disk in the storage cluster is inconsistent. Because the storage cluster notifies each physical disk that the data was written successfully (indicating that data synchronization is complete) as soon as the data has been written into the disk cache, after the power failure and restart the storage cluster performs data verification on the assumption that every physical disk was written successfully (since data synchronization is considered finished, the verification results are expected to be consistent). The verification results for the data written to each physical disk nevertheless differ, and the inconsistent data cannot be repaired by the recovery process of the storage cluster, so the storage cluster remains in an abnormal, unavailable state. At present, this exception can only be handled by manually finding and deleting the inconsistent data on each physical disk and then restarting the storage cluster to restore the environment that provides services externally. Manually comparing and searching for inconsistent data takes a large amount of time, so users cannot use the storage cluster for a long period, user services are affected, and the availability of the storage cluster is reduced.
Disclosure of Invention
In a first aspect, an embodiment of the present application provides a storage cluster-based copy exception recovery method, which is applied to a cluster including at least two cluster devices, where the at least two cluster devices include a host and a standby, and the method includes:
acquiring various copies of a host and a standby in the storage cluster with the exception;
selecting one copy from the acquired copies according to a preset copy exception recovery strategy, and repairing the copy in the storage cluster with the exception according to the selected copy, wherein the copy exception recovery strategy comprises the following steps: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy.
In a second aspect, an embodiment of the present application provides a storage cluster-based copy exception recovery apparatus, which is applied to a cluster including at least two cluster devices, where the at least two cluster devices include a host and a standby, and the apparatus includes: a copy acquisition module and a copy recovery module, wherein,
the copy acquisition module is used for acquiring the copies of the host and the standby machines in the storage cluster with the exception;
a copy recovery module, configured to select a copy from the acquired copies according to a preset copy exception recovery policy, and repair the copy in the storage cluster where the exception occurs according to the selected copy, where the copy exception recovery policy includes: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy.
In a third aspect, an embodiment of the present application provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, performs the steps of the method described above.
Compared with the prior art, the method has the following beneficial effects:
According to the storage cluster-based copy exception recovery method and device and the computer equipment provided by the embodiments of the present application, each copy in the abnormal storage cluster is acquired; one copy is selected from the acquired copies according to a preset copy exception recovery strategy, and the copies in the abnormal storage cluster are repaired according to the selected copy, wherein the copy exception recovery strategy comprises one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy. Therefore, the abnormal storage cluster can quickly recover and provide services, the time for user service interruption is shortened, and the availability of the storage cluster is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flowchart of a storage cluster-based copy exception recovery method according to an embodiment of the present application;
Fig. 2 is a first schematic flowchart of the sub-steps of step 102 shown in Fig. 1;
Fig. 3 is a second schematic flowchart of the sub-steps of step 102 shown in Fig. 1;
Fig. 4 is a third schematic flowchart of the sub-steps of step 102 shown in Fig. 1;
Fig. 5 is a schematic structural diagram of a storage cluster-based copy exception recovery apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is further noted that, unless expressly stated or limited otherwise, the terms "disposed," "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be a fixed connection, a removable connection, or an integral connection; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.
Fig. 1 is a schematic flowchart of a storage cluster-based copy exception recovery method according to an embodiment of the present application. The method is applied to a cluster comprising at least two cluster devices, wherein the at least two cluster devices comprise a host machine and a standby machine, and the cluster comprises but is not limited to: as shown in fig. 1, the process includes:
step 101, acquiring all copies of a host and a standby in the storage cluster with the exception;
in this embodiment, the copies of the host and the standby in the storage cluster in which the exception occurs are obtained. If there is data in the disk cache during power failure, copies written in each physical disk will be inconsistent, and thus, as an optional embodiment, obtaining each copy in the abnormal storage cluster includes:
restarting the storage cluster after a power failure is detected;
and, after detecting that the storage cluster has finished loading the copies, acquiring the copies of the host and the standby in the storage cluster respectively.
In this embodiment, in the operation process of the storage cluster, other abnormal or unexpected accidents may occur to cause inconsistency of the copies in the physical disks, and as another optional embodiment, acquiring each copy in the abnormal storage cluster may further include:
receiving preset abnormal information reported by an inspection tool, wherein the abnormal information is generated when the inspection tool finds that the copies are inconsistent in the operation process of the storage cluster;
and respectively acquiring the copies of the host and the standby in the storage cluster according to the received abnormal information.
In this embodiment, in the operation process of the storage cluster, an abnormal storage cluster is formed due to some abnormal or unexpected accidents, and when the copies are inconsistent, the abnormal storage cluster can be found by using the inspection tool, and abnormal information is generated and reported.
step 102, selecting a copy from the acquired copies according to a preset copy exception recovery strategy, and repairing the copy in the storage cluster with the exception according to the selected copy, wherein the copy exception recovery strategy comprises: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy.
When the storage cluster is abnormal and automatic repair cannot be completed according to a preset repair strategy, reducing data loss as far as possible and quickly restoring the storage cluster service environment so that users are served again is a key technical problem for improving the availability of the storage cluster and the user experience. In this embodiment, each copy in the abnormal storage cluster is acquired; a copy is selected from the acquired copies according to a preset copy exception recovery strategy, and the copies in the abnormal storage cluster are repaired according to the selected copy, wherein the copy exception recovery strategy comprises one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy. Thus, when the copies become inconsistent because of a power failure or a similar problem while the disk cache is not disabled, a copy can be selected according to the copy exception recovery strategy and used as the standard for automatically repairing the other copies. This avoids the time spent manually comparing and searching for inconsistent data, so the storage cluster can quickly recover and provide services, and its availability is improved. The copy exception recovery strategy also keeps the loss of copy data as small as possible, reduces manual involvement, repairs the service environment of the storage cluster quickly, and shortens the interruption of user services.
In this embodiment, as an optional embodiment, the copy exception recovery policy includes the majority consistent copy priority policy. In that case, selecting a copy from the acquired copies according to the preset copy exception recovery policy and repairing the copies in the abnormal storage cluster according to the selected copy, as shown in Fig. 2 (the first flowchart of the sub-steps of step 102 shown in Fig. 1), includes:
step 201, counting the number of copies that are consistent with one another;
in this embodiment, each time the storage cluster performs an update operation on a copy (for example, when data is written into the copy, data in the copy is updated, or a copy is added or deleted), unique version number information is set for the copy to identify the consistency of the data across the copies. For example, after one update the version number information is set to V1.1.1, and after the copy data of that version is updated again, the version number information is set to V1.1.2. Thus, as an optional embodiment, counting the number of consistent copies includes:
traversing each acquired copy, and sequentially extracting copy version number information;
and querying whether the extracted copy version number information has already been stored; if so, adding 1 to the copy number count value corresponding to the stored copy version number information; if not, constructing a mapping relation between the extracted copy version number information and a copy number count value, and setting the copy number count value mapped by the extracted copy version number information to 1.
In this embodiment, taking an example that the storage cluster includes three copies, extracting copy version number information of a first copy, constructing a first mapping relationship between the extracted copy version number information and a copy number count value, and setting an initial copy number count value to 1; then, extracting copy version number information of the second copy, inquiring the constructed first mapping relation, and if the first mapping relation contains the copy version number information of the second copy, adding 1 to the copy number count value in the first mapping relation; and finally, extracting copy version number information of the third copy, inquiring the constructed first mapping relation, if the first mapping relation does not contain the copy version number information of the third copy, constructing a second mapping relation between the extracted copy version number information of the third copy and a copy number count value, and setting the initial copy number count value to be 1.
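The counting procedure above can be sketched in Python as follows. This is only an illustrative sketch: the function name `count_by_version` and the representation of version number information as plain strings are assumptions, not part of the application.

```python
from typing import Dict, List


def count_by_version(versions: List[str]) -> Dict[str, int]:
    """Count how many copies carry each version number (hypothetical helper)."""
    counts: Dict[str, int] = {}
    for version in versions:          # traverse each acquired copy in turn
        if version in counts:         # version already stored: add 1 to its count
            counts[version] += 1
        else:                         # not stored yet: build a new mapping, count = 1
            counts[version] = 1
    return counts


# Three copies: the first two share a version, the third differs.
counts = count_by_version(["V1.1.2", "V1.1.2", "V1.1.1"])
```

With three copies as in the example, the first two copies end up in one mapping and the third copy gets a second mapping of its own.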
In this embodiment, as an optional embodiment, the copy version number information includes transaction version number information: each copy stores transaction version number information, and the latest transaction version number information is transmitted each time the copy is updated. As another optional embodiment, the transaction version number information increases monotonically, so that if two copies have the same transaction version number information, the two copies are consistent. As another optional embodiment, the copy version number information may further include the serial number information assigned after each data update.
In this embodiment, as an optional embodiment, the constructed mapping relationship may further include each copy mapped by the copy version number information, so that the mapping relationship is queried according to the copy version number information in the following steps to obtain the mapped copy.
Step 202, repairing the copy in the storage cluster with the exception according to the first copy set with the largest number of copies.
In this embodiment, when a majority of the copies in the storage cluster are in a consistent state, the majority is used as the standard; that is, under the majority consistent copy priority policy, the minority of inconsistent copies is recovered from the majority of consistent copies. This effectively reduces the resources and time required to synchronize the copies in the abnormal storage cluster, so the storage cluster can provide services to users sooner, which improves both the availability of the storage cluster and the service experience of the users.
In this embodiment, still taking a storage cluster with three copies as an example, after the storage cluster is restarted following a power failure, the version number information of the three copies is queried. If two copies have the same version number information, the two copies are consistent; the remaining, different copy is recovered using those two copies as the standard, so that all three copies finally become consistent.
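A minimal Python sketch of the three-copy example follows. The names `split_majority`, `host`, `standby1` and `standby2` are hypothetical, and a tie between equally large sets is not handled here (`max` simply keeps the first one it sees).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_majority(copy_versions: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (first copy set, copies to repair) under the majority policy."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, version in copy_versions.items():
        groups[version].append(name)  # group copies that share a version number
    # The first copy set is the group with the largest number of copies.
    majority_version = max(groups, key=lambda v: len(groups[v]))
    first_set = groups[majority_version]
    to_repair = [name for name in copy_versions if name not in first_set]
    return first_set, to_repair


first_set, to_repair = split_majority(
    {"host": "V1.1.2", "standby1": "V1.1.2", "standby2": "V1.1.1"}
)
```

Here the two copies at V1.1.2 form the standard, and only the single differing copy needs to be resynchronized.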
Several processes for repairing the abnormal copy in the storage cluster according to the first copy set with the largest number of copies are described in detail below.
First, even when copies have the same version number information, the data written to their physical disks before the exception may differ, because the disk cache writes to the physical disks at different rates. The copy priority policy can therefore be combined to repair the copies. As an optional embodiment, repairing the copies in the abnormal storage cluster according to the first copy set with the largest number of copies includes:
comparing the data of each copy in the first copy set;
if the data of each copy in the first copy set is the same, selecting one copy in the first copy set, and repairing the copies, except the first copy set, in the storage cluster with the abnormality according to the selected copy;
if the data of the copies in the first copy set are different, selecting a second copy with the highest priority according to the priority of the copies in the first copy set, and repairing the copies except the second copy in the storage cluster with the abnormality according to the second copy.
In this embodiment, when the data of the copies in the first copy set differ, for example in data content, a copy with the highest priority may be selected according to a preset copy priority, as an optional embodiment. For example, the priorities of the copies are compared, and the data of the master copy, which has the highest priority, is synchronized to the other copies to restore the differing data, so that every copy in the storage cluster reaches a state of data consistency.
In this embodiment, as an optional embodiment, it may be determined whether the data of each copy is the same by performing a cyclic redundancy check on the copies.
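The comparison and priority fallback described above can be sketched in Python with the standard `zlib.crc32` checksum. The `Copy` class, the `pick_reference` function, and the convention that a larger integer means a higher priority are assumptions made for illustration only.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Copy:
    name: str
    data: bytes
    priority: int  # assumption: a larger value means a higher priority


def pick_reference(first_set: List[Copy]) -> Copy:
    """Pick the copy used as the repair standard from the first copy set."""
    # Equal CRC values are treated as "same data" here; a CRC match is a
    # strong hint rather than a strict proof of byte-for-byte equality.
    checksums = {zlib.crc32(c.data) for c in first_set}
    if len(checksums) == 1:
        return first_set[0]                           # all the same: any copy will do
    return max(first_set, key=lambda c: c.priority)   # differ: highest priority wins


same = pick_reference([Copy("host", b"abc", 2), Copy("standby1", b"abc", 1)])
diff = pick_reference([Copy("standby1", b"abc", 1), Copy("standby2", b"abd", 5)])
```

When the checksums agree, the choice of copy does not matter; when they disagree, the sketch falls back to the copy priority, as in the second branch of the embodiment.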
In this embodiment, when the data of the copies differ, their data volumes may be the same or different because the data sizes differ. For example, suppose the first copy set includes four copies, copy 1 to copy 4, where the data volume of copy 1 is 2.5M, that of copy 2 is 2.5M, that of copy 3 is 2.55M, and that of copy 4 is 2.6M. Copy repair may then also be performed in combination with the copy data volume priority policy. As another optional embodiment, after it is determined that the data of the copies in the first copy set differ and before the second copy with the highest priority is selected according to the priority of each copy in the first copy set, the method further includes:
respectively acquiring the data volume of each copy of the different data;
if the obtained data volume is the same, executing the step of selecting a second copy with the highest priority according to the priority of each copy in the first copy set;
and if the acquired data volume is different, repairing the copies except the copy with the maximum data volume in the abnormal storage cluster according to the copy with the maximum data volume.
In this embodiment, since the data volumes of the four copies are different, and the copy with the largest data volume is copy 4, copy 1 to copy 3 are repaired according to copy 4.
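The four-copy example can be sketched in Python as below. The tuple layout `(name, data volume, priority)`, the function name, and the priority values are hypothetical; only the data volumes come from the example above.

```python
from typing import List, Tuple

CopyInfo = Tuple[str, float, int]  # (name, data volume in M, priority); assumed layout


def pick_by_volume_then_priority(copies: List[CopyInfo]) -> str:
    """Return the name of the copy used as the repair standard."""
    volumes = {volume for _, volume, _ in copies}
    if len(volumes) == 1:
        # Same data volume everywhere: fall back to the highest priority.
        return max(copies, key=lambda c: c[2])[0]
    # Data volumes differ: the copy with the largest data volume wins.
    return max(copies, key=lambda c: c[1])[0]


copies = [("copy1", 2.5, 4), ("copy2", 2.5, 3), ("copy3", 2.55, 2), ("copy4", 2.6, 1)]
reference = pick_by_volume_then_priority(copies)
```

In this sketch copy 4 is chosen even though it has the lowest assumed priority, because the data volume check takes precedence when the volumes differ.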
Second, even when copies have the same version number information, the amount of data written to their physical disks before the exception may differ, because the disk cache writes to the physical disks at different rates. The copy data volume priority policy can therefore be combined to repair the copies. As another optional embodiment, repairing the copies in the abnormal storage cluster according to the first copy set with the largest number of copies may further include:
acquiring the data volume of each copy in the first copy set;
if the data volume of each copy in the first copy set is the same, selecting one copy in the first copy set, and repairing the copies except the first copy set in the abnormal storage cluster according to the selected copy;
and if the data volume of each copy in the first copy set is different, repairing the copy except the copy with the maximum data volume in the abnormal storage cluster according to the copy with the maximum data volume.
In this embodiment, as an optional embodiment, the data amount may be a capacity (size) of data, or may be the number of pieces of data, and the present embodiment does not limit this. In this embodiment, the copy with the largest data amount is used to repair other copies, so that data lost when the storage cluster is abnormal can be recovered as much as possible.
In this embodiment, as an optional embodiment, repairing the copy, except for the copy with the largest data amount, in the abnormal storage cluster according to the copy with the largest data amount includes:
and if the copy with the largest data volume is multiple and inconsistent, selecting a third copy with the highest priority according to the priority of the copy with the largest data volume, and repairing the copies except the third copy in the abnormal storage cluster according to the third copy.
In this embodiment, if the number of the copies with the largest data amount is four, i.e., the copy 1 to the copy 4, and the data in the four copies are inconsistent, the priorities of the four copies are sorted, and if the priority of the copy 1 is the highest, the copies 2 to 4 are repaired according to the copy 1.
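A short Python sketch of this tie-break follows. It assumes, purely for illustration, that the data volume is the byte length of the copy and that a larger integer means a higher priority; `Copy` and `pick_largest` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Copy:
    name: str
    data: bytes
    priority: int  # assumption: a larger value means a higher priority


def pick_largest(copies: List[Copy]) -> Copy:
    """Pick the largest copy; break ties between inconsistent copies by priority."""
    biggest = max(len(c.data) for c in copies)
    candidates = [c for c in copies if len(c.data) == biggest]
    inconsistent = len({c.data for c in candidates}) > 1
    if len(candidates) > 1 and inconsistent:
        return max(candidates, key=lambda c: c.priority)
    return candidates[0]


chosen = pick_largest(
    [Copy("copy1", b"aaaa", 1), Copy("copy2", b"bbbb", 3), Copy("copy3", b"cc", 9)]
)
```

Copy 3 has the highest assumed priority but is ignored because it is not among the largest copies; the priority is consulted only among the tied, inconsistent candidates.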
In this embodiment, when the data volumes of the copies are the same, the data stored in the copies may still be the same or different. Therefore, as another optional embodiment, after it is determined that the data volumes of the copies in the first copy set are the same and before one copy in the first copy set is selected, the method further includes:
comparing the data of the copies with the same data quantity;
if the data of each copy is the same, executing the step of selecting one copy in the first copy set;
if the data of the copies are different, selecting a fourth copy with the highest priority according to the priorities of the copies with different data, and repairing the copies except the fourth copy in the abnormal storage cluster according to the fourth copy.
In practical applications, a copy may also be selected at random from the first copy set with the largest number of copies. Thus, as yet another optional embodiment, repairing the copies in the abnormal storage cluster according to the first copy set with the largest number of copies may further include:
randomly selecting a fifth copy from the first copy set;
and repairing the copies except the fifth copy in the abnormal storage cluster according to the selected fifth copy.
In this embodiment, because the version number information of the copies differs, several first copy sets may share the largest counted number of copies; for example, the numbers of copies corresponding to version number information V1.1.1 and to version number information V1.1.2 may both be the largest. Thus, as still another optional embodiment, the method further includes:
if the number of the first copy sets with the largest number of copies is multiple, selecting a sixth copy with the highest priority according to the priority of each copy in the multiple first copy sets, and repairing the copies except the sixth copy in the abnormal storage cluster according to the sixth copy.
In this embodiment, as a further optional embodiment, the method further includes:
if the number of the first copy sets with the largest number of copies is multiple, acquiring the data volume of each copy in the multiple first copy sets, and repairing the copies except the copy with the largest data volume in the storage cluster with the abnormality according to the copy with the largest data volume.
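The case of several equally large first copy sets can be sketched in Python as follows, using the priority-based variant. The tuple layout `(name, version, priority)` and the function name are hypothetical; the data-volume variant would only replace the final key with the data volume of each copy.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CopyInfo = Tuple[str, str, int]  # (name, version number, priority); assumed layout


def pick_when_tied(copies: List[CopyInfo]) -> str:
    """Pick the highest-priority copy across all largest version groups."""
    groups: Dict[str, List[CopyInfo]] = defaultdict(list)
    for copy in copies:
        groups[copy[1]].append(copy)
    largest = max(len(group) for group in groups.values())
    # Collect every copy that belongs to one of the tied largest sets.
    tied = [c for group in groups.values() if len(group) == largest for c in group]
    return max(tied, key=lambda c: c[2])[0]


sixth_copy = pick_when_tied(
    [("c1", "V1.1.1", 1), ("c2", "V1.1.1", 4), ("c3", "V1.1.2", 3), ("c4", "V1.1.2", 2)]
)
```

Both version groups contain two copies, so the sketch compares priorities across all four copies rather than within one group.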
In this embodiment, as an optional embodiment, after repairing the copy in the storage cluster where the exception occurs, the method further includes:
and recording the repaired copy information.
In this embodiment, recording the repaired copy information makes subsequent analysis easier. The recorded copy information includes the repaired data information, the version number information, and the like. As an optional embodiment, the recorded copy information may be sent to a preset alarm information center for unified storage.
In this embodiment, as another optional embodiment, the copy exception recovery policy includes the copy data volume priority policy. In that case, selecting one copy from the acquired copies according to the preset copy exception recovery policy and repairing the copies in the abnormal storage cluster according to the selected copy, as shown in Fig. 3 (the second flowchart of the sub-steps of step 102 shown in Fig. 1), includes:
step 301, counting the data volume of each copy;
step 302, extracting the copy with the largest data amount, and repairing the copies except the copy with the largest data amount in the storage cluster with the abnormality according to the copy with the largest data amount.
In this embodiment, as an optional embodiment, when the copy is updated each time, the latest transaction version number information is transmitted, and the transaction version number information is incremented in a single direction, so that the data volume of the copy can be obtained by obtaining the transaction version number information, that is, the larger the version number corresponding to the transaction version number information is, the newer the copy data corresponding to the version is, the larger the data volume is. As another alternative, the data size of the copy may also be obtained in other manners, for example, by obtaining the size information in the attribute of the copy.
As an optional embodiment, repairing the copies other than the copy with the largest data volume in the storage cluster in which the exception occurs according to the copy with the largest data volume includes:
if there are multiple copies with the largest data volume, comparing the data of those copies;
if the compared data are the same, selecting one of the copies with the largest data volume, and repairing the copies other than the copies with the largest data volume in the storage cluster in which the exception occurs according to the selected copy;
if the compared data are different, selecting a seventh copy with the highest priority according to the priority of each of the copies with the largest data volume, and repairing the copies other than the seventh copy in the storage cluster in which the exception occurs according to the seventh copy.
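The data volume priority policy and its tie-breaking rules can be sketched as follows. This is a minimal illustration, not the implementation of the application: the `Copy` type and its field names are hypothetical, the data volume is taken as the byte length of the copy (the "size attribute" alternative), and a larger `priority` value is assumed to mean a higher priority.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    node: str      # hypothetical node name, e.g. "host" or "standby-1"
    data: bytes    # copy contents; len(data) stands in for the data volume
    priority: int  # assumed convention: larger value = higher priority


def select_by_data_volume(copies: List[Copy]) -> Copy:
    """Return the copy that serves as the repair standard."""
    if not copies:
        raise ValueError("no copies acquired")
    largest = max(len(c.data) for c in copies)
    candidates = [c for c in copies if len(c.data) == largest]
    if len(candidates) == 1:
        return candidates[0]
    # Several copies tie on data volume: compare their data.
    if all(c.data == candidates[0].data for c in candidates):
        return candidates[0]  # identical data, so any one of them will do
    # Same volume but different data: fall back to copy priority.
    return max(candidates, key=lambda c: c.priority)
```

Every copy other than the returned one would then be overwritten from it; copies whose data already match the selected copy need no actual rewrite.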
In this embodiment, as a further optional embodiment, the copy exception recovery policy includes a copy priority policy. In this case, selecting one copy from the acquired copies according to the preset copy exception recovery policy, and repairing the copy in the storage cluster in which the exception occurs according to the selected copy, is shown in fig. 4, which is a third flowchart of the sub-steps of step 102 shown in fig. 1, and includes:
step 401, acquiring a copy priority of each copy;
step 402, sequencing the acquired copy priorities, selecting the copy with the highest copy priority, and repairing the copies except the copy with the highest copy priority in the abnormal storage cluster according to the copy with the highest copy priority.
In this embodiment, the copy priority of each copy is obtained, the obtained copy priorities are ranked, the copy corresponding to the highest ranked copy priority is selected, and other copies are repaired.
In this embodiment, as an optional embodiment, repairing the copies other than the copy with the highest copy priority in the storage cluster in which the exception occurs according to the copy with the highest copy priority includes:
if there are multiple copies with the highest priority, counting the data volume of each of those copies;
and extracting the copy with the largest data amount, and repairing the copy except the copy with the largest data amount in the abnormal storage cluster according to the copy with the largest data amount.
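A minimal sketch of the copy priority policy (steps 401 and 402) with its data volume tie-break is shown below. The `Copy` type, its field names, and the convention that a larger `priority` value ranks higher are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    node: str      # hypothetical node name
    priority: int  # assumed convention: larger value = higher priority
    size: int      # data volume of the copy, e.g. in bytes


def select_by_priority(copies: List[Copy]) -> Copy:
    """Return the copy that serves as the repair standard."""
    if not copies:
        raise ValueError("no copies acquired")
    # Step 402: sort the acquired priorities, highest first.
    ranked = sorted(copies, key=lambda c: c.priority, reverse=True)
    top = [c for c in ranked if c.priority == ranked[0].priority]
    if len(top) == 1:
        return top[0]
    # Several copies share the highest priority: take the largest data volume.
    return max(top, key=lambda c: c.size)
```

If the tied copies also have equal data volume, `max` simply returns the first of them; the text above does not state how that remaining tie should be resolved.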
Fig. 5 is a schematic structural diagram of a storage cluster-based copy exception recovery apparatus according to an embodiment of the present application. The apparatus is applied to a cluster comprising at least two cluster devices, where the at least two cluster devices comprise a host and a standby, and the cluster includes but is not limited to a Ceph distributed file storage cluster and a GlusterFS storage cluster. As shown in fig. 5, the apparatus includes a copy obtaining module 51 and a copy recovery module 52, wherein,
a copy obtaining module 51, configured to obtain the copies of the host and the standby in the storage cluster in which the exception occurs;
a copy recovery module 52, configured to select a copy from the acquired copies according to a preset copy exception recovery policy, and repair the copy in the storage cluster where the exception occurs according to the selected copy, where the copy exception recovery policy includes: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy.
In this embodiment, as an optional embodiment, the copy obtaining module 51 includes: a monitoring unit and a first copy acquisition unit (not shown), wherein,
the monitoring unit is configured to monitor that the storage cluster restarts after a power failure, and to notify the first copy acquisition unit after detecting that the storage cluster has finished loading the copies;
and the first copy acquisition unit is configured to acquire the copies of the host and the standby in the storage cluster respectively according to the received notification.
In this embodiment, as another optional embodiment, the copy obtaining module 51 includes: an abnormal information receiving unit and a second copy acquisition unit, wherein,
the abnormal information receiving unit is configured to receive preset abnormal information reported by an inspection tool, where the abnormal information is generated when the inspection tool finds, while the storage cluster is running, that copies are inconsistent;
and the second copy acquisition unit is configured to acquire the copies of the host and the standby in the storage cluster respectively according to the received abnormal information.
In this embodiment, as an optional embodiment, the copy recovery module 52 includes: a copy number counting unit and a first copy recovery unit (not shown), wherein,
the copy number counting unit is configured to count the number of copies that are consistent;
and the first copy recovery unit is configured to repair the copy in the storage cluster in which the exception occurs according to the first copy set with the largest number of copies.
In this embodiment, as an optional embodiment, the copy number counting unit includes: a version number information extraction component and a counting component, wherein,
the version number information extraction component is used for traversing each acquired copy and sequentially extracting copy version number information;
and the counting component is used for inquiring whether the extracted copy version number information is stored, if so, adding 1 to a copy number counting value corresponding to the stored copy version number information, if not, constructing a mapping relation between the extracted copy version number information and the copy number counting value, and setting the copy number counting value mapped by the extracted copy version number information as 1.
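The counting component amounts to building a map from copy version number to a count of copies. A minimal sketch follows; the function names are hypothetical, and version numbers are assumed to be plain integers.

```python
from typing import Dict, Iterable, List


def count_by_version(versions: Iterable[int]) -> Dict[int, int]:
    """Traverse the extracted version numbers and count copies per version."""
    counts: Dict[int, int] = {}
    for version in versions:
        if version in counts:
            counts[version] += 1  # version already stored: add 1 to its count
        else:
            counts[version] = 1   # new version: create the mapping, set count to 1
    return counts


def largest_sets(counts: Dict[int, int]) -> List[int]:
    """Return the version number(s) shared by the largest number of copies."""
    top = max(counts.values())
    return [version for version, n in counts.items() if n == top]


counts = count_by_version([7, 7, 6])
```

In Python the same mapping could be obtained with `collections.Counter`; the explicit loop is kept here to mirror the query-then-increment wording of the component. When `largest_sets` returns more than one version, several first copy sets tie and a further rule (such as data volume) is needed to choose among them.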
In this embodiment, as an optional embodiment, the first copy recovery unit includes: a comparison component, a first recovery component and a second recovery component, wherein,
the comparison component is used for comparing the data of each copy in the first copy set;
a first recovery component, configured to select one copy in the first copy set if data of each copy in the first copy set is the same, and repair copies, except the first copy set, in the storage cluster where the abnormality occurs according to the selected copy;
and a second recovery component, configured to, if the data of the copies in the first copy set are different, select a second copy with the highest priority according to the priorities of the copies in the first copy set, and repair the copies other than the second copy in the storage cluster in which the exception occurs according to the second copy.
In this embodiment, as an optional embodiment, when the data of the copies in the first copy set are different and before the second copy with the highest priority is selected according to the priorities of the copies in the first copy set, the second recovery component is further configured to:
acquire the data volume of each of the copies whose data differ;
if the acquired data volumes are the same, execute the operation of selecting a second copy with the highest priority according to the priority of each copy in the first copy set;
and if the acquired data volumes are different, repair the copies other than the copy with the largest data volume in the storage cluster in which the exception occurs according to the copy with the largest data volume.
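Taken together, the majority-consistent policy with its comparison, data volume, and priority fallbacks can be sketched as below. This is an illustrative reading under stated assumptions, not the implementation of the application: the `Copy` type and field names are hypothetical, copies are grouped by version number, the data volume is the byte length, a larger `priority` value ranks higher, first copy sets that tie on copy count are merged before the fallbacks apply, and a tie at the largest data volume is resolved by priority.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Copy:
    node: str      # hypothetical node name
    version: int   # copy version number
    data: bytes    # copy contents; len(data) stands in for the data volume
    priority: int  # assumed convention: larger value = higher priority


def select_majority_source(copies: List[Copy]) -> Copy:
    """Return the copy that serves as the repair standard."""
    if not copies:
        raise ValueError("no copies acquired")
    groups: Dict[int, List[Copy]] = defaultdict(list)
    for c in copies:
        groups[c.version].append(c)
    top = max(len(g) for g in groups.values())
    # First copy set: the version held by the most copies (tied sets merged).
    first_set = [c for g in groups.values() if len(g) == top for c in g]
    # Comparison component: identical data means any member can be used.
    if all(c.data == first_set[0].data for c in first_set):
        return first_set[0]
    # Data differs: prefer the largest data volume.
    largest = max(len(c.data) for c in first_set)
    biggest = [c for c in first_set if len(c.data) == largest]
    if len(biggest) == 1:
        return biggest[0]
    # Data volumes are equal as well: fall back to copy priority.
    return max(biggest, key=lambda c: c.priority)
```

The ordering of the fallbacks (data comparison, then data volume, then priority) follows the optional embodiment described above; other embodiments in the application check data volume first.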
In this embodiment, as another optional embodiment, the first copy recovery unit includes: a data volume acquisition component, a third recovery component, and a fourth recovery component, wherein,
the data volume acquisition component is used for acquiring the data volume of each copy in the first copy set;
a third recovery component, configured to select one copy in the first copy set if the data volumes of the copies in the first copy set are the same, and repair the copies, except the first copy set, in the storage cluster in which the abnormality occurs according to the selected copy;
and a fourth recovery component, configured to, if the data volumes of the copies in the first copy set are different, repair the copies other than the copy with the largest data volume in the storage cluster in which the exception occurs according to the copy with the largest data volume.
In this embodiment, as an optional embodiment, repairing the copy, except for the copy with the largest data size, in the storage cluster where the exception occurs according to the copy with the largest data size includes:
if there are multiple copies with the largest data volume and their data are inconsistent, selecting a third copy with the highest priority according to the priorities of the copies with the largest data volume, and repairing the copies other than the third copy in the storage cluster in which the exception occurs according to the third copy.
In this embodiment, as another optional embodiment, when the data volumes of the copies in the first copy set are the same and before one copy in the first copy set is selected, the third recovery component is further configured for:
comparing the data of the copies with the same data quantity;
if the data of each copy is the same, executing the step of selecting one copy in the first copy set;
if the data of the copies are different, selecting a fourth copy with the highest priority according to the priorities of the copies with different data, and repairing the copies except the fourth copy in the abnormal storage cluster according to the fourth copy.
In this embodiment, as another optional embodiment, the copy recovery module 52 includes: a data volume counting unit and a second copy recovery unit, wherein,
the data volume counting unit is configured to count the data volume of each copy;
and the second copy recovery unit is configured to extract the copy with the largest data volume, and repair the copies other than the copy with the largest data volume in the storage cluster in which the exception occurs according to the copy with the largest data volume.
In this embodiment, as a further optional embodiment, the copy recovery module 52 includes: a priority acquisition unit and a third copy recovery unit, wherein,
the priority acquisition unit is configured to acquire the copy priority of each copy;
and the third copy recovery unit is configured to sort the acquired copy priorities, select the copy with the highest copy priority, and repair the copies other than the copy with the highest copy priority in the storage cluster in which the exception occurs according to the copy with the highest copy priority.
In this embodiment, as an optional embodiment, the storage cluster-based copy exception recovery apparatus of this embodiment of the present application may serve as a separate storage cluster control device that operates in a storage cluster including a host, a standby, and the storage cluster control device, and performs exception monitoring on the entire storage cluster. Alternatively, the apparatus may be built into the host or the standby, operate in a storage cluster including the host and the standby, and perform exception monitoring on the entire storage cluster.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 6, an embodiment of the present application provides a computer device for executing the storage cluster-based copy exception recovery method in fig. 1 to fig. 4, where the computer device includes a memory 1000, a processor 2000, and a computer program stored in the memory 1000 and executable on the processor 2000, where the processor 2000 implements the steps of the storage cluster-based copy exception recovery method when executing the computer program.
Specifically, the memory 1000 and the processor 2000 may be a general-purpose memory and a general-purpose processor, which are not specifically limited herein. When the processor 2000 runs the computer program stored in the memory 1000, the storage cluster-based copy exception recovery method described above can be executed, which addresses the problem of low storage cluster availability in the prior art. Meanwhile, the service environment of the storage cluster is quickly repaired, and the service interruption time of the user is shortened.
Corresponding to the storage cluster-based copy exception recovery method in fig. 1 to 4, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the storage cluster-based copy exception recovery method.
Specifically, the storage medium may be a general storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the storage cluster-based copy exception recovery method can be executed, which addresses the problem of low storage cluster availability in the prior art. The present application selects a copy according to the copy exception recovery policy and automatically repairs the other copies using the selected copy as the standard. This avoids the time consumed by manually comparing and searching for inconsistent data, loses as little of the data in the copies as possible, and allows the storage cluster to recover quickly and provide services, thereby improving the availability of the storage cluster. Meanwhile, the service environment of the storage cluster is quickly repaired, and the service interruption time of the user is shortened.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for recovering copy exception based on a storage cluster is applied to a cluster comprising at least two cluster devices, wherein the at least two cluster devices comprise a host and a standby, and the method comprises the following steps:
acquiring copies of a host and a standby in the storage cluster in which the exception occurs, wherein the exception in the storage cluster is that the storage cluster suddenly loses power during normal operation, causing the data stored on the physical disks to be inconsistent;
selecting one copy from the acquired copies according to a preset copy exception recovery strategy, and repairing the copy in the storage cluster with the exception according to the selected copy, wherein the copy exception recovery strategy comprises the following steps: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy.
2. The method of claim 1, wherein the replica exception recovery policy comprises: a majority of consistent copy priority policy, wherein the selecting a copy from the acquired copies according to a preset copy exception recovery policy, and repairing the copy in the storage cluster in which the exception occurs according to the selected copy includes:
counting the number of copies that are consistent;
and repairing the copy in the storage cluster with the exception according to the first copy set with the maximum number of copies.
3. The method of claim 2, wherein the counting the number of copies that are consistent comprises:
traversing each acquired copy, and sequentially extracting copy version number information;
and inquiring whether the extracted copy version number information is stored or not, if so, adding 1 to a copy number count value corresponding to the stored copy version number information, if not, constructing a mapping relation between the extracted copy version number information and the copy number count value, and setting the copy number count value mapped by the extracted copy version number information as 1.
4. The method according to claim 2 or 3, wherein the repairing the copy in the storage cluster in which the exception occurs according to the first copy set with the largest number of copies comprises:
comparing the data of each copy in the first copy set;
if the data of each copy in the first copy set is the same, selecting one copy in the first copy set, and repairing the copies, except the first copy set, in the storage cluster with the abnormality according to the selected copy;
if the data of the copies in the first copy set are different, selecting a second copy with the highest priority according to the priority of the copies in the first copy set, and repairing the copies except the second copy in the storage cluster with the abnormality according to the second copy.
5. The method of claim 4, wherein before selecting a second copy with a highest priority according to the priority of the copies in the first copy set after the data of the copies in the first copy set are different, the method further comprises:
respectively acquiring the data volume of each copy of the different data;
if the obtained data volume is the same, executing the step of selecting a second copy with the highest priority according to the priority of each copy in the first copy set;
and if the acquired data volume is different, repairing the copies except the copy with the maximum data volume in the abnormal storage cluster according to the copy with the maximum data volume.
6. The method according to claim 2 or 3, wherein the repairing the copy in the storage cluster in which the exception occurs according to the first copy set with the largest number of copies comprises:
acquiring the data volume of each copy in the first copy set;
if the data volume of each copy in the first copy set is the same, selecting one copy in the first copy set, and repairing the copies except the first copy set in the abnormal storage cluster according to the selected copy;
and if the data volume of each copy in the first copy set is different, repairing the copy except the copy with the maximum data volume in the abnormal storage cluster according to the copy with the maximum data volume.
7. The method of claim 1, wherein the replica exception recovery policy comprises: the multi-copy data volume priority strategy, wherein according to a preset copy exception recovery strategy, selecting one copy from the acquired copies, and repairing the copy in the storage cluster with exception according to the selected copy comprises:
counting the data volume of each copy;
and extracting the copy with the largest data amount, and repairing the copy except the copy with the largest data amount in the abnormal storage cluster according to the copy with the largest data amount.
8. The method of claim 1, wherein the replica exception recovery policy comprises: a copy priority policy, wherein the selecting a copy from the acquired copies according to a preset copy exception recovery policy, and repairing the copy in the storage cluster in which the exception occurs according to the selected copy includes:
acquiring the copy priority of each copy;
and sequencing the acquired copy priority, selecting the copy with the highest copy priority, and repairing the copies except the copy with the highest copy priority in the abnormal storage cluster according to the copy with the highest copy priority.
9. A device for recovering copy exception based on storage cluster is characterized in that the device is applied to a cluster comprising at least two cluster devices, wherein the at least two cluster devices comprise a host and a standby, and the device comprises: a copy acquisition module and a copy recovery module, wherein,
the copy acquisition module is configured to acquire the copies of a host and a standby in the storage cluster in which the exception occurs, wherein the exception in the storage cluster is that the storage cluster suddenly loses power during normal operation, causing the data stored on the physical disks to be inconsistent;
a copy recovery module, configured to select a copy from the acquired copies according to a preset copy exception recovery policy, and repair the copy in the storage cluster where the exception occurs according to the selected copy, where the copy exception recovery policy includes: one or any combination of a majority consistent replica prioritization policy, a replica data volume prioritization policy, and a replica priority prioritization policy.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of the preceding claims 1 to 8 are implemented by the processor when executing the computer program.
CN201810460225.3A 2018-05-15 2018-05-15 Storage cluster-based copy exception recovery method and device and computer equipment Active CN108647118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810460225.3A CN108647118B (en) 2018-05-15 2018-05-15 Storage cluster-based copy exception recovery method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810460225.3A CN108647118B (en) 2018-05-15 2018-05-15 Storage cluster-based copy exception recovery method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN108647118A CN108647118A (en) 2018-10-12
CN108647118B true CN108647118B (en) 2021-05-07

Family

ID=63755400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810460225.3A Active CN108647118B (en) 2018-05-15 2018-05-15 Storage cluster-based copy exception recovery method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN108647118B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113687980B (en) * 2020-05-19 2024-03-01 北京京东乾石科技有限公司 Abnormal data self-recovery method, system, electronic device and readable storage medium
CN113434340B (en) * 2021-06-29 2022-11-25 聚好看科技股份有限公司 Server and cache cluster fault rapid recovery method
CN113687981B (en) * 2021-08-18 2023-12-22 济南浪潮数据技术有限公司 Data recovery method, device, equipment and storage medium
CN116155594B (en) * 2023-02-21 2023-07-14 北京志凌海纳科技有限公司 Isolation method and system for network abnormal nodes

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102025758A (en) * 2009-09-18 2011-04-20 成都市华为赛门铁克科技有限公司 Method, device and system fore recovering data copy in distributed system
CN103530205A (en) * 2013-10-23 2014-01-22 曙光信息产业(北京)有限公司 Method and device for processing fault duplicate in multiple duplicates
WO2014078997A1 (en) * 2012-11-21 2014-05-30 华为技术有限公司 Method and device for repairing data
CN104699567A (en) * 2013-10-21 2015-06-10 国际商业机器公司 Method and system for recovering data objects in a distributed data storage system
CN105550229A (en) * 2015-12-07 2016-05-04 北京奇虎科技有限公司 Method and device for repairing data of distributed storage system
CN106201788A (en) * 2016-07-26 2016-12-07 乐视控股(北京)有限公司 Copy restorative procedure and system for distributed storage cluster
CN107977578A (en) * 2016-10-25 2018-05-01 中兴通讯股份有限公司 A kind of distributed memory system and its data recovery method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033786B (en) * 2010-11-04 2013-02-06 曙光信息产业股份有限公司 Method for repairing consistency of copies in object storage system
CN102024016B (en) * 2010-11-04 2013-03-13 曙光信息产业股份有限公司 Rapid data restoration method for distributed file system (DFS)
US9588847B1 (en) * 2014-03-25 2017-03-07 EMC IP Holding Company LLC Recovering corrupt virtual machine disks
CN105430327A (en) * 2015-11-05 2016-03-23 成都基业长青科技有限责任公司 NVR cluster backup method and device
CN106293980A (en) * 2016-07-26 2017-01-04 乐视控股(北京)有限公司 Data recovery method and system for distributed storage cluster

Also Published As

Publication number Publication date
CN108647118A (en) 2018-10-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant