CN114741220B - Disk isolation method, system, device and storage medium - Google Patents

Disk isolation method, system, device and storage medium Download PDF

Info

Publication number
CN114741220B
CN114741220B (application CN202210329750.8A)
Authority
CN
China
Prior art keywords
disk
view
state
preset
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210329750.8A
Other languages
Chinese (zh)
Other versions
CN114741220A (en)
Inventor
邹全
徐文豪
王弘毅
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SmartX Inc
Original Assignee
SmartX Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SmartX Inc filed Critical SmartX Inc
Priority to CN202210329750.8A priority Critical patent/CN114741220B/en
Publication of CN114741220A publication Critical patent/CN114741220A/en
Application granted granted Critical
Publication of CN114741220B publication Critical patent/CN114741220B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16: Error detection or correction of the data by redundancy in hardware
    • G06F 11/20: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/202: Error detection or correction of the data by redundancy in hardware using active fault-masking where processing functionality is redundant
    • G06F 11/2023: Failover techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706: Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/0727: Error or fault processing not based on redundancy, the processing taking place in a storage system, e.g. in a DASD or network based storage system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1402: Saving, restoring, recovering or retrying
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 11/1458: Management of the backup or restore process

Abstract

The invention provides a disk isolation method, system, device and storage medium. The disk isolation method comprises the following steps: according to the working state of the disk on each node in the cluster, a disk in an abnormal state is marked as a failed disk, and its abnormal state and abnormal data are stored in a central database; a view state result for the failed disk is acquired according to the preset isolation constraint conditions of the failed disk; under the condition that the failed disk is updated into the view, the failed disk is marked as a disk to be isolated; the storage system then isolates the disk to be isolated and deducts the storage capacity corresponding to the isolated disk. With this technical scheme, the storage system can be effectively warned before the failed disk is isolated and before data loss actually occurs, the data on the disk can be processed in advance, and the safety of data copies is guaranteed while the overall IO performance of the cluster is maintained.

Description

Disk isolation method, system, device and storage medium
Technical Field
The invention belongs to the technical field of data storage, and particularly relates to a disk isolation method, system, device and storage medium.
Background
Disks are where data is ultimately stored in a distributed storage system, so an abnormal disk state may affect system processes in the storage system. In particular, when a disk enters an abnormal state during operation but no read-write error has yet occurred, data migration off the failed disk cannot be performed in time and the failed disk cannot be isolated in time. This can significantly degrade the read-write performance of the storage system and, in turn, the overall IO performance of the cluster.
In the prior art, common disk detection and isolation are usually performed for a single disk on a single server. Such methods can only judge, or pre-judge, the disk of a single node; capacity information reported by a single node to the cluster master control center may be delayed, so the security of data copies across the multiple nodes of a cluster cannot be guaranteed. Isolating a failed disk in an abnormal state may itself threaten the security of data copies, and for data on the nodes of a multi-node cluster, no timely early warning can be given to the storage system when a disk becomes abnormal, which greatly reduces the security of the data on the disk.
Meanwhile, when the data storage service of the storage system detects a failed disk and directly removes it, events such as the crash of the storage service on a single node, disconnection of a single node's storage network, disk failure, or the disk directly disappearing at the operating-system level all cause the cluster capacity to shrink.
Disclosure of Invention
To address the problems in the prior art that a failed disk in an abnormal state cannot be isolated in time, data migration is untimely, the overall IO performance of the cluster is affected, and the data security of the disk is reduced, a disk isolation method, system, device and storage medium are provided.
In a first aspect of the present application, a disk isolation method is provided, which specifically includes:
acquiring the working state of a disk of each node in a cluster, wherein the working state comprises a normal state and an abnormal state;
under the condition that any disk of a node is in an abnormal state, marking the disk as a failed disk, and storing the abnormal state and abnormal data of the failed disk in a central database;
acquiring a view state result of the fault disk according to a preset isolation constraint condition of the fault disk;
and under the condition that the view state result indicates that the marked failed disk is a disk to be isolated, the storage system receives a disk isolation instruction and isolates the disk to be isolated.
In a possible implementation of the first aspect, the preset isolation constraint conditions of the failed disk include:
a preset disk capacity constraint condition, which includes: a constraint of a first preset capacity threshold and a constraint of a second preset capacity threshold;
and a preset view change condition, wherein the preset view change condition is determined according to the view change actions of the nodes.
In one possible implementation of the first aspect described above,
acquiring a view state result of the failed disk according to a preset isolation constraint condition of the failed disk comprises the following steps:
acquiring residual storage of a cluster whole disk and accumulated deduction storage of the cluster whole disk;
judging whether the residual storage of the cluster whole disk is larger than a first preset capacity threshold value or not, and marking the fault disk larger than the first preset capacity threshold value as a first abnormal disk;
judging whether the accumulated deduction storage of the cluster whole disk is smaller than a second preset capacity threshold value or not, and marking the first abnormal disk smaller than the second preset capacity threshold value as a second abnormal disk;
judging whether the view change of the node to the second abnormal disk meets a preset view change condition or not, and updating the second abnormal disk meeting the preset view change condition to the view;
and acquiring the view state result of the second abnormal disk, and displaying the view state result of the failed disk as the disk to be isolated.
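The constraint checks in the steps above can be sketched as follows. This is a hypothetical illustration, not the patent's implementation; the function name, return labels, and threshold semantics are assumptions drawn from the surrounding text.

```python
# Hypothetical sketch of the claimed constraint checks: remaining storage
# must exceed the first threshold, cumulative deducted storage must stay
# below the second threshold, and the view change must not have been
# preempted by another node in the current period.
def view_state_for_failed_disk(remaining, deducted, t1, t2, view_change_ok):
    """Return the view-state outcome for a failed disk this period."""
    if remaining <= t1:        # first preset capacity threshold violated
        return "keep"          # isolating would starve the cluster of space
    if deducted >= t2:         # second preset capacity threshold violated
        return "keep"          # too much capacity already deducted
    if not view_change_ok:     # preset view change condition not met
        return "retry"         # another node changed the view this period
    return "to_be_isolated"
```

Note the checks are ordered as in steps S32a to S34a: capacity constraints first, view change condition last.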
In a possible implementation of the first aspect, determining that the node's view change for the second abnormal disk satisfies the preset view change condition includes:
acquiring a first view state of a view saved by a node and a second view state of a view of which the node executes a view change action on a second abnormal disk in the current period,
and under the condition that the first view state is consistent with the second view state, the view change of the node to the second abnormal disk meets the preset view change condition.
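The requirement that the saved view state match the view state observed when the change is executed reads like optimistic concurrency control against the central database. The minimal class below is a hedged sketch of that idea; all names are hypothetical, not from the patent.

```python
# Hedged sketch: a version check models "first view state equals second
# view state" for a node changing the central view.
class CentralView:
    def __init__(self):
        self.version = 0   # bumped on every successful view change
        self.disks = {}    # disk id -> recorded state

    def try_update(self, saved_version, disk_id, new_state):
        # Preset view change condition: the view state the node saved must
        # equal the view state it observes now, in the current period.
        if saved_version != self.version:
            return False   # another node changed the view first; retry
        self.disks[disk_id] = new_state
        self.version += 1
        return True
```

The version bump ensures that when two nodes race to change the view in the same period, only the first succeeds and the second must re-read the view.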
In one possible implementation of the first aspect described above,
obtaining the view state result of the failed disk according to the preset isolation constraint condition of the failed disk further comprises:
the view change operations executed by the nodes update the view sequentially, accumulating changes over non-overlapping periods, and the view state result of the failed disk is acquired.
In one possible implementation of the first aspect described above,
the condition of marking the fault disk as the disk to be isolated further comprises:
and acquiring the time sequence relation of the abnormal data of the fault disk according to a period, and marking the fault disk meeting the preset isolation constraint condition as the disk to be isolated.
In one possible implementation of the first aspect described above,
the condition of marking the fault disk as the disk to be isolated further comprises:
under the condition that at least two disks corresponding to the same data copy are marked as fault disks, any fault disk meeting preset isolation constraint conditions is marked as a first disk to be isolated;
and under the condition that the backup of the data copies of the first disk to be isolated is completed, marking the remaining failed disk as a second disk to be isolated.
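The replica-aware ordering above can be sketched as follows: when two or more failed disks hold copies of the same data, one is marked for isolation first and the others are deferred until that data copy has been backed up. The `replica_map` and `backup_done` inputs are illustrative assumptions.

```python
# Hypothetical sketch of sequential isolation for disks sharing a replica:
# the second disk of a replica group waits until the group's data copy
# is backed up before it may be marked for isolation.
def isolation_order(failed_disks, replica_map, backup_done):
    first, deferred = [], []
    seen_groups = set()
    for disk in failed_disks:
        group = replica_map[disk]
        if group in seen_groups and group not in backup_done:
            deferred.append(disk)   # wait until the copy is backed up
        else:
            first.append(disk)
            seen_groups.add(group)
    return first, deferred
```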
A second aspect of the present application provides a disk isolation system, which is applied to the disk isolation method of the first aspect, and the disk isolation system specifically includes:
a disk state acquisition module: acquiring the working state of a disk of each node in a cluster, wherein the working state comprises a normal state and an abnormal state;
a disk exception storage module: under the condition that any disk of a node is in an abnormal state, marking the disk as a failed disk, and storing the abnormal state and abnormal data of the failed disk in a central database;
a disk exception handling module: acquiring a view state result of the fault disk according to a preset isolation constraint condition of the fault disk;
a disk exception isolation module: under the condition that the view state result indicates that the marked failed disk is a disk to be isolated, the storage system receives a disk isolation instruction and isolates the disk to be isolated.
A third aspect of the present application provides an electronic device comprising:
a memory for storing a processing program;
a processor, wherein the disk isolation method according to the first aspect is applied when the processor executes the processing program.
A fourth aspect of the present application provides a readable storage medium on which a processing program is stored; when the processing program is executed by a processor, the disk isolation method according to the first aspect is applied.
The application has the following beneficial technical effects:
according to the technical scheme, the abnormal states of the disks of the nodes in the storage cluster can be stored based on the central database, the states of all the nodes in the current cluster are combined, the fault disks in the abnormal states are processed through multiple nodes, the fault disks meeting preset isolation conditions are marked as disks to be isolated, in each detection period, the storage capacity of the current disk and the accumulated deduction storage of the whole disks of the cluster are judged in real time, then the storage system is isolated timely from the fault disks, the early warning is effectively provided for the storage system before the real occurrence of data loss, the data on the disks are processed in advance, and the safety of data copies is guaranteed while the good overall IO performance of the cluster is maintained.
Drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments thereof, which is to be read in connection with the accompanying drawings.
FIG. 1 illustrates a flow diagram of a disk isolation method, according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram of a preset isolation constraint, according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a flow chart of whether a view change of a failed disk meets a preset view change according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a process of marking a failed disk as a disk to be isolated according to an embodiment of the present application;
FIG. 5 illustrates a system block diagram of a disk isolation system, according to an embodiment of the present application.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will aid those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any manner. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
The term "include" and variations thereof as used herein are meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
In order to solve the problems that the safety of a disk storage data copy is reduced due to untimely disk isolation, the overall IO performance of a cluster is reduced due to the read-write delay of a fault disk and the like in the prior art, the application provides a disk isolation method, a disk isolation system, a disk isolation device and a storage medium. By the disk isolation method, early warning can be effectively provided for the storage system before data loss actually occurs, data on the disk is processed in advance, and the safety of data copies is guaranteed while the overall IO performance of the cluster is kept better.
Specifically, fig. 1 shows a schematic flowchart of a disk isolation method according to some embodiments of the present application, which specifically includes:
step S1: and acquiring the working state of the disk of each node in the cluster, wherein the working state comprises a normal state and an abnormal state. It is understood that the life cycle of the disks in the distributed storage system generally goes through three phases, the first phase is a phase without any exception, and is generally called a normal state; the second stage is a stage in which various abnormalities occur, such as bad track of a magnetic disk, high temperature, slow disk, stuck, various read-write errors and the like, and is generally called as a sub-health state; the third stage is a fault stage that the disk function is in a stagnation state, and the disks in the second or third stage are in an abnormal state.
Step S2: under the condition that any disk of a node is in an abnormal state, mark the disk as a failed disk, and store the abnormal state and abnormal data of the failed disk in the central database. It can be understood that for a disk in the second-phase abnormal state, information such as its IO-related data and logs affects the IO performance of the storage system, and the disk will fail within a period of time; such a disk can be regarded as a failed disk. Multiple disks on multiple nodes in the cluster store the corresponding data copies and begin operations such as data migration. The specific abnormal state of the failed disk and the corresponding abnormal data are stored and marked in the central database and written into the view, so that the failed disk can be considered for isolation in combination with the states of the other nodes in the cluster. This avoids the risks of handling a failure only after it occurs: for example, if a disk is processed only after it fails, another disk may fail during the processing, causing problems such as data loss from multi-point failures.
Step S3: acquire the view state result of the failed disk according to the preset isolation constraint conditions of the failed disk. It can be understood that, to ensure the overall data security of the cluster, when a node of the cluster fails, each node needs to know the current state of the other nodes before marking any disk as failed. It must be judged whether the storage space of the cluster can still support normal data reading and writing if the failed disk is isolated, and whether most disks of the cluster are already in isolation: if further disks were isolated, the space of the storage system could fall below the storage space provided by a single node in the cluster, resulting in poor storage-system performance. Furthermore, the safety of data copies is reduced when too many disks are isolated at the same time, or when disks storing the same data copy are isolated one after another. Therefore, disk isolation in the storage system is determined according to the preset isolation constraint conditions of the failed disk and, when those conditions are met, further according to the view state result of the failed disk.
Step S4: under the condition that the view state result indicates that the marked failed disk is a disk to be isolated, the storage system receives a disk isolation instruction and isolates the disk to be isolated. It can be understood that when the view state result indicates the failed disk is a disk to be isolated, the storage system is notified according to the current view state: a corresponding isolation request is generated and an intervention request is sent to the storage system. The storage system receives the corresponding isolation instruction, that is, it responds to the intervention, then isolates the disk to be isolated and deducts the storage capacity corresponding to the isolated disk. When the view state is successfully updated and the view update action brings the failed disk into consideration for isolation, all nodes need to know this information; the storage system intervenes, performs the disk isolation, and deducts the storage capacity of the disk to be isolated, after which the isolated disk no longer participates in the corresponding data storage.
In some embodiments of the present application, when the monitored data of a disk does not meet a preset standard value, the disk is considered to be in a fault state; at this time the disk is in the second-phase abnormal state. The specific content of the monitored data, and the number of abnormal information items obtainable through one type of monitoring operation, should be determined according to actual requirements and are not specifically limited herein. For example, read-write delay can be detected when the measured read-write performance of the disk is below a preset read-write performance index; other examples include S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) monitoring, sensor counter monitoring, operating-system abnormal-log monitoring, disk expansion-card state monitoring, and motherboard out-of-band system monitoring. If such data does not satisfy the preset threshold for the disk's normal operating state, the disk is determined to be in an abnormal state.
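The monitors listed above each compare a measured value against a preset threshold. The sketch below aggregates such checks; all metric names and thresholds are assumptions for illustration, not from the patent.

```python
# Illustrative only: aggregate threshold checks over monitored metrics
# (read-write latency, S.M.A.R.T. counters, OS abnormal logs, ...).
def disk_state(metrics, thresholds):
    """Return 'abnormal' if any monitored value exceeds its preset limit."""
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            return "abnormal"
    return "normal"
```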
In step S3, the preset isolation constraint conditions of the failed disk include a preset disk capacity constraint condition and a preset view change condition; the preset disk capacity constraint condition includes a constraint of a first preset capacity threshold and a constraint of a second preset capacity threshold. It can be understood that the disk storage capacity of the nodes in a cluster is gradually reduced as disk capacity is isolated. For isolation of failed disks in an abnormal state, the disk storage space of the multiple nodes in the cluster and the space of isolated disks or other disks that cannot store data copies must be considered. Only when the view complies with these two constraints are disks in the cluster isolated, so that isolating failed disks does not affect the operation of the storage system or the security of the data copies on the disks.
In the above embodiment, the preset view change condition is determined according to the view changes of the nodes. The actions each node performs to change the view include at least: updating a detected failed disk into the view; removing from the view a disk that an administrator has removed from the cluster; and, according to the storage system's processing of the disk, updating the view state along the state-machine path and writing the next state, for example marking a disk that satisfies the preset isolation constraint conditions as a disk to be isolated.
In some embodiments of the present application, a specific implementation of step S3 above will be further explained and illustrated below:
specifically, fig. 2 shows a schematic flow chart of a preset isolation constraint condition according to an embodiment of the present application, which may be applied to step S3, and the obtaining a view state result of a failed disk according to the preset isolation constraint condition of the failed disk further includes:
step S31a: and acquiring the residual storage of the cluster whole disk and the accumulated deduction storage of the cluster whole disk. It can be understood that, in order to ensure the overall data security of the cluster, before each node changes the view of the central database storing the abnormal data of the failed disk, it is necessary to perform a preliminary calculation on the disk capacity of the cluster due to the possibility of isolating the failed disk in the cluster, so as to perform a preliminary evaluation on the working performance of the storage system and the security of the data copy in the case that the failed disk is isolated.
Step S32a: and judging whether the residual storage of the cluster whole disk is larger than a first preset capacity threshold, and marking the fault disk larger than the first preset capacity threshold as a first abnormal disk. It can be understood that, first, the remaining capacity of the cluster whole disk after deducting the capacity of the failed disk corresponding to the current node needs to be obtained, whether the remaining storage of the cluster whole disk is greater than a first preset capacity threshold is compared, and when the remaining storage of the cluster whole disk is greater than the first preset capacity threshold, it is determined that the failed disk meets a first condition of a preset abnormal flag, that is, the failed disk is marked as a first abnormal disk.
In some embodiments of the present application, the remaining storage of the cluster whole disk may be the remaining storage space of the disk, or may also be the occupancy proportion of the remaining storage space of the cluster whole disk in the cluster, which is not limited herein.
In some embodiments of the present application, the first preset capacity threshold may be preset according to an actual storage capacity space of a disk or a data copy storage capacity of a current storage system, and a capacity value of a specific storage space may also be a preset value of a percentage of remaining storage of a cluster whole disk, and is set according to the remaining storage of the cluster whole disk, which is not limited herein.
In some embodiments of the present application, in a case that the cluster whole disk remaining storage is the cluster whole disk remaining storage space, the cluster whole disk remaining storage is equal to the cluster current remaining storage capacity except for the capacities of all disks that need to be isolated in the view and the capacities of all disks that have been subjected to the intervention processing by the storage system in the view, and at this time, the first preset capacity threshold may be set according to the current cluster storage capacity space and the current data storage capacity.
In some embodiments of the present application, when the cluster whole disk remaining storage is expressed as an occupancy ratio, it equals the ratio of the remaining storage capacity after deducting the capacity of the failed disk to the total storage capacity of the cluster remaining after deducting the capacity of the failed disk. The first capacity threshold is then a preset ratio; for example, this ratio may be required to be at least greater than 0.1.
Further, the remaining total storage capacity of the cluster after deducting the capacity of the failed disk equals the current total capacity of the cluster minus the capacity of all disks needing isolation in the view and the capacity of all disks in the view on which the storage system has already intervened.
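The ratio form of the first capacity constraint can be worked through numerically. The function and argument names below are illustrative assumptions; only the 0.1 floor comes from the text.

```python
# Worked example of the ratio form of the first capacity constraint
# (Step S32a): after deducting the failed disk and disks already isolated
# or under intervention, the free-to-total ratio must exceed a preset floor.
def remaining_ratio(total_cap, used_cap, deducted_cap):
    remaining_total = total_cap - deducted_cap   # total after deductions
    remaining_free = remaining_total - used_cap  # free space left for copies
    return remaining_free / remaining_total

# e.g. 100 units total, 60 used, 20 deducted: (100-20-60)/(100-20) = 0.25,
# which exceeds the 0.1 floor, so the constraint is satisfied.
```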
Step S33a: judge whether the cumulative deducted storage of the cluster whole disk is smaller than the second preset capacity threshold, and mark a first abnormal disk smaller than the second preset capacity threshold as a second abnormal disk. It can be understood that a second constraint also applies to a disk marked as the first abnormal disk: the first abnormal disk must be counted as an isolated disk that cannot store data and be included in the cluster's cumulative disk deduction. This ensures that the storage system does not end up with so many disks in isolation that the capacity a single node provides to the storage pool becomes insufficient for the data storage of the current second abnormal disk. The cumulative deducted storage of the cluster whole disk is therefore further calculated, and when it is smaller than the second preset capacity threshold, the failed disk is determined to satisfy the second condition of the preset abnormal mark, that is, it is marked as a second abnormal disk.
In some embodiments of the present application, the second preset capacity threshold may be preset according to a storage capacity space that is obtained by deducting the disk accumulatively, and may be set for an accumulated deduction storage capacity value of a specific storage space according to an actual cluster overall disk accumulated deduction storage, which is not limited herein.
In some embodiments of the present application, the current aggregate deduction storage of the cluster disks is equal to the capacities of all disks to be isolated in the view of the current central database, the capacities of all disks in the view in which the storage systems have been involved in processing, and the capacities of all subtracted disks in the view, and the second preset threshold may be compared with the storage capacities of the single nodes in the cluster according to the aggregate deduction storage of the cluster disks. For example, the second threshold may be set as a storage capacity of a single node in the cluster, which is not limited herein.
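The cumulative-deduction check can be sketched as a sum over view entries. This is a hedged illustration: the state labels and entry shape are assumptions, and the threshold follows the example of a single node's storage capacity.

```python
# Hedged sketch of the cumulative-deduction check (Step S33a): deducted
# capacity sums disks pending isolation, disks the storage system has
# already intervened on, and disks already subtracted in the view, and
# must stay below a threshold such as a single node's capacity.
def deduction_ok(view_entries, single_node_cap):
    deducted = sum(e["cap"] for e in view_entries
                   if e["state"] in ("to_isolate", "intervened", "subtracted"))
    return deducted < single_node_cap
```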
Step S34a: judge whether the node's view change for the second abnormal disk satisfies the preset view change condition, and update a second abnormal disk that satisfies the preset view change condition into the view. It can be understood that when each node in the cluster detects a failed disk, the detected failed disk is added into the view, and a failed disk added into the view undergoes a corresponding view change when the corresponding change conditions are met. To avoid two or more nodes, unaware that the view has already been changed by another node, issuing view change actions at the same time and changing the view state multiple times within the same period, it must be further judged whether the view change to be executed for the second abnormal disk satisfies the preset view change condition, which is described in detail below.
Step S35a: acquire the view state result of the updated second abnormal disk, and display the view state result of the failed disk as a disk to be isolated. As noted above, the actions each node performs on the view include at least updating the view, removing from the view, and writing along the view state-machine path; here the view state is updated to the disk-to-be-isolated state, and the view state result of the second abnormal disk is obtained from the view state update of the central database. When the second abnormal disk satisfies the preset view change condition, that is, the current view state has not been changed by another node in the current period and the state of the corresponding view is still the view state saved in the previous period, the node executes the view update request and can obtain the view state result based on updating the view for the second abnormal disk. By contrast, when a view entry is removed or a failure entry is added, the preset view change condition does not need to be met, and view changes such as adding to or removing from the view can be executed directly.
In some embodiments of the present application, when a failed disk in an abnormal state is detected, no data has actually been lost yet, and the safety of the subsequent data copies must be ensured by the storage system itself. When the view state of the failed disk is updated, especially when the disk is marked as to be isolated, the remaining storage of the whole cluster disk and the accumulated deduction storage of the whole cluster disk must be considered, together with the preset view change condition. The view changes may cover the states of: failure detected, failed disk needs to be isolated, storage system has intervened, disk capacity has been deducted, disk data has been backed up, and so on. Table 1 below describes, for each view state change, whether the remaining storage of the whole cluster disk and the accumulated deduction storage of the whole cluster disk need to be considered:
TABLE 1
(Table 1 is provided as an image in the original publication and is not reproduced here.)
In the above embodiment, steps S31a to S35a are executed in sequence. When the view is changed for a failed disk and the view state update result indicates that the failed disk needs to be isolated, the newly failed disk is marked as needing isolation, and the remaining storage of the whole cluster disk and the accumulated deduction storage of the whole cluster disk must be considered. Further, when the preset change condition is met, the newly failed disk is marked as a disk to be isolated, and the storage system performs intervention processing to back up the data and isolate the disk. A disk isolated by the storage system no longer stores data and is no longer counted directly as storage, but as accumulated deduction storage of the disk.
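The two capacity constraints that gate steps S31a to S35a can be sketched as a simple filter. This is a hedged illustration under assumed names; the patent does not give concrete threshold values or data structures.

```python
def find_isolation_candidates(failed_disks, cluster_remaining, cluster_deducted,
                              first_threshold, second_threshold):
    """Hypothetical sketch of the capacity checks in steps S32a-S33a.

    A failed disk becomes a 'first abnormal disk' only if the cluster's
    remaining storage exceeds the first preset capacity threshold, and a
    'second abnormal disk' (a candidate for the view update) only if the
    cluster's accumulated deduction storage is still below the second
    preset capacity threshold."""
    if cluster_remaining <= first_threshold:
        return []  # not enough spare capacity in the cluster to isolate anything
    if cluster_deducted >= second_threshold:
        return []  # too much capacity already deducted from the cluster
    # All failed disks pass both cluster-wide checks and may enter the view.
    return list(failed_disks)
```

Both checks are cluster-wide rather than per-disk, which matches the text: the decision depends on the whole cluster's remaining and deducted storage, not on the individual disk's size.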
In the foregoing embodiment, further to step S35a, fig. 3 shows a flowchart in which the view change of a failed disk meets the preset view change condition according to an embodiment of the present application. Specifically, the step of making the view change of the node to the failed disk meet the preset view change condition includes:
Step S35a1: acquiring a first view state of the view saved by the node, and a second view state of the view on which the node executes the view change action for the second abnormal disk in the current period. It can be understood that, before executing a view change, it must be determined whether any other node has changed the current view state within the period, so as to further determine whether the to-be-isolated state of the second abnormal disk can be written into the view. Because the nodes change the view independently of one another, each node records a state number of the view it last acquired. When the node queries and updates the view again using this number, if the view state has changed compared with the state recorded by the node, the view has already been changed by another node in the cluster, and the current node's view change fails. In this case the failed disk does not update the corresponding view; the current node must read the new view, perform a new round of constraint calculation based on it, and retry the view change operation.
Step S35a2: when the first view state is consistent with the second view state, the view change of the node to the second abnormal disk meets the preset view change condition. It can be understood that, if no other node has participated in a change, the node may change the view in this period. Otherwise, the node abandons the change in this period, reads the view again, recalculates the state result in the next period, performs the same judgment, and executes a new change.
In the above embodiment, specifically, for example, suppose data node 1 detects a failed disk and marks it as a second abnormal disk. Whether the second abnormal disk can be stored in the central database to implement the view change, and whether it can be marked as a disk to be isolated, is decided as follows. Before node 1 performs the view change, it reads the state of the view to be changed as state A1; since this is consistent with the view state A1 recorded in the previous period, that is, the detection period, the node considers that the current view has not been changed by another node. Node 1 may therefore perform the view change and write the detected second abnormal disk into the current view, at which point the view state changes from A1 to B1. Within the same detection period, the view state B1 is not changed again; further, when the view state is the to-be-isolated state, the state of the current view is not changed before the storage system intervenes.
In some embodiments of the present application, a specific implementation of step S3 above is further explained as follows. Obtaining the view state result of the failed disk according to the preset isolation constraint condition of the failed disk further includes: the view change operations executed by the nodes update the view sequentially, as accumulated changes over non-overlapping periods, to obtain the view state result of the failed disk. It can be understood that only one node can mark one disk as to be isolated within one period of that node; multiple nodes can mark multiple disks over multiple periods, but every view change operation is such an accumulated change over non-overlapping periods.
Specifically, for example, suppose node 1 and node 2 are each in their own probing period, the period times T1 and T2 overlap, and in their current periods both node 1 and node 2 need to change the view state from A1. When node 1 changes the view state from A1 to B1 and completes execution before node 2's change occurs, the new view state B1 has not yet reached node 2, which still stores view state A1. Node 2's query against the current view therefore fails, and its change from view state A1 fails; node 2 then reads view state B1, saves it, and re-executes the query and the change in the next period.
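The retry behavior in this example is essentially an optimistic compare-and-set against the central database. The sketch below models it with an in-memory stand-in; the class and method names are assumptions, not the patent's API.

```python
class CentralView:
    """Minimal in-memory stand-in for the view record in the central
    database. compare_and_change models the preset view change condition:
    a node's change succeeds only if the view state it saved in the
    previous period still matches the current state."""

    def __init__(self, state: str):
        self.state = state

    def compare_and_change(self, expected: str, new_state: str) -> bool:
        if self.state != expected:
            return False  # another node changed the view; caller rereads and retries
        self.state = new_state
        return True

view = CentralView("A1")
# Node 1 and node 2 both saved state A1 in the previous period.
node1_ok = view.compare_and_change("A1", "B1")  # node 1's change succeeds
node2_ok = view.compare_and_change("A1", "B2")  # node 2's change fails; it rereads B1
```

In a real deployment this check would be a conditional transaction in the central database rather than an in-process comparison, so that the compare and the write are atomic across nodes.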
In some embodiments of the present application, marking the failed disk as the disk to be isolated further includes: acquiring, per period, the time sequence relation of the abnormal data of the failed disks, and marking the failed disk that meets the preset isolation constraint condition as the disk to be isolated. It can be understood that, in a view change, within one complete period of any node, at most one disk can have its state changed, in time sequence order, from failure detected to needing isolation, and be marked as a disk to be isolated. When a disk is marked as a disk to be isolated, the remaining storage and the accumulated deduction storage of the whole cluster disk must be further judged; when multiple disks in the whole cluster need to be considered, the failed disk that meets the preset isolation condition is selected for isolation marking according to the time sequence relation. Changes to other states, including failure detected, storage system intervention, and disk capacity deducted, can be performed for multiple disks at the same time.
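The "at most one disk per period, chosen by time order" rule can be sketched as follows. The disk representation and field names are assumptions for illustration only.

```python
def pick_disk_to_isolate(failed_disks, meets_constraints):
    """Hypothetical sketch: within one complete period, select at most one
    failed disk to move to the to-be-isolated state. Among the disks that
    satisfy the preset isolation constraints, the one whose abnormal data
    was recorded earliest is chosen, following the time sequence relation.

    failed_disks: list of dicts with a 'failed_at' timestamp (assumed shape)
    meets_constraints: predicate applying the capacity and view conditions
    """
    eligible = [d for d in failed_disks if meets_constraints(d)]
    if not eligible:
        return None  # no disk is marked in this period
    return min(eligible, key=lambda d: d["failed_at"])
```

Returning a single disk (or none) per call enforces the one-disk-per-period invariant; other state changes, such as recording a newly detected failure, need not go through this selection.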
Fig. 4 shows a schematic flowchart of a process for marking a failed disk as a disk to be isolated according to an embodiment of the present application. The conditions for marking the failed disk as the disk to be isolated further include:
Step S31b: when at least two disks corresponding to the same data copy are marked as failed disks, any one failed disk that meets the preset isolation constraint condition is marked as a first disk to be isolated. It can be understood that, after the second abnormal disk is marked but before the storage system has intervened, the data copy on the second abnormal disk is still in a state that may change at any time. When the second abnormal disk is updated into the view, it may happen that two disks bearing two copies of the same data block are isolated one after another on at least two nodes in the cluster. The storage system must therefore clearly sense the current state of its multiple data copies, to avoid the situation in which the redundancy of the disks' data copies is insufficient and the data copies are not migrated off the failed disk in time, resulting in data loss.
Specifically, the data copies within one period can be obtained by continuously taking snapshots at given moments, so that the isolated data copies within one period can be obtained from the snapshot at a given moment.
Step S32b: when the data copy of the first disk to be isolated has been completely backed up, the marking of a second disk to be isolated is executed. It can be understood that, while migration of the data copy stored on the second abnormal disk has not been completed, the data copy must still be allowed to be read when a read-write request for the data block arrives, and writes may be performed directly on other non-isolated disks. Only after the first disk to be isolated has been intervened by the storage system and its data backup completed are the remaining failed disks judged in turn against the preset isolation condition constraint, whereupon one may be marked as the second disk to be isolated.
Specifically, it can be understood that, in a view change, if overlapping data copies on at least two failed hard disks are detected where multiple data copies are duplicated, then, to ensure the safety of the data copies, in the process of marking disks for isolation according to the preset isolation constraint conditions, when one failed disk is marked as a disk to be isolated, the remaining failed disks holding the same data copy will not be marked as disks to be isolated: among failed disks holding the same data copy, at most one disk can have its state changed from failure detected to failed disk to be isolated.
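The same-copy exclusion just described can be sketched as a greedy filter. The mapping from a disk to the data-copy ids it holds is an assumed representation, not the patent's data model.

```python
def filter_by_shared_copies(candidates, copies_of):
    """Hypothetical sketch of the same-copy constraint: once one failed disk
    holding a given data copy is marked to be isolated, other failed disks
    holding any of the same copies are skipped until its backup completes.

    candidates: failed disks ordered by time sequence (earliest first)
    copies_of: function mapping a disk to the set of data-copy ids it holds
    """
    marked, covered = [], set()
    for disk in candidates:
        if copies_of(disk) & covered:
            continue  # shares a data copy with an already-marked disk
        marked.append(disk)
        covered |= copies_of(disk)
    return marked
```

Processing candidates in time order means that, among disks sharing a copy, the earliest-failed one is the one marked, consistent with the time sequence rule above.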
In some embodiments of the present application, fig. 5 shows a block diagram of a disk isolation system according to an embodiment of the present application. The system implements the disk isolation method provided in the foregoing embodiments, and may specifically include:
the disk state acquisition module 1: acquiring the working state of a disk of each node in a cluster, wherein the working state comprises a normal state and an abnormal state;
the disk exception storage module 2: when any disk of a node is in an abnormal state, marking the disk as a failed disk, and storing the abnormal state and abnormal data of the failed disk into a central database;
the view state updating module 3: acquiring a view state result of the fault disk according to a preset isolation constraint condition of the fault disk;
disk exception isolation module 4: and under the condition that the view state result indicates that the marked fault disk is the disk to be isolated, the storage system isolates the disk to be isolated and deducts the storage capacity corresponding to the isolated disk.
It can be understood that each functional module of the disk isolation system executes the same flow of steps as the disk isolation method described above, and details are not repeated here.
In some embodiments of the present application, an electronic device is also provided. The electronic device comprises a memory and a processor, wherein the memory is used for storing a processing program, and the processor executes the processing program according to instructions. When the processor executes the processing program, the disk isolation method in the foregoing embodiment is implemented.
In some embodiments of the present application, a readable storage medium is also provided, which may be a non-volatile readable storage medium or a volatile readable storage medium. The readable storage medium has stored therein instructions that, when executed on a computer, cause an electronic device containing such readable storage medium to perform the aforementioned disk isolation method.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
Technical solutions presented herein relate to methods, apparatuses, systems, electronic devices, computer-readable storage media, and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments disclosed herein. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
According to the technical solution provided by the present application, the abnormal states of the disks of the nodes in a storage cluster can be stored in a central database; combining the states of all nodes in the current cluster, failed disks in an abnormal state are processed by multiple nodes, and a failed disk that meets the preset isolation conditions is marked as a disk to be isolated. In each detection period, the current disk storage capacity and the accumulated deduction storage of the whole cluster disk are judged in real time, and the storage system then isolates the failed disk in a timely manner. This effectively gives the storage system early warning before real data loss occurs, allows the data on the disk to be processed in advance, and ensures the safety of the data copies while maintaining good overall IO performance of the cluster.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A disk isolation method, comprising:
acquiring the working state of a disk of each node in a cluster, wherein the working state comprises a normal state and an abnormal state;
under the condition that any one disk of the nodes is in the abnormal state, the disk is marked as a fault disk, and the abnormal state and abnormal data of the fault disk are stored in a central database;
acquiring a view state result of the fault disk according to a preset isolation constraint condition of the fault disk;
under the condition that the view state result marks that the fault disk is a disk to be isolated, the storage system receives a disk isolation instruction and isolates the disk to be isolated;
the preset isolation constraint conditions of the failed disk comprise:
a disk capacity preset constraint condition, wherein the disk capacity preset constraint condition comprises: a constraint of a first preset capacity threshold and a constraint of a second preset capacity threshold; and
presetting view changing conditions, wherein the preset view changing conditions are determined according to view changing actions of the nodes;
acquiring a view state result of the failed disk according to the preset isolation constraint condition of the failed disk comprises the following steps:
acquiring residual storage of the cluster whole disk and accumulated deduction storage of the cluster whole disk;
judging whether the residual storage of the cluster whole disk is larger than a first preset capacity threshold value or not, and marking the fault disk larger than the first preset capacity threshold value as a first abnormal disk;
judging whether the accumulated deduction storage of the cluster whole disk is smaller than a second preset capacity threshold value or not, and marking the first abnormal disk smaller than the second preset capacity threshold value as a second abnormal disk;
judging whether the view change of the node to the second abnormal disk meets a preset view change condition or not, and updating the second abnormal disk meeting the preset view change condition to the view;
and obtaining the view state result of updating the second abnormal disk, and displaying the view state result as the view state result of marking the fault disk as the disk to be isolated.
2. The disk isolation method according to claim 1, wherein the step of changing the view of the second anomalous disk by the node to satisfy the preset view change condition includes:
acquiring a first view state of the view saved by the node and a second view state of the view of which the node executes a view change action on the second abnormal disk in the current period,
and when the first view state is consistent with the second view state, the view change of the node to the second abnormal disk meets the preset view change condition.
3. The disk isolation method according to claim 1, wherein obtaining the view state result of the failed disk according to the preset isolation constraint condition of the failed disk comprises:
and the view change operation executed by the node sequentially updates the views according to non-overlapping periodic accumulated change, and the view state result of the fault disk is obtained.
4. The method of claim 1, wherein the step of marking the failed disk as the disk to be isolated further comprises:
and acquiring the time sequence relation of the abnormal data of the fault disk according to a period, and marking the fault disk meeting a preset isolation constraint condition as the disk to be isolated.
5. The method of claim 1, wherein the step of marking the failed disk as the disk to be isolated further comprises:
under the condition that at least two disks corresponding to the same data copy are marked as the fault disks, any fault disk meeting the preset isolation constraint condition is marked as a first disk to be isolated;
and under the condition that the data copy of the first disk to be isolated is completely backed up, executing marking of a second disk to be isolated.
6. A disk isolation system, which is applied to the disk isolation method according to any one of claims 1 to 5, the disk isolation system specifically includes:
a disk state acquisition module: acquiring the working state of a disk of each node in a cluster, wherein the working state comprises a normal state and an abnormal state;
a disk exception storage module: under the condition that any disk of the node is in the abnormal state, marking the disk as a fault disk, and storing the abnormal state and abnormal data of the fault disk into a central database;
the view state updating module: acquiring a view state result of the fault disk according to a preset isolation constraint condition of the fault disk;
a disk exception isolation module: and under the condition that the view state result marks that the fault disk is the disk to be isolated, the storage system receives a disk isolation instruction and isolates the disk to be isolated.
7. An electronic device, comprising:
a memory for storing a processing program;
a processor implementing the disk isolation method of any one of claims 1 to 5 when executing the handler.
8. A readable storage medium having stored thereon a processing program which, when executed by a processor, implements the disk isolation method according to any one of claims 1 to 5.
CN202210329750.8A 2022-03-31 2022-03-31 Disk isolation method, system, device and storage medium Active CN114741220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210329750.8A CN114741220B (en) 2022-03-31 2022-03-31 Disk isolation method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210329750.8A CN114741220B (en) 2022-03-31 2022-03-31 Disk isolation method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN114741220A CN114741220A (en) 2022-07-12
CN114741220B true CN114741220B (en) 2023-01-13

Family

ID=82278291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210329750.8A Active CN114741220B (en) 2022-03-31 2022-03-31 Disk isolation method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114741220B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116155594B (en) * 2023-02-21 2023-07-14 北京志凌海纳科技有限公司 Isolation method and system for network abnormal nodes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104239A (en) * 2019-11-21 2020-05-05 北京浪潮数据技术有限公司 Hard disk fault processing method, system and device for distributed storage cluster
CN113625945A (en) * 2021-06-25 2021-11-09 济南浪潮数据技术有限公司 Distributed storage slow disk processing method, system, terminal and storage medium
CN113672415A (en) * 2021-07-09 2021-11-19 济南浪潮数据技术有限公司 Disk fault processing method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6957303B2 (en) * 2002-11-26 2005-10-18 Hitachi, Ltd. System and managing method for cluster-type storage
CN105094684B (en) * 2014-04-24 2018-03-09 国际商业机器公司 The method for reusing and system of problem disk in disc array system
CN112905119B (en) * 2021-02-19 2022-10-28 山东英信计算机技术有限公司 Data write-in control method, device and equipment of distributed storage system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104239A (en) * 2019-11-21 2020-05-05 北京浪潮数据技术有限公司 Hard disk fault processing method, system and device for distributed storage cluster
CN113625945A (en) * 2021-06-25 2021-11-09 济南浪潮数据技术有限公司 Distributed storage slow disk processing method, system, terminal and storage medium
CN113672415A (en) * 2021-07-09 2021-11-19 济南浪潮数据技术有限公司 Disk fault processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114741220A (en) 2022-07-12

Similar Documents

Publication Publication Date Title
US20140215262A1 (en) Rebuilding a storage array
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
CN110537170B (en) Method, system and computer readable storage device for analyzing large scale data processing jobs
CN114741220B (en) Disk isolation method, system, device and storage medium
CN108243031B (en) Method and device for realizing dual-computer hot standby
CN104461791A (en) Information processing method and device
CN114860487A (en) Memory fault identification method and memory fault isolation method
CN106375114B (en) A kind of hot plug fault restoration methods and distributed apparatus
US20130080821A1 (en) Proactively removing channel paths in error from a variable scope of i/o devices
US10606490B2 (en) Storage control device and storage control method for detecting storage device in potential fault state
CN108170375B (en) Overrun protection method and device in distributed storage system
JP6880961B2 (en) Information processing device and log recording method
US20140201566A1 (en) Automatic computer storage medium diagnostics
CN111159051B (en) Deadlock detection method, deadlock detection device, electronic equipment and readable storage medium
JP5849491B2 (en) Disk control device, disk device abnormality detection method, and program
CN113625957B (en) Method, device and equipment for detecting hard disk faults
CN110659147A (en) Self-repairing method and system based on module self-checking behavior
KR19990062427A (en) Computer system having failover function and method thereof
CN107273291B (en) Processor debugging method and system
US20220374310A1 (en) Write request completion notification in response to partial hardening of write data
CN109542687B (en) RAID level conversion method and device
CN112084097A (en) Disk warning method and device
CN108231134B (en) RAM yield remediation method and device
JP2012108848A (en) Operation log collection system and program
JP4562641B2 (en) Computer system, operation state determination program, and operation state determination method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant