CN111104283A - Fault detection method, device, equipment and medium of distributed storage system - Google Patents

Publication number
CN111104283A
Authority
CN
China
Prior art keywords
storage system
distributed storage
fault
threshold value
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911207102.XA
Other languages
Chinese (zh)
Other versions
CN111104283B (en)
Inventor
甄天桥
孟祥瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201911207102.XA priority Critical patent/CN111104283B/en
Publication of CN111104283A publication Critical patent/CN111104283A/en
Application granted granted Critical
Publication of CN111104283B publication Critical patent/CN111104283B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3055Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3065Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a fault detection method, apparatus, and device for a distributed storage system, and a computer-readable storage medium. The method comprises: determining a fault threshold value using the calculation rule that corresponds to the storage pool type of the distributed storage system; acquiring, for each node in the distributed storage system, the number of times that node has been reported as being in an abnormal state (the reporting times); and determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value. Because the fault threshold value is derived from the storage pool type rather than fixed in advance, the method avoids the misjudgment that arises when a node whose back-end network has failed reports other nodes as abnormal. This improves the accuracy of fault detection and helps keep the distributed storage system as a whole in normal use.

Description

Fault detection method, device, equipment and medium of distributed storage system
Technical Field
The present invention relates to the field of distributed storage systems, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for detecting a failure in a distributed storage system.
Background
In a distributed storage system, a daemon process (or service) runs on each node to provide access to, and monitoring of, the hard disks in a storage pool. Daemons (or services) on different nodes exchange heartbeat messages to detect whether the peer daemon (or service) is working normally.
Each node is attached to a front-end network and a back-end network. The front-end network carries client traffic, and the back-end network carries message communication and data exchange inside the cluster. To check network connectivity, the daemons on different nodes run heartbeat detection over both the front-end and the back-end network, and each node exchanges messages with the cluster management process over the front-end network. In this arrangement, if the back-end network of an individual node fails (an actual fault node) so that the node cannot communicate with other nodes over the back-end network, the other nodes report the actual fault node to the cluster management process as being in an abnormal state. At the same time, because the actual fault node cannot communicate with the other nodes, it reports those nodes as abnormal through its own front-end network.
In the prior art, a fixed fault threshold value is preset, and a node is determined to be a fault node when the number of times it has been reported as abnormal exceeds that threshold. This approach has a problem. Suppose, for example, that the distributed storage system currently contains two actual fault nodes. The other nodes report the two actual fault nodes as abnormal, and the two actual fault nodes in turn report all other nodes as abnormal. Each of the other nodes is therefore reported at least twice, once by each actual fault node. When that count exceeds the preset threshold value, the cluster management process marks every node as a fault node and the whole cluster becomes unavailable, even though a cluster with only two actual fault nodes may in fact still be usable. The prior-art method therefore raises false alarms when the back-end network of a node fails, which disrupts normal use of the whole distributed storage system.
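The false alarm described above can be reproduced with a small simulation. This is only an illustrative sketch: the node names, the five-node cluster, the fixed threshold of 1, and the helper `count_reports` are all assumptions made for the example and do not come from the patent.

```python
from collections import Counter

def count_reports(nodes, backend_failed):
    """Tally how many times each node is reported as abnormal.

    Assumption: a heartbeat between two nodes fails whenever either of
    them has lost its back-end network, and every failed heartbeat
    produces one report against the peer.
    """
    counts = Counter({n: 0 for n in nodes})
    for reporter in nodes:
        for target in nodes:
            if reporter == target:
                continue
            if reporter in backend_failed or target in backend_failed:
                counts[target] += 1
    return counts

nodes = ["n1", "n2", "n3", "n4", "n5"]
backend_failed = {"n4", "n5"}   # the two actual fault nodes
FIXED_THRESHOLD = 1             # hypothetical preset value

counts = count_reports(nodes, backend_failed)
# Each healthy node is reported twice (once by n4, once by n5).
flagged = sorted(n for n in nodes if counts[n] > FIXED_THRESHOLD)
print(flagged)  # every node ends up flagged, healthy or not
```

With this fixed threshold the three healthy nodes are flagged together with the two nodes that really failed, which is the misjudgment the background section describes.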
Therefore, how to improve the accuracy of fault detection for the distributed storage system and relatively ensure normal use of the distributed storage system is a technical problem that needs to be solved by those skilled in the art at present.
Disclosure of Invention
In view of this, the present invention provides a method for detecting a fault in a distributed storage system, which can improve the accuracy of detecting a fault in a distributed storage system and relatively ensure the normal use of the distributed storage system; another object of the present invention is to provide a failure detection apparatus, device and computer readable storage medium for a distributed storage system, all having the above-mentioned advantages.
In order to solve the above technical problem, the present invention provides a method for detecting a failure of a distributed storage system, including:
determining a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;
acquiring, for each node in the distributed storage system, the reporting times, that is, the number of times the node has been reported as being in an abnormal state;
and determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value.
Preferably, the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:
if the storage pool of the distributed storage system is of a copy type, acquiring a first number of hard disks belonging to the same storage group in the storage pool;
setting a value greater than half the first number as the failure threshold value.
Preferably, the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:
if the storage pool of the distributed storage system is of an erasure correction type, acquiring a second number, namely the number of redundant data blocks calculated from the data blocks in the storage pool;
setting a value greater than the second number as the fault threshold value.
Preferably, when there are a plurality of said storage pools on the first node, further comprising:
respectively acquiring fault threshold values respectively corresponding to a plurality of storage pools on the first node;
setting a maximum value of the plurality of failure threshold values as the failure threshold value of the first node.
Preferably, the process of acquiring the reporting times that each node in the distributed storage system is respectively reported to be in an abnormal state specifically includes:
and acquiring, according to a preset time period, the number of times each node in the distributed storage system has been reported as being in an abnormal state.
Preferably, after determining the failure condition of the distributed storage system according to each of the reporting times and the failure threshold value, the method further includes:
and displaying the number of the nodes which are currently determined as the fault nodes by using a display device.
Preferably, after determining the failure condition of the distributed storage system according to each of the number of reporting times and the failure threshold value, the method further includes:
and sending out corresponding prompt information.
In order to solve the above technical problem, the present invention further provides a failure detection apparatus for a distributed storage system, including:
the threshold value determining module is used for determining a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;
the acquisition module is used for acquiring the number of times each node in the distributed storage system has been reported as being in an abnormal state;
and the fault determining module is used for determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value.
Preferably, when there are a plurality of said storage pools on the first node, further comprising:
a first obtaining module, configured to obtain fault threshold values respectively corresponding to the storage pools on the first node;
a setting module configured to set a maximum value of the plurality of failure threshold values as the failure threshold value of the first node.
Preferably, further comprising:
and the display module is used for displaying the number of the nodes which are currently determined as the fault nodes by using the display device.
Preferably, further comprising:
and the prompt module is used for sending out corresponding prompt information.
In order to solve the above technical problem, the present invention further provides a fault detection device for a distributed storage system, including:
a memory for storing a computer program;
and the processor is used for implementing the steps of any one of the above fault detection methods of the distributed storage system when executing the computer program.
In order to solve the above technical problem, the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the fault detection method of any one of the above distributed storage systems.
In the prior art, a fixed fault threshold value is preset and the fault condition of the distributed storage system is determined against that fixed value. In the present method, by contrast, the fault threshold value is determined by the calculation rule that corresponds to the storage pool type of the storage system, and the fault condition of the distributed storage system is determined according to the reporting times and that fault threshold value. This avoids the misjudgment caused when a node whose back-end network has failed falsely reports other nodes as abnormal, improves the accuracy of fault detection for the distributed storage system, and helps keep the whole distributed storage system in normal use.
In order to solve the above technical problem, the invention also provides a fault detection apparatus, a device, and a computer-readable storage medium for the distributed storage system, all of which have the above beneficial effects.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for detecting a fault in a distributed storage system according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a fault detection apparatus of a distributed storage system according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a fault detection device of a distributed storage system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The core of the embodiment of the invention is to provide a fault detection method of a distributed storage system, which can improve the accuracy of fault detection of the distributed storage system and relatively ensure the normal use of the distributed storage system; another core of the present invention is to provide a failure detection apparatus, a device and a computer-readable storage medium for a distributed storage system, all having the above beneficial effects.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a method for detecting a fault in a distributed storage system according to an embodiment of the present invention. As shown in fig. 1, a method for detecting a failure of a distributed storage system includes:
S10: determining the fault threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system.
Specifically, in this embodiment, the storage pool type of the distributed storage system is determined first. According to the fault tolerance mechanism of the storage pool, the storage pool type is either an erasure correction type (erasure coding) or a copy type (replication). The fault threshold value is then calculated with the calculation rule that corresponds to that storage pool type.
S20: acquiring the number of times each node in the distributed storage system has been reported as being in an abnormal state (the reporting times).
Specifically, when an actual fault node exists in the distributed storage system and its fault lies in the front-end network, the other normal nodes report the actual fault node to the cluster management process as abnormal, but the actual fault node itself cannot report the other normal nodes as abnormal, because its front-end network is down.
This embodiment mainly considers the case in which the fault of the actual fault node lies in the back-end network. The actual fault node then cannot communicate with the other nodes, so the other nodes report it to the cluster management process as abnormal; at the same time, the actual fault node reports the other nodes to the cluster management process as abnormal. The reporting times of each node in the distributed storage system are acquired under these conditions.
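One way the cluster management process might tally the reporting times in step S20 is sketched below. The representation of a report as a `(reporter, target)` pair, the function name `reporting_times`, and the choice to count each reporter at most once per target are assumptions made for illustration; the patent does not specify a data format.

```python
from collections import defaultdict

def reporting_times(reports):
    """Return {target: number of distinct nodes that reported it abnormal}.

    `reports` is an iterable of (reporter, target) pairs.
    Assumption: repeated reports from the same reporter count once.
    """
    reporters_by_target = defaultdict(set)
    for reporter, target in reports:
        if reporter != target:  # a node reporting itself is ignored
            reporters_by_target[target].add(reporter)
    return {t: len(r) for t, r in reporters_by_target.items()}

# n3 has lost its back-end network: n1 and n2 report n3,
# and n3 reports both n1 and n2 through its front-end network.
reports = [
    ("n1", "n3"), ("n2", "n3"),
    ("n3", "n1"), ("n3", "n2"),
    ("n1", "n3"),  # duplicate report, counted once
]
times = reporting_times(reports)
print(times)
```

Here the actual fault node `n3` accumulates two reports while each healthy node accumulates one, which is the asymmetry the later threshold comparison relies on.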
S30: determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value.
After the fault threshold value has been determined and the reporting times of each node have been acquired, the fault threshold value determined from the storage pool type of a node serves as the reference value for that node, and the node's reporting times are compared with it. If the reporting times of a target node are greater than the fault threshold value, the target node is determined to be a fault node; otherwise, the target node remains in the normal state.
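The comparison in step S30 can be sketched as follows. The function name `classify_nodes`, the dictionary inputs, and the status strings are hypothetical. The sketch uses the strict "greater than" comparison stated in this paragraph; other passages of the description read as if reaching the threshold is sufficient, so the comparison operator should be treated as an assumption.

```python
def classify_nodes(reporting_times, thresholds):
    """Mark each node as 'fault' or 'normal'.

    reporting_times: {node: times reported as abnormal}
    thresholds:      {node: fault threshold value for that node}
    Assumption: strict '>' comparison, as worded in step S30.
    """
    status = {}
    for node, threshold in thresholds.items():
        times = reporting_times.get(node, 0)  # never reported means 0
        status[node] = "fault" if times > threshold else "normal"
    return status

reporting_times = {"n1": 1, "n2": 1, "n3": 3}
thresholds = {"n1": 2, "n2": 2, "n3": 2, "n4": 2}
status = classify_nodes(reporting_times, thresholds)
print(status)
```

Only `n3` exceeds its threshold, so it alone is marked as a fault node; `n4`, which was never reported, stays normal.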
In actual operation, after the nodes whose reporting times are greater than the corresponding fault threshold value have been determined to be fault nodes, the number of fault nodes is counted. A preset node threshold, which corresponds to the storage pool type and is used to decide whether a cluster fault has occurred, is then obtained, and the number of fault nodes is compared with it. If the number of fault nodes is greater than the node threshold, a cluster fault is determined to have occurred in the distributed storage system. If not, the storage pool can still be used normally even though it currently contains fault nodes; that is, the cluster is normal.
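The cluster-level decision described above reduces to a single comparison. A minimal sketch, with a hypothetical function name and return strings:

```python
def cluster_status(fault_nodes, node_threshold):
    """Decide whether the cluster as a whole has failed.

    fault_nodes:    collection of nodes already determined to be fault nodes
    node_threshold: preset node threshold for the storage pool type
    """
    if len(fault_nodes) > node_threshold:
        return "cluster fault"
    return "cluster normal"

# Hypothetical pool that tolerates up to 2 fault nodes.
two_down = cluster_status({"n4", "n5"}, node_threshold=2)
three_down = cluster_status({"n3", "n4", "n5"}, node_threshold=2)
print(two_down, "/", three_down)
```

Two fault nodes stay within the tolerance of this hypothetical pool, while a third pushes the count past the node threshold.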
It should be noted that tolerance to a failed node is different for different types of storage pools. Therefore, the node thresholds for different types of storage pools are also different, and are set as follows:
Specifically, in an erasure correction type storage pool, the original data is divided equally into K data blocks according to a specified erasure correction algorithm, and M redundant data blocks are calculated from the K data blocks, giving K + M blocks in total. The K + M blocks are then written to K + M hard disks, one block per hard disk, and the original data can be restored at any time from any K of the K + M blocks. In other words, an erasure correction type storage pool needs at least K intact blocks to restore the original data; otherwise data is lost. The node threshold of an erasure correction type storage pool is therefore (K + M) - K, that is, the number M of redundant data blocks.
That is, as long as the number of fault nodes does not exceed the node threshold M, the cluster can be used normally. If the number of fault nodes exceeds M, for example if it is (M + 1), fewer than K normal nodes remain, the original data cannot be restored, and a cluster fault is therefore determined.
Specifically, in a copy type storage pool, a copy is data that is identical to the original data, and N-copy storage writes the original data to N hard disks so that each hard disk holds one copy identical to those on the other hard disks. A copy type storage pool therefore needs only one normal hard disk: the original data can be obtained from a single intact copy. Thus, the node threshold of a copy type storage pool is (N - 1).
In other words, as long as the number of fault nodes does not exceed the node threshold (N - 1), the cluster can be used normally. If the number of fault nodes exceeds (N - 1), that is, if it is N, fewer than one normal node remains, the original data cannot be obtained, and the cluster fails.
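The two node thresholds derived above, M for an erasure correction pool and N - 1 for a copy pool, can be expressed in one helper. The class names `ErasurePool` and `CopyPool` and the function `node_threshold` are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErasurePool:
    k: int  # number of data blocks
    m: int  # number of redundant blocks

@dataclass(frozen=True)
class CopyPool:
    n: int  # number of copies

def node_threshold(pool):
    """Largest number of fault nodes the pool tolerates without data loss."""
    if isinstance(pool, ErasurePool):
        # K + M blocks are stored and any K suffice, so (K + M) - K = M.
        return (pool.k + pool.m) - pool.k
    if isinstance(pool, CopyPool):
        # One intact copy is enough, so N - 1 copies may be lost.
        return pool.n - 1
    raise TypeError(f"unknown pool type: {type(pool).__name__}")

print(node_threshold(ErasurePool(k=4, m=2)))
print(node_threshold(CopyPool(n=3)))
```

A 4 + 2 erasure correction pool and a 3-copy pool both tolerate two fault nodes under these rules.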
In the prior art, a fixed fault threshold value is preset and the fault condition of the distributed storage system is determined against that fixed value. In the present method, by contrast, the fault threshold value is determined by the calculation rule that corresponds to the storage pool type of the storage system, and the fault condition of the distributed storage system is determined according to the reporting times and that fault threshold value. This avoids the misjudgment caused when a node whose back-end network has failed falsely reports other nodes as abnormal, improves the accuracy of fault detection for the distributed storage system, and helps keep the whole distributed storage system in normal use.
On the basis of the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:
if the storage pool of the distributed storage system is of a copy type, acquiring a first number of hard disks belonging to the same storage group in the storage pool;
a value greater than half the first number is set as the fault threshold value.
Specifically, a first number of hard disks belonging to the same storage group in the storage pool is obtained first. Hard disks belonging to the same storage group are hard disks to which identical data is written, so the first number is the number N of copies. A value greater than half the first number is then set as the fault threshold value; that is, the fault threshold value is greater than N/2. As a preferred embodiment, the fault threshold value of a copy type storage pool may be (N/2 + 1).
It can be understood that, when the back-end network of a node in the distributed storage system is abnormal (an actual fault node), the nodes in the cluster report one another as abnormal. Following the principle that the minority yields to the majority, when (N/2 + 1) nodes report a target node as abnormal, that is, when the number of times the target node is reported as abnormal is (N/2 + 1), the target node is determined to be a fault node; otherwise, the target node remains in the normal state.
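The copy type rule can be sketched as below. The text writes the preferred value as (N/2 + 1) without saying how an odd N is handled; this sketch assumes integer (floor) division, which still yields a value greater than N/2 for every N. The function name is hypothetical.

```python
def copy_fault_threshold(n_copies):
    """Fault threshold for a copy type pool: a strict majority of N.

    Assumption: N/2 is taken as integer division, so the result is the
    smallest integer greater than N/2.
    """
    if n_copies < 1:
        raise ValueError("a copy pool needs at least one copy")
    return n_copies // 2 + 1

for n in (2, 3, 4, 5):
    print(n, "copies ->", copy_fault_threshold(n))
```

Under this reading, 2 or 3 copies give a threshold of 2, and 4 or 5 copies give a threshold of 3.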
If the storage pool of the distributed storage system is of an erasure correction type, acquiring a second quantity of redundant data calculated according to data partitioning in the storage pool;
setting a value greater than the second number as a fault threshold value.
Specifically, a piece of original data is divided equally into K data blocks, M redundant data blocks are calculated from the K data blocks according to the erasure correction algorithm, and the number M of redundant data blocks is taken as the second number. A value greater than the second number is then set as the fault threshold value. As a preferred implementation, this embodiment sets the fault threshold value of an erasure correction type storage pool to (M + 1).
Specifically, for a cluster whose storage pool is a K + M erasure correction pool, a target node is determined to be a fault node if the number of times it is reported as abnormal is greater than (M + 1), so at least (M + 1) nodes must report the target node as abnormal. When M or fewer nodes are actual fault nodes, the other nodes report the actual fault nodes as abnormal, and the actual fault nodes also report the other nodes as abnormal. Because the number of actual fault nodes is less than or equal to M, however, the cluster management process determines only the actual fault nodes to be abnormal and does not treat the other, normal nodes as abnormal, so the normal nodes keep their normal state. Since the number of actual fault nodes is within M and the number of normal nodes is greater than or equal to K, the cluster can still work properly.
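The argument above can be checked with a small simulation of a 3 + 2 erasure correction pool in which exactly M = 2 nodes lose their back-end network. The cluster layout (one node per block), the node names, and both helper functions are assumptions made for the example; the comparison is the strict "greater than" used in this paragraph.

```python
def erasure_fault_threshold(m_redundant):
    """Preferred fault threshold for an erasure correction pool: M + 1."""
    return m_redundant + 1

def abnormal_counts(nodes, backend_failed):
    """Times each node is reported abnormal.

    Assumption: a heartbeat between two nodes fails when either of them
    has lost its back-end network, producing one report per failed pair.
    """
    return {
        target: sum(
            1
            for reporter in nodes
            if reporter != target
            and (reporter in backend_failed or target in backend_failed)
        )
        for target in nodes
    }

K, M = 3, 2
nodes = [f"n{i}" for i in range(1, K + M + 1)]  # n1 .. n5
backend_failed = {"n4", "n5"}                    # M actual fault nodes

threshold = erasure_fault_threshold(M)
counts = abnormal_counts(nodes, backend_failed)
fault_nodes = {n for n in nodes if counts[n] > threshold}
print(counts, fault_nodes)
```

Each healthy node is reported only M = 2 times, which stays below the threshold of 3, while each actual fault node is reported 4 times, so only the two nodes that really failed are flagged.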
Therefore, according to the method for setting the fault threshold value provided by the embodiment, the cluster fault can be prevented from being mistakenly reported when the distributed storage system can normally work.
On the basis of the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, when there are multiple storage pools on the first node, the method further includes:
respectively acquiring fault threshold values respectively corresponding to a plurality of storage pools on a first node;
setting a maximum value of the plurality of failure threshold values to a failure threshold value of the first node.
It can be understood that, in actual operation, a node in the distributed storage system (referred to as the first node) may host a plurality of storage pools. In that case, the maximum of the fault threshold values of those storage pools is set as the fault threshold value of the first node, and the first node is determined to be a fault node when the number of times it is reported as abnormal exceeds that maximum value.
Taking erasure correction type storage pools as an example, suppose two erasure correction pools exist at the same time, the first with 2 + 1 redundancy (K = 2, M = 1) and the second with 3 + 2 redundancy (K = 3, M = 2). The redundancy of the second storage pool is used as the standard, so M + 1 = 2 + 1 = 3, and the first node is determined to be a fault node when 3 nodes report it as abnormal.
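The worked example above, with a 2 + 1 pool and a 3 + 2 pool on the same node, can be written out as a short sketch. The tuple representation of a pool and the function name `node_fault_threshold` are hypothetical.

```python
def node_fault_threshold(pool_thresholds):
    """Fault threshold of a node hosting several pools: the maximum."""
    if not pool_thresholds:
        raise ValueError("the node hosts no storage pool")
    return max(pool_thresholds)

# (K, M) of the two erasure correction pools on the first node.
pools = [(2, 1), (3, 2)]
per_pool = [m + 1 for _, m in pools]   # M + 1 for each pool
first_node_threshold = node_fault_threshold(per_pool)
print(per_pool, "->", first_node_threshold)
```

The per-pool thresholds are 2 and 3, and the larger value, 3, becomes the fault threshold of the first node, matching the example in the text.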
As can be seen, the method for setting the fault threshold value correspondingly when a plurality of storage pools are located on a node is further considered in this embodiment, so that the accuracy of detecting the fault condition of the distributed storage system can be further improved.
On the basis of the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, the process of obtaining the reporting times that each node in the distributed storage system is respectively reported as an abnormal state specifically includes:
and acquiring, according to a preset time period, the number of times each node in the distributed storage system has been reported as being in an abnormal state.
Specifically, in this embodiment the reporting times of each node in the distributed storage system are acquired according to a preset time period. It can be understood that reporting times are generally acquired over a certain time interval; this embodiment acquires the reporting times of each node within each preset time period. The length of the preset time period is not limited in this embodiment and is set according to the actual situation.
Acquiring the reporting times according to a preset time period keeps the known state of each node in the distributed storage system up to date, so that the fault condition of the distributed storage system can be determined in a timely manner.
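One possible reading of the preset time period is a fixed window over timestamped reports. The sketch below assumes that reading; the event format `(timestamp, reporter, target)`, the half-open window, and the function name are all illustrative assumptions, since the embodiment leaves these details open.

```python
def reporting_times_in_window(events, window_start, window_length):
    """Count reports per target inside [window_start, window_start + length).

    events: iterable of (timestamp_seconds, reporter, target) tuples.
    """
    window_end = window_start + window_length
    counts = {}
    for timestamp, _reporter, target in events:
        if window_start <= timestamp < window_end:
            counts[target] = counts.get(target, 0) + 1
    return counts

events = [
    (0, "n1", "n3"),
    (5, "n2", "n3"),
    (12, "n3", "n1"),
    (19, "n1", "n3"),
    (20, "n2", "n3"),   # falls into the next window
]
window_counts = reporting_times_in_window(events, window_start=10, window_length=10)
print(window_counts)
```

Only the reports stamped 12 and 19 fall inside the window starting at 10, so older reports no longer influence the current decision.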
On the basis of the foregoing embodiment, the present embodiment further describes and optimizes the technical solution, and specifically, after determining the fault condition of the distributed storage system according to each reporting number and the fault threshold value, the present embodiment further includes:
and displaying the number of the nodes which are currently determined as the fault nodes by using a display device.
Specifically, in this embodiment, because the reporting times of the abnormal state reported by each node in the distributed storage system are obtained according to the preset time period, the number of nodes that are determined to be a failed node may change. Therefore, the present embodiment further displays the number of nodes using a display device.
Specifically, the display device may be a Thin Film Transistor (TFT) liquid crystal display, an Ultra Fine and Bright (UFB) liquid crystal display, an Organic Light-Emitting Diode (OLED) display, or the like, which is not limited in this embodiment.
In addition, the method for displaying the number of the nodes is not limited in this embodiment, and for example, the number of the nodes may be displayed in a text, image or animation manner, which is specifically set according to actual requirements.
Therefore, in the embodiment, the number of the nodes which are currently determined as the fault nodes is further displayed by the display device, so that a user can directly know the number of the nodes which are currently determined as the fault nodes in the distributed storage system, and the use experience of the user is further improved.
On the basis of the foregoing embodiment, the present embodiment further describes and optimizes the technical solution, and specifically, after determining the fault condition of the distributed storage system according to each reporting number and the fault threshold value, the present embodiment further includes:
and sending out corresponding prompt information.
Specifically, in this embodiment, after the fault condition of the distributed storage system has been determined according to each reporting number and the fault threshold value, if a cluster fault has occurred in the distributed storage system, a prompting device is further used to send out corresponding prompt information. The information prompts an operator that the current distributed storage system cluster is unavailable and cannot be read or written, or that data loss may result.
It should be noted that, in this embodiment, specific types of the prompting devices for sending out corresponding prompting information are not limited, and as a preferred embodiment, the prompting devices may be buzzers and/or indicator lights, and corresponding prompting information is set according to different operating states of the prompting devices.
Therefore, in the embodiment, the corresponding prompt information is sent out after the cluster fault of the distributed storage system is further determined, so that the fault condition of the distributed storage system can be timely and effectively prompted to an operator, and the use experience of a user is further improved.
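As an illustration only, the mapping from a determined cluster fault to the operating states of a buzzer and an indicator light might be sketched as follows. The names `PromptState` and `build_prompt`, the light colours, and the message wording are hypothetical; the embodiment does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptState:
    buzzer_on: bool  # hypothetical: the buzzer sounds only on a cluster fault
    light: str       # hypothetical indicator colours: "red" or "green"
    message: str     # text shown or logged for the operator


def build_prompt(cluster_fault: bool) -> PromptState:
    """Map the determined fault condition to a prompting-device state."""
    if cluster_fault:
        return PromptState(
            buzzer_on=True,
            light="red",
            message="Cluster unavailable: reads and writes may fail "
                    "and data may be lost.",
        )
    return PromptState(
        buzzer_on=False,
        light="green",
        message="Cluster operating normally.",
    )


print(build_prompt(True).message)
```

Keeping the mapping in one pure function means a different prompting device can be substituted by changing only the code that consumes `PromptState`.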
The embodiments of the fault detection method for a distributed storage system are described in detail above. The present invention further provides a fault detection apparatus, a fault detection device, and a computer-readable storage medium corresponding to the method.
Fig. 2 is a structural diagram of a fault detection apparatus of a distributed storage system according to an embodiment of the present invention. As shown in Fig. 2, the fault detection apparatus of the distributed storage system includes:
a threshold value determining module 21, configured to determine a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;
an obtaining module 22, configured to obtain the number of times each node in the distributed storage system is reported as being in an abnormal state;
a fault determining module 23, configured to determine the fault condition of the distributed storage system according to each number of reports and the fault threshold value.
The fault detection apparatus of the distributed storage system provided by this embodiment of the invention has the same beneficial effects as the fault detection method described above.
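The three modules can be sketched as three small functions. This is a minimal sketch under stated assumptions, not the patented implementation: the threshold rules follow claims 2 and 3 ("a value greater than half the first number" and "a value greater than the second number"), but choosing the smallest such integer, treating a node as faulty when its report count reaches the threshold, and all names (`StoragePool`, `determine_threshold`, `count_reports`, `determine_faulty_nodes`) are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class StoragePool:
    pool_type: str             # "replica" or "erasure" (hypothetical labels)
    group_disks: int = 0       # replica pool: hard disks in one storage group
    redundant_blocks: int = 0  # erasure pool: amount of redundant data


def determine_threshold(pool: StoragePool) -> int:
    """Threshold value determining module (21)."""
    if pool.pool_type == "replica":
        # Assumption: smallest integer strictly greater than half the disks.
        return pool.group_disks // 2 + 1
    if pool.pool_type == "erasure":
        # Assumption: smallest integer strictly greater than the redundancy.
        return pool.redundant_blocks + 1
    raise ValueError(f"unknown storage pool type: {pool.pool_type!r}")


def count_reports(reported_nodes: Iterable[str]) -> Dict[str, int]:
    """Obtaining module (22): count abnormal-state reports per node."""
    return dict(Counter(reported_nodes))


def determine_faulty_nodes(report_counts: Dict[str, int],
                           threshold: int) -> List[str]:
    """Fault determining module (23).

    Assumption: a node counts as faulty once its number of reports
    reaches the threshold; the comparison operator is not taken from
    the source text.
    """
    return sorted(node for node, count in report_counts.items()
                  if count >= threshold)


pool = StoragePool("replica", group_disks=3)
threshold = determine_threshold(pool)
counts = count_reports(["node-a", "node-b", "node-a", "node-c", "node-a"])
faulty = determine_faulty_nodes(counts, threshold)
print(threshold, faulty)
```

Separating the threshold calculation from the comparison keeps the pool-type rule in one place, so a new pool type needs only a new branch in `determine_threshold`.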
In a preferred embodiment, when there are multiple storage pools on a first node, the apparatus further comprises:
a first acquisition module, configured to acquire the fault threshold values respectively corresponding to the multiple storage pools on the first node;
a setting module, configured to set the maximum of the multiple fault threshold values as the fault threshold value of the first node.
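The rule for a node hosting several storage pools reduces to taking a maximum, as the following sketch shows. The function name `node_fault_threshold` and the error raised for an empty input are hypothetical.

```python
from typing import Iterable


def node_fault_threshold(pool_thresholds: Iterable[int]) -> int:
    """Return the fault threshold for a node that hosts several pools.

    The node-level threshold is the largest of the per-pool thresholds.
    """
    thresholds = list(pool_thresholds)
    if not thresholds:
        # Hypothetical guard: the source does not cover a node without pools.
        raise ValueError("node hosts no storage pool")
    return max(thresholds)


# Example: a node with three pools whose thresholds are 2, 3 and 2.
print(node_fault_threshold([2, 3, 2]))
```

Using the maximum means the node is judged against the most tolerant of its pools, so it is flagged only after more abnormal-state reports than any single pool would require.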
In a preferred embodiment, the apparatus further comprises:
a display module, configured to display, by using a display device, the number of nodes currently determined to be fault nodes.
In a preferred embodiment, the apparatus further comprises:
a prompt module, configured to send out corresponding prompt information.
Fig. 3 is a structural diagram of a fault detection device of a distributed storage system according to an embodiment of the present invention. As shown in Fig. 3, the fault detection device of the distributed storage system includes:
a memory 31, configured to store a computer program;
a processor 32, configured to implement the steps of the above fault detection method of the distributed storage system when executing the computer program.
The fault detection device of the distributed storage system provided by this embodiment of the invention has the same beneficial effects as the fault detection method described above.
In order to solve the above technical problem, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above fault detection method of the distributed storage system.
The computer-readable storage medium provided by this embodiment of the invention has the same beneficial effects as the fault detection method described above.
The fault detection method, apparatus, device, and computer-readable storage medium for a distributed storage system provided by the present invention are described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention; they are set forth only to help the reader understand the method of the present invention and its core ideas. It should be noted that those skilled in the art may make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the parts that are the same or similar may be cross-referenced between embodiments. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Claims (10)

1. A fault detection method for a distributed storage system, comprising:
determining a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;
acquiring the number of times each node in the distributed storage system is reported as being in an abnormal state;
and determining the fault condition of the distributed storage system according to each number of reports and the fault threshold value.
2. The method according to claim 1, wherein determining the fault threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically comprises:
if the storage pool of the distributed storage system is of a replica type, acquiring a first number of hard disks belonging to the same storage group in the storage pool;
setting a value greater than half of the first number as the fault threshold value.
3. The method according to claim 2, wherein determining the fault threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically comprises:
if the storage pool of the distributed storage system is of an erasure-code type, acquiring a second number of pieces of redundant data calculated from the data blocks in the storage pool;
setting a value greater than the second number as the fault threshold value.
4. The method according to claim 3, wherein when there are multiple storage pools on a first node, the method further comprises:
acquiring the fault threshold values respectively corresponding to the multiple storage pools on the first node;
setting the maximum of the multiple fault threshold values as the fault threshold value of the first node.
5. The method according to claim 1, wherein acquiring the number of times each node in the distributed storage system is reported as being in an abnormal state specifically comprises:
acquiring, according to a preset time period, the number of times each node in the distributed storage system is reported as being in an abnormal state.
6. The method according to claim 5, wherein after determining the fault condition of the distributed storage system according to each number of reports and the fault threshold value, the method further comprises:
displaying, by using a display device, the number of nodes currently determined to be fault nodes.
7. The method according to any one of claims 1 to 6, wherein after determining the fault condition of the distributed storage system according to each number of reports and the fault threshold value, the method further comprises:
sending out corresponding prompt information.
8. A fault detection apparatus for a distributed storage system, comprising:
a threshold value determining module, configured to determine a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;
an acquisition module, configured to acquire the number of times each node in the distributed storage system is reported as being in an abnormal state;
and a fault determining module, configured to determine the fault condition of the distributed storage system according to each number of reports and the fault threshold value.
9. A fault detection device for a distributed storage system, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the fault detection method of a distributed storage system according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the fault detection method of a distributed storage system according to any one of claims 1 to 7.
CN201911207102.XA 2019-11-29 2019-11-29 Fault detection method, device, equipment and medium of distributed storage system Active CN111104283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911207102.XA CN111104283B (en) 2019-11-29 2019-11-29 Fault detection method, device, equipment and medium of distributed storage system

Publications (2)

Publication Number Publication Date
CN111104283A true CN111104283A (en) 2020-05-05
CN111104283B CN111104283B (en) 2022-04-22

Family

ID=70420956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911207102.XA Active CN111104283B (en) 2019-11-29 2019-11-29 Fault detection method, device, equipment and medium of distributed storage system

Country Status (1)

Country Link
CN (1) CN111104283B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162699A (en) * 2020-09-18 2021-01-01 北京浪潮数据技术有限公司 Data reading and writing method, device and equipment and computer readable storage medium
CN112162699B (en) * 2020-09-18 2023-12-22 北京浪潮数据技术有限公司 Data reading and writing method, device, equipment and computer readable storage medium
TWI789075B (en) * 2021-10-26 2023-01-01 中華電信股份有限公司 Electronic device and method for detecting abnormal execution of application program
CN114443332A (en) * 2021-12-24 2022-05-06 苏州浪潮智能科技有限公司 Storage pool detection method and device, electronic equipment and storage medium
CN114443332B (en) * 2021-12-24 2024-01-09 苏州浪潮智能科技有限公司 Storage pool detection method and device, electronic equipment and storage medium
CN114780442A (en) * 2022-06-22 2022-07-22 杭州悦数科技有限公司 Testing method and device for distributed system

Also Published As

Publication number Publication date
CN111104283B (en) 2022-04-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant