CN111104283A - Fault detection method, device, equipment and medium of distributed storage system

Info

Publication number: CN111104283A
Application number: CN201911207102.XA
Authority: CN
Inventors: 甄天桥; 孟祥瑞
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-05-05
Anticipated expiration: 2039-11-29
Also published as: CN111104283B

Abstract

The application discloses a fault detection method, a fault detection device, equipment and a computer readable storage medium of a distributed storage system, wherein the method comprises the following steps: determining a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system; acquiring the reporting times of each node in the distributed storage system which is reported as an abnormal state respectively; and determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value. Therefore, the fault threshold value in the method is determined by using the corresponding calculation rule according to the storage pool type of the storage system, so that the fault condition of the distributed storage system is determined according to the reporting times and the fault threshold value, the fault misjudgment caused by the fault node of the back-end network reporting the abnormality of other nodes can be avoided, the fault detection accuracy of the distributed storage system is improved, and the normal use of the whole distributed storage system is relatively guaranteed.

Description

Fault detection method, device, equipment and medium of distributed storage system

Technical Field

The present invention relates to the field of distributed storage systems, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for detecting a failure in a distributed storage system.

Background

In the distributed storage system, a daemon process (or service) is arranged on each node to provide access, monitoring and the like for a hard disk in a storage pool; and detecting whether the daemon (or service) of the opposite terminal is normal or not through the heartbeat message among the daemons (or services) among different nodes.

Each node comprises a front-end network and a back-end network, wherein the front-end network serves client traffic, and the back-end network is used for message communication and data interaction within the cluster; in order to detect the connectivity of the network, the daemon processes on the nodes carry out heartbeat detection on the front-end network and the back-end network simultaneously; and each node exchanges messages with the cluster management process through the front-end network. In this case, if the back-end network of an individual node fails (an actual failed node), so that the node cannot communicate with other nodes over the back-end network, the other nodes report the actual failed node to the cluster management process as being in an abnormal state; and because the actual failed node cannot communicate with the other nodes either, it in turn reports the other nodes as being in an abnormal state through its own front-end network.

In the prior art, a fixed fault threshold value is preset, and when the number of times that a certain node is reported as being in an abnormal state exceeds the fault threshold value, the node is determined to be a fault node. However, such a method has a problem: for example, suppose that there are two actual failed nodes in the current distributed storage system; the other nodes will report that the two actual failed nodes are abnormal, and the two actual failed nodes will also report that all other nodes are abnormal. Each of the other nodes is therefore reported by at least the two actual failed nodes, and if the number of times that each of the other nodes is reported as abnormal thereby exceeds the preset threshold value, the cluster management process sets all the nodes as fault nodes, causing the whole cluster to be unavailable. In reality, the cluster may still be available with only two actual failed nodes. Therefore, in the fault detection method for the distributed storage system in the prior art, false alarms occur when the back-end network of a node fails, so that the normal use of the whole distributed storage system is affected.
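The false alarm described above can be reproduced with a small simulation. The following is only an illustrative sketch: the five node names, the fixed threshold of 1, and the assumption that every node exchanges heartbeats with every other node are hypothetical and are not taken from the application.

```python
def simulate_reports(nodes, backend_failed):
    """Count how often each node is reported as abnormal.

    Assumption: a node whose back-end network has failed cannot reach any
    peer, so every pair that involves such a node yields one report in
    each direction.
    """
    counts = {n: 0 for n in nodes}
    for reporter in nodes:
        for target in nodes:
            if reporter == target:
                continue
            if reporter in backend_failed or target in backend_failed:
                counts[target] += 1
    return counts


nodes = ["n1", "n2", "n3", "n4", "n5"]
counts = simulate_reports(nodes, backend_failed={"n4", "n5"})

FIXED_THRESHOLD = 1  # hypothetical fixed value in the style of the prior art
flagged = sorted(n for n, c in counts.items() if c > FIXED_THRESHOLD)
print(counts)
print(flagged)
```

Under these assumptions each healthy node is reported twice (once by each actual failed node), so a fixed threshold of 1 marks all five nodes as faulty even though only two have actually failed.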

Therefore, how to improve the accuracy of fault detection for the distributed storage system and relatively ensure normal use of the distributed storage system is a technical problem that needs to be solved by those skilled in the art at present.

Disclosure of Invention

In view of this, the present invention provides a method for detecting a fault in a distributed storage system, which can improve the accuracy of detecting a fault in a distributed storage system and relatively ensure the normal use of the distributed storage system; another object of the present invention is to provide a failure detection apparatus, device and computer readable storage medium for a distributed storage system, all having the above-mentioned advantages.

In order to solve the above technical problem, the present invention provides a method for detecting a failure of a distributed storage system, including:

determining a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;

acquiring the reporting times of each node in the distributed storage system which is reported as an abnormal state respectively;

and determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value.

Preferably, the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:

if the storage pool of the distributed storage system is of a copy type, acquiring a first number of hard disks belonging to the same storage group in the storage pool;

setting a value greater than half the first number as the failure threshold value.

if the storage pool of the distributed storage system is of an erasure correction type, acquiring a second quantity of redundant data calculated according to data partitioning in the storage pool;

setting a value greater than the second number as the fault threshold value.

Preferably, when there are a plurality of said storage pools on the first node, further comprising:

respectively acquiring fault threshold values respectively corresponding to a plurality of storage pools on the first node;

setting a maximum value of the plurality of failure threshold values as the failure threshold value of the first node.

Preferably, the process of acquiring the reporting times that each node in the distributed storage system is respectively reported to be in an abnormal state specifically includes:

and acquiring the reporting times of the abnormal state reported by each node in the distributed storage system according to a preset time period.

Preferably, after determining the failure condition of the distributed storage system according to each of the reporting times and the failure threshold value, the method further includes:

and displaying the number of the nodes which are currently determined as the fault nodes by using a display device.

Preferably, after determining the failure condition of the distributed storage system according to each of the number of reporting times and the failure threshold value, the method further includes:

and sending out corresponding prompt information.

In order to solve the above technical problem, the present invention further provides a failure detection apparatus for a distributed storage system, including:

the threshold value determining module is used for determining a fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system;

the acquisition module is used for acquiring the reporting times of the abnormal states reported by each node in the distributed storage system;

and the fault determining module is used for determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value.

Preferably, when there are a plurality of the storage pools on the first node, further comprising:

a first obtaining module, configured to obtain fault threshold values respectively corresponding to the storage pools on the first node;

a setting module configured to set a maximum value of the plurality of failure threshold values as the failure threshold value of the first node.

Preferably, further comprising:

and the display module is used for displaying the number of the nodes which are currently determined as the fault nodes by using the display device.

Preferably, further comprising:

and the prompt module is used for sending out corresponding prompt information.

In order to solve the above technical problem, the present invention further provides a fault detection device for a distributed storage system, including:

a memory for storing a computer program;

and the processor is used for realizing the steps of the fault detection method of any one distributed storage system when executing the computer program.

In order to solve the above technical problem, the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the fault detection method of any one of the above distributed storage systems.

Compared with the prior art, in which a fixed fault threshold value is preset and the fault condition of the distributed storage system is determined according to that fixed value, the fault threshold value in the method is determined by using the corresponding calculation rule according to the storage pool type of the storage system, and the fault condition of the distributed storage system is then determined according to the reporting times and the fault threshold value. Fault misjudgment caused by a node with a failed back-end network falsely reporting other nodes as abnormal can thus be avoided, the fault detection accuracy of the distributed storage system is improved, and the normal use of the whole distributed storage system is relatively guaranteed.

In order to solve the technical problem, the invention also provides a fault detection device, equipment and a computer readable storage medium of the distributed storage system, which have the beneficial effects.

Drawings

In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a method for detecting a fault in a distributed storage system according to an embodiment of the present invention;

Fig. 2 is a structural diagram of a fault detection apparatus of a distributed storage system according to an embodiment of the present invention;

Fig. 3 is a structural diagram of a fault detection device of a distributed storage system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The core of the embodiment of the invention is to provide a fault detection method of a distributed storage system, which can improve the accuracy of fault detection of the distributed storage system and relatively ensure the normal use of the distributed storage system; another core of the present invention is to provide a failure detection apparatus, a device and a computer-readable storage medium for a distributed storage system, all having the above beneficial effects.

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flowchart of a method for detecting a fault in a distributed storage system according to an embodiment of the present invention. As shown in fig. 1, a method for detecting a failure of a distributed storage system includes:

S10: determining the fault threshold value by using a corresponding calculation rule according to the storage pool type of the distributed storage system.

Specifically, in this embodiment, the storage pool type of the distributed storage system is determined first; according to the storage fault tolerance mechanism of the storage pool, the storage pool type is either an erasure correction type or a copy type. Then, the failure threshold value is calculated by using the calculation rule corresponding to that storage pool type.

S20: acquiring the reporting times that each node in the distributed storage system is respectively reported as being in an abnormal state.

Specifically, when an actual fault node exists in the distributed storage system and the fault lies in its front-end network, the other normal nodes report the actual fault node to the cluster management process as being in an abnormal state, while the actual fault node, because its front-end network has failed, cannot report the other normal nodes as abnormal.

This embodiment mainly considers the case in which the fault of the actual fault node lies in its back-end network. In this case the actual fault node cannot communicate with the other nodes, and the other nodes report the actual fault node to the cluster management process as being in an abnormal state; meanwhile, the actual fault node reports the other nodes to the cluster management process as being in an abnormal state. The reporting times that each node in the distributed storage system is reported as being in an abnormal state are then obtained.

S30: determining the fault condition of the distributed storage system according to the reporting times and the fault threshold value.

After the failure threshold value has been determined according to the storage pool type corresponding to the node, and the reporting times that each node in the distributed storage system is reported as being in an abnormal state have been obtained, the failure threshold value is used as a reference value and the reporting times of each node are compared with it; if the reporting times of a certain target node are larger than the fault threshold value, the target node is judged to be a fault node; otherwise, the target node is kept in the normal state.

In actual operation, after the nodes whose reporting times are larger than the corresponding fault threshold value have been judged to be fault nodes, the number of fault nodes is determined; then, a preset node threshold that corresponds to the storage pool type and is used for determining whether a cluster fault has occurred is acquired, and it is judged whether the number of fault nodes is greater than the node threshold; if so, it is judged that a cluster fault has occurred in the distributed storage system; if not, the storage pool can still be used normally although fault nodes exist in it, that is, the cluster is normal.

It should be noted that tolerance to a failed node is different for different types of storage pools. Therefore, the node thresholds for different types of storage pools are also different, and are set as follows:

Specifically, in an erasure correction type storage pool, according to a specified erasure correction algorithm, the original data is equally divided into K data blocks, and M blocks of redundant data are then calculated from the K data blocks, so that K + M blocks of data are finally obtained; the K + M blocks are then written to K + M hard disks, each hard disk storing a different block, and the original data can be restored at any time from any K of the K + M blocks. In other words, for the erasure correction type storage pool, at least K normal blocks are needed to restore the original data; otherwise the data is lost. Therefore, the node threshold of the erasure correction type storage pool is (K + M) - K, i.e., the number M of redundant blocks.

That is, as long as the number of fault nodes does not exceed the node threshold M, the cluster can be used normally; if the number of fault nodes exceeds M, for example if it is (M + 1), the original data cannot be restored because the number of normal nodes is less than K, and a cluster fault is therefore determined.

Specifically, in the storage pool of the copy type, a copy is data that is completely identical to the original data, and N-copy storage writes the original data to N hard disks, each of which holds a copy of the data identical to that on the other hard disks. Therefore, for the copy type storage pool, only one normal hard disk is needed, since the original data can be acquired from a single normal copy. Thus, the node threshold for the copy type storage pool is (N - 1).

In other words, as long as the number of fault nodes does not exceed the node threshold (N - 1), the cluster can be used normally; if the number of fault nodes exceeds (N - 1), that is, if it is N, the original data cannot be acquired because the number of normal nodes is less than 1, and the cluster therefore fails.
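The two node thresholds and the cluster-level judgement can be sketched as follows. The `Pool` class, its field names and the function names are hypothetical and serve only as an illustration; the sketch also assumes that each block or copy resides on a different node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    kind: str   # "erasure" or "copy" (hypothetical labels)
    k: int = 0  # number of data blocks K (erasure correction type)
    m: int = 0  # number of redundant blocks M (erasure correction type)
    n: int = 0  # number of copies N (copy type)


def node_threshold(pool: Pool) -> int:
    """Largest number of fault nodes the pool tolerates."""
    if pool.kind == "erasure":
        return (pool.k + pool.m) - pool.k  # equals M
    if pool.kind == "copy":
        return pool.n - 1
    raise ValueError(f"unknown pool kind: {pool.kind}")


def cluster_fault(failed_node_count: int, pool: Pool) -> bool:
    """A cluster fault is judged when the fault nodes exceed the node threshold."""
    return failed_node_count > node_threshold(pool)


ec_pool = Pool(kind="erasure", k=4, m=2)
copy_pool = Pool(kind="copy", n=3)
print(node_threshold(ec_pool), cluster_fault(2, ec_pool), cluster_fault(3, ec_pool))
print(node_threshold(copy_pool), cluster_fault(2, copy_pool), cluster_fault(3, copy_pool))
```

For a 4+2 erasure correction pool the sketch tolerates two fault nodes and judges a cluster fault at three; for a 3-copy pool it tolerates two fault nodes and judges a cluster fault only when all three have failed.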

Compared with the prior-art method of presetting a fixed fault threshold value and determining the fault condition of the distributed storage system according to that fixed value, the fault threshold value in the method is determined by using the corresponding calculation rule according to the storage pool type of the storage system, and the fault condition of the distributed storage system is then determined according to the reporting times and the fault threshold value. Fault misjudgment caused by a node with a failed back-end network falsely reporting other nodes as abnormal can thus be avoided, the fault detection accuracy of the distributed storage system is improved, and the normal use of the whole distributed storage system is relatively guaranteed.

On the basis of the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:

if the storage pool of the distributed storage system is of a copy type, acquiring a first number of hard disks belonging to the same storage group in the storage pool;

setting a value greater than half the first number as the fault threshold value.

Specifically, a first number of hard disks belonging to the same storage group in the storage pool is obtained first, where the hard disks belonging to the same storage group are the hard disks to which identical data is written, so that the first number is the number N of copies; a value greater than half the first number is then set as the fault threshold value, that is, the fault threshold value is greater than N/2. As a preferred embodiment, the failure threshold value corresponding to the copy type storage pool may be (N/2 + 1).

It can be understood that, when the back-end network of a certain node in the distributed storage system is abnormal (an actual fault node), the nodes in the cluster report one another as being in an abnormal state. According to the principle that the minority obeys the majority, when (N/2 + 1) nodes report a certain target node as being in an abnormal state, that is, when the number of times that the target node is reported as abnormal is (N/2 + 1), the target node is determined to be a fault node; otherwise, the target node is kept in a normal state.
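As a worked illustration of the preferred value (N/2 + 1), the sketch below computes the threshold for a few copy counts. The text does not state how N/2 is rounded when N is odd; integer (floor) division is assumed here, and the function name is hypothetical.

```python
def copy_fault_threshold(n_copies: int) -> int:
    """Preferred fault threshold for a copy type pool: N/2 + 1.

    Assumption: N/2 is taken as integer (floor) division, which still
    yields a value strictly greater than half of N.
    """
    return n_copies // 2 + 1


for n in (2, 3, 4, 5):
    print(n, copy_fault_threshold(n))
```

With floor division the result is always a strict majority of the N copies, which matches the majority principle described above.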

if the storage pool of the distributed storage system is of an erasure correction type, acquiring a second number of redundant data calculated according to data partitioning in the storage pool;

setting a value greater than the second number as the fault threshold value.

Specifically, a piece of original data is equally divided into K data blocks, M blocks of redundant data are calculated from the K data blocks according to an erasure correction algorithm, and the number M of redundant blocks is taken as the second number; a value greater than the second number is then set as the failure threshold value. As a preferred implementation, this embodiment sets the failure threshold value of the erasure correction type storage pool to (M + 1).

Specifically, for a cluster whose storage pool is an erasure correction pool of K + M, a target node is determined to be a failed node when the number of times that it is reported as being in an abnormal state reaches (M + 1), that is, (M + 1) nodes are required to report the target node as abnormal. When M or fewer nodes are actual failed nodes, the other nodes report the actual failed nodes as abnormal, and the actual failed nodes also report the other nodes as abnormal; but since the number of actual failed nodes is less than or equal to M, each normal node is reported at most M times. The cluster management process therefore determines only the actual failed nodes to be abnormal and does not regard the normal nodes as abnormal, so the state of the normal nodes remains normal. Since the number of actual failed nodes is within M and the number of normal nodes is greater than or equal to K, the cluster can still work properly.

Therefore, according to the method for setting the fault threshold value provided by the embodiment, the cluster fault can be prevented from being mistakenly reported when the distributed storage system can normally work.
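The effect of the (M + 1) threshold can be checked with a small simulation. This is a sketch under stated assumptions, not the implementation of the application: the 4+2 layout, the node names, one block per node, all-to-all heartbeats, and the reading that a node is judged failed once its report count reaches the threshold are all assumptions made for illustration.

```python
def count_reports(nodes, backend_failed):
    """Count, per node, how many peers report it as abnormal.

    Assumption: any pair that involves a node with a failed back-end
    network produces one report in each direction.
    """
    counts = {n: 0 for n in nodes}
    for reporter in nodes:
        for target in nodes:
            if reporter != target and (
                reporter in backend_failed or target in backend_failed
            ):
                counts[target] += 1
    return counts


def erasure_fault_threshold(m: int) -> int:
    """Preferred fault threshold for an erasure correction pool: M + 1."""
    return m + 1


K, M = 4, 2
nodes = [f"n{i}" for i in range(1, K + M + 1)]
actual_failed = {"n5", "n6"}  # M actual failed nodes

counts = count_reports(nodes, actual_failed)
threshold = erasure_fault_threshold(M)
# Assumption: "reaches the threshold" is used as the comparison.
judged_failed = {n for n, c in counts.items() if c >= threshold}
print(counts, threshold, sorted(judged_failed))
```

Under these assumptions each normal node is reported only M = 2 times and stays normal, while each actual failed node is reported 5 times and is judged failed, leaving K = 4 normal nodes so that the data remains recoverable.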

On the basis of the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, when there are multiple storage pools on the first node, the method further includes:

respectively acquiring fault threshold values respectively corresponding to a plurality of storage pools on a first node;

setting a maximum value of the plurality of failure threshold values to a failure threshold value of the first node.

It can be understood that, in actual operation, a plurality of storage pools may exist on a certain node (referred to as a first node) in the distributed storage system. In this case, the maximum value of the failure threshold values corresponding to these storage pools is set as the failure threshold value of the first node, and when the number of times that the first node is reported as being in an abnormal state reaches this maximum failure threshold value, the first node is determined to be a failed node.

Taking erasure correction type storage pools as an example, if two erasure correction pools exist at the same time, the first with 2+1 redundancy (K = 2, M = 1) and the second with 3+2 redundancy (K = 3, M = 2), the redundancy of the second storage pool is used as the standard: its failure threshold value is M + 1 = 2 + 1 = 3, so when 3 nodes report that the first node is in an abnormal state, the first node is determined to be a failed node.

As can be seen, the method for setting the fault threshold value correspondingly when a plurality of storage pools are located on a node is further considered in this embodiment, so that the accuracy of detecting the fault condition of the distributed storage system can be further improved.
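The 2+1 and 3+2 example can be written out as a short calculation. The function names are hypothetical; the per-pool value (M + 1) is the preferred erasure correction threshold given in this description.

```python
def erasure_fault_threshold(m: int) -> int:
    """Preferred fault threshold of one erasure correction pool: M + 1."""
    return m + 1


def node_fault_threshold(pool_thresholds) -> int:
    """Fault threshold of a node that hosts several pools: the maximum."""
    return max(pool_thresholds)


# Two erasure correction pools on the first node, given as (K, M).
pools = [(2, 1), (3, 2)]
per_pool = [erasure_fault_threshold(m) for _, m in pools]
first_node_threshold = node_fault_threshold(per_pool)
print(per_pool, first_node_threshold)
```

The per-pool thresholds are 2 and 3, so the first node uses 3, the value dictated by the 3+2 pool.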

On the basis of the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, the process of obtaining the reporting times that each node in the distributed storage system is respectively reported as an abnormal state specifically includes:

Specifically, in this embodiment, the reporting times that each node in the distributed storage system is reported as being in an abnormal state are obtained according to a preset time period. It can be understood that the reporting times of a node are generally counted over a certain time span; in this embodiment, the counting is repeated once per preset time period, so that the reporting times of each node within each period are obtained. The length of the preset time period is not limited in this embodiment and is set according to the actual situation.

Therefore, the reporting times of the abnormal state reported by each node in the distributed storage system are obtained according to the preset time period, the obtained condition of each node in the distributed storage system can be updated in time, and the fault condition in the distributed storage system can be further determined in time.
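One possible way to count reports per preset time period is sketched below. The event format (timestamp, reporter, target), the period length of 10 time units, and the choice to count each reporting node at most once per period are assumptions made for illustration; the description only states that the reporting times are obtained per preset time period.

```python
from collections import defaultdict


def counts_in_period(events, period_start, period_length):
    """Return, per target node, the number of distinct reporters whose
    reports fall inside [period_start, period_start + period_length)."""
    reporters = defaultdict(set)
    period_end = period_start + period_length
    for timestamp, reporter, target in events:
        if period_start <= timestamp < period_end:
            reporters[target].add(reporter)
    return {target: len(who) for target, who in reporters.items()}


# Hypothetical report events: (timestamp, reporter, target).
events = [
    (0.5, "n1", "n3"),
    (1.0, "n2", "n3"),
    (1.2, "n1", "n3"),   # repeated report from n1 in the same period
    (3.0, "n3", "n1"),
    (12.0, "n2", "n3"),  # falls into the next period
]
first_period = counts_in_period(events, period_start=0.0, period_length=10.0)
second_period = counts_in_period(events, period_start=10.0, period_length=10.0)
print(first_period, second_period)
```

Counting distinct reporters rather than raw messages keeps a single node that repeats its report from pushing a target over the threshold on its own; counting raw messages would be an equally valid reading of the text.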

On the basis of the foregoing embodiment, the present embodiment further describes and optimizes the technical solution, and specifically, after determining the fault condition of the distributed storage system according to each reporting number and the fault threshold value, the present embodiment further includes:

Specifically, in this embodiment, because the reporting times of the abnormal state reported by each node in the distributed storage system are obtained according to the preset time period, the number of nodes that are determined to be a failed node may change. Therefore, the present embodiment further displays the number of nodes using a display device.

Specifically, the display device may be a Thin Film Transistor (TFT) liquid crystal display, an Ultra Fine and Bright (UFB) liquid crystal display, an Organic Light-Emitting Diode (OLED) display, or the like, which is not limited in this embodiment.

In addition, the method for displaying the number of the nodes is not limited in this embodiment, and for example, the number of the nodes may be displayed in a text, image or animation manner, which is specifically set according to actual requirements.

Therefore, in the embodiment, the number of the nodes which are currently determined as the fault nodes is further displayed by the display device, so that a user can directly know the number of the nodes which are currently determined as the fault nodes in the distributed storage system, and the use experience of the user is further improved.

and sending out corresponding prompt information.

Specifically, in this embodiment, after the fault condition of the distributed storage system has been determined according to the reporting times and the fault threshold value, if a cluster fault has occurred in the distributed storage system, a prompting device is further used to send out corresponding prompt information, so as to inform an operator that the current distributed storage system cluster is unavailable, that it cannot be read or written, or that data loss may occur.

It should be noted that, in this embodiment, specific types of the prompting devices for sending out corresponding prompting information are not limited, and as a preferred embodiment, the prompting devices may be buzzers and/or indicator lights, and corresponding prompting information is set according to different operating states of the prompting devices.

Therefore, in the embodiment, the corresponding prompt information is sent out after the cluster fault of the distributed storage system is further determined, so that the fault condition of the distributed storage system can be timely and effectively prompted to an operator, and the use experience of a user is further improved.

The above detailed description is given for the embodiment of the method for detecting a failure in a distributed storage system, and the present invention further provides an apparatus, a device, and a computer-readable storage medium for detecting a failure in a distributed storage system corresponding to the method.

Fig. 2 is a structural diagram of a fault detection apparatus of a distributed storage system according to an embodiment of the present invention, and as shown in fig. 2, the fault detection apparatus of the distributed storage system includes:

a threshold value determining module 21, configured to determine a failure threshold value according to a storage pool type of the distributed storage system by using a corresponding calculation rule;

an obtaining module 22, configured to obtain reporting times that each node in the distributed storage system is respectively reported to be in an abnormal state;

and the fault determining module 23 is configured to determine a fault condition of the distributed storage system according to each reporting time and the fault threshold value.

The fault detection device of the distributed storage system provided by the embodiment of the invention has the beneficial effects of the fault detection method of the distributed storage system.

In a preferred embodiment, when there are multiple storage pools on the first node, the apparatus further comprises:

the first acquisition module is used for respectively acquiring fault threshold values respectively corresponding to a plurality of storage pools on a first node;

a setting module, configured to set a maximum value of the multiple failure threshold values as a failure threshold value of the first node.

As a preferred embodiment, further comprising:

and the prompt module is used for sending out corresponding prompt information.

Fig. 3 is a structural diagram of a fault detection device of a distributed storage system according to an embodiment of the present invention, and as shown in fig. 3, the fault detection device of the distributed storage system includes:

a memory 31 for storing a computer program;

a processor 32, configured to implement the steps of the fault detection method of the distributed storage system when executing the computer program.

The fault detection equipment of the distributed storage system provided by the embodiment of the invention has the beneficial effects of the fault detection method of the distributed storage system.

In order to solve the above technical problem, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the fault detection method of the distributed storage system.

The computer-readable storage medium provided by the embodiment of the invention has the beneficial effects of the fault detection method of the distributed storage system.

The method, apparatus, device and computer readable storage medium for detecting faults of a distributed storage system provided by the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are set forth only to help understand the method and its core ideas of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Claims

1. A method for fault detection in a distributed storage system, comprising:

2. The method according to claim 1, wherein the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:

3. The method according to claim 2, wherein the process of determining the failure threshold value by using the corresponding calculation rule according to the storage pool type of the distributed storage system specifically includes:

setting a value greater than the second number as the fault threshold value.

4. The method of claim 3, wherein when there are multiple storage pools on the first node, further comprising:

5. The method according to claim 1, wherein the process of obtaining the reporting times of the abnormal states reported by the nodes in the distributed storage system includes:

6. The method of claim 5, wherein after determining the failure condition of the distributed storage system according to each of the number of reports and the failure threshold, the method further comprises:

7. The method according to any one of claims 1 to 6, wherein after determining the failure condition of the distributed storage system according to each of the number of reports and the failure threshold value, the method further comprises:

and sending out corresponding prompt information.

8. A failure detection apparatus for a distributed storage system, comprising:

9. A failure detection apparatus of a distributed storage system, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method of fault detection of a distributed storage system as claimed in any one of claims 1 to 7 when executing said computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of fault detection of a distributed storage system according to any one of claims 1 to 7.