CN109274544B

CN109274544B - Fault detection method and device for distributed storage system

Info

Publication number: CN109274544B
Application number: CN201811511589.6A
Authority: CN
Inventors: 许银龙
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2021-06-29
Anticipated expiration: 2038-12-11
Also published as: CN109274544A

Abstract

The invention discloses a fault detection method for a distributed storage system. While a storage node invokes a process to perform a data operation on a disk, the method monitors the execution state of that process and determines that the disk has failed when the execution state is abnormal. Because the process that performs the operation returns an execution state parameter, and that parameter reflects whether the storage node performed the data operation on the disk successfully, a disk failure signal can be captured reactively from the execution state of the process, without heartbeat detection. This saves computing resources on the storage node and avoids the difficulty of setting the heartbeat interval accurately. In addition, the invention provides a fault detection device for a distributed storage system and a distributed storage system, whose functions correspond to those of the method.

Description

Fault detection method and device for distributed storage system

Technical Field

The present invention relates to the field of storage, and in particular, to a method and an apparatus for detecting a failure of a distributed storage system, and a distributed storage system.

Background

With the rise and popularization of cloud computing technology, distributed storage systems are attracting more and more attention from the industry. A traditional network storage system uses a centralized storage server to store all data; that server becomes the bottleneck of system performance and the focal point for reliability and security, and it cannot meet the needs of large-scale storage applications. A distributed network storage system adopts a scalable architecture and uses multiple storage servers to share the storage load, which improves the reliability, availability and access efficiency of the system and makes it easy to expand. For distributed storage, the stability and reliability of the cluster are crucial.

When a disk or a storage node of a distributed system fails, the stability and reliability of the cluster are often greatly affected, and the storage system may become temporarily unavailable. Current distributed storage systems usually detect faults by heartbeat detection: when a heartbeat times out, a fault is judged to have occurred and is then handled.

In fact, in most cases sending heartbeat packets is unnecessary, because what actually matters is the storage node or disk that has failed; heartbeat detection therefore occupies computing resources on the storage node and the monitoring node unnecessarily. In addition, choosing the heartbeat detection interval is troublesome: if the interval is too short, misjudgment occurs easily and causes system oscillation; if the interval is long, a failure takes a long time to detect, so the storage system is affected for a long time.

Disclosure of Invention

The invention aims to provide a fault detection method and device for a distributed storage system, and a distributed storage system, to solve two problems of the traditional approach of detecting faults by heartbeat detection: it occupies computing resources on storage nodes and monitoring nodes unnecessarily, and the heartbeat detection interval is difficult to set accurately.

In order to solve the above technical problem, the present invention provides a method for detecting a failure of a distributed storage system, where the distributed storage system includes storage nodes, and the storage nodes include disks for storing data, and the method includes:

monitoring the execution state of a process while a storage node invokes the process to perform a data operation on a disk;

judging whether the execution state is abnormal;

and if the execution state is abnormal, determining that the disk has failed.

Optionally, before monitoring the execution state of the process while the storage node invokes the process to perform the data operation on the disk, the method further includes:

monitoring, in response to an operation request sent by a client to the storage node, the network connection state between the client and the storage node;

judging whether the network connection state is abnormal;

and if the network connection state is abnormal, determining that the storage node has failed.

Optionally, after determining that the disk has failed because the execution state is abnormal, or after determining that the storage node has failed because the network connection state is abnormal, the method further includes:

sending fault prompt information to the monitoring node in the distributed storage system.

Optionally, monitoring the network connection state between the client and the storage node in response to an operation request sent by the client to the storage node includes:

in response to the operation request sent by the client to the storage node, disconnecting the heartbeat connection between the storage node and the monitoring node, and monitoring the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request.

Optionally, after sending the fault prompt information to the monitoring node in the distributed storage system, the method further includes:

updating the real-time state of the storage node or the real-time state of the disk recorded in the monitoring node, and reallocating the storage node or the disk that executes the operation request.

In addition, the present invention also provides a fault detection apparatus for a distributed storage system, where the distributed storage system includes a storage node, the storage node includes a disk for storing data, and the apparatus includes:

an execution state monitoring module: configured to monitor the execution state of a process while the storage node invokes the process to perform a data operation on the disk;

an execution state judgment module: configured to judge whether the execution state is abnormal;

a disk failure determination module: configured to determine that the disk has failed if the execution state is abnormal.

Optionally, the apparatus further comprises:

a network connection state monitoring module: configured to monitor, in response to an operation request sent by a client to the storage node, the network connection state between the client and the storage node;

a network connection state judgment module: configured to judge whether the network connection state is abnormal;

a node fault determination module: configured to determine that the storage node has failed if the network connection state is abnormal.

Optionally, the apparatus further comprises:

a fault prompting module: configured to send fault prompt information to the monitoring node in the distributed storage system.

Finally, the invention also provides a distributed storage system, which comprises a storage node, wherein the storage node is used for calling a process to execute data operation on a disk, monitoring the execution state of the process in the execution process, and judging that the disk fails when the execution state is abnormal.

Optionally, the distributed storage system further includes a monitoring node, where the storage node is configured to send a failure prompt message to the monitoring node when it is determined that the disk fails.

The fault detection method provided by the invention is applied to a distributed storage system that includes storage nodes, each of which includes disks for storing data. While a storage node invokes a process to perform a data operation on a disk, the method monitors the execution state of the process, judges whether the execution state is abnormal, and determines that the disk has failed when it is. Because the process that performs the operation returns an execution state parameter, and that parameter reflects whether the storage node performed the data operation on the disk successfully, a disk failure signal can be captured reactively from the execution state of the process, without heartbeat detection. This saves computing resources on the storage node and avoids the difficulty of setting the heartbeat interval accurately.

In addition, the invention also provides a fault detection device for a distributed storage system and a distributed storage system, whose functions correspond to those of the method and are not described again here.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a first embodiment of the fault detection method for a distributed storage system according to the present invention;

Fig. 2 is a flowchart of a second embodiment of the fault detection method for a distributed storage system according to the present invention;

Fig. 3 is a functional block diagram of an embodiment of the fault detection apparatus for a distributed storage system according to the present invention;

Fig. 4 is a block diagram of a distributed storage system according to the present invention.

Detailed Description

The core of the invention is to provide a fault detection method and device for a distributed storage system, and a distributed storage system, which capture a disk failure signal reactively from the execution state of a process while a storage node performs a data operation on a disk, thereby saving computing resources on the storage node and avoiding the difficulty of setting the heartbeat interval accurately.

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, a first embodiment of the fault detection method for a distributed storage system according to the present invention is described below:

Step S101: monitoring the execution state of a process while the storage node invokes the process to perform a data operation on the disk.

The distributed storage system related to the embodiment comprises a client, a storage node and a monitoring node, wherein the client is used for sending an operation request to the storage node, the storage node is used for executing corresponding operation on data in a disk according to the operation request, and the monitoring node is used for detecting the fault of the storage node. The data operation in the above steps includes, but is not limited to, a read operation and a write operation, the process refers to a process for performing the read operation or the write operation, and the execution state of the process refers to a status parameter reflecting whether the process can successfully perform the read operation or the write operation on the disk.

Step S102: judging whether the execution state is abnormal.

As described above, the execution state of the process in this embodiment is a state parameter reflecting whether the process can successfully perform a read or write operation on the disk. When the execution state satisfies a preset condition, the state of the process is considered abnormal, that is, a fault such as disk removal or sector damage is considered to have occurred on the disk.

Step S103: if the execution state is abnormal, determining that the disk has failed.

After the disk is determined to have failed, a fault prompt signal can be generated and sent to the monitoring node. The disk fault is thus discovered by signal capture and reported to the monitoring node, so that the monitoring node can handle the failed disk. This avoids having the storage node periodically broadcast the disk state to the monitoring node, or having the monitoring node periodically probe the storage node, which saves computing resources on both nodes, allows disk faults to be discovered reactively, and improves fault-handling efficiency.
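As a non-limiting illustration, steps S101 to S103 might be sketched in Python as follows. The function names, the report format, and the choice of errno values treated as a disk fault are assumptions made for this sketch only; the embodiment does not specify them.

```python
import errno

# Assumption: these errno values are treated as signs of a disk fault.
# Other errors (for example a missing file) are not classified as disk faults.
DISK_FAULT_ERRNOS = {errno.EIO, errno.ENODEV, errno.ENXIO, errno.EROFS}


def run_disk_operation(operation):
    """S101: run a data operation and capture its execution state.

    Returns 0 on success, or the errno carried by the raised OSError.
    """
    try:
        operation()
        return 0
    except OSError as exc:
        return exc.errno or errno.EIO


def is_abnormal(state):
    """S102: judge whether the execution state is abnormal."""
    return state in DISK_FAULT_ERRNOS


def detect_disk_fault(operation, report):
    """S103: if the state is abnormal, judge the disk failed and report it."""
    state = run_disk_operation(operation)
    if is_abnormal(state):
        report({"type": "disk_fault", "errno": state})
        return True
    return False
```

Here `report` stands in for whatever channel delivers the fault prompt signal to the monitoring node; no periodic probing is involved, since the check runs only when a data operation is actually performed.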

This embodiment thus provides a fault detection method for a distributed storage system in which, while a storage node invokes a process to perform a data operation on a disk, the execution state of the process is monitored, judged, and used to determine that the disk has failed when it is abnormal. Because the process that performs the operation returns an execution state parameter, and that parameter reflects whether the storage node performed the data operation on the disk successfully, a disk failure signal can be captured reactively from the execution state of the process, without heartbeat detection. This saves computing resources on the storage node and avoids the difficulty of setting the heartbeat interval accurately.

The second embodiment of the fault detection method for a distributed storage system provided by the invention builds on the first embodiment and extends it.

Specifically, the first embodiment detects only disk failures. In practice, however, a storage node itself may also fail, for example through power failure or a network anomaly. The second embodiment addresses this case. Referring to Fig. 2, it includes:

Step S201: monitoring, in response to an operation request sent by a client to the storage node, the network connection state between the client and the storage node.

Specifically, in response to an operation request sent by the client to the storage node, the heartbeat connection between the storage node and the monitoring node is disconnected, and the network connection state between the client and the storage node is monitored until the storage node finishes executing the operation corresponding to the operation request. In other fault scenarios, the normal heartbeat connection is kept, and fault handling is performed when the heartbeat connection becomes abnormal. The storage node may specifically be an Object Storage Device (OSD).
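One way to picture the heartbeat being disconnected only for the duration of a client operation is the following minimal Python sketch. The `Heartbeat` class and its in-flight counter are hypothetical; the counter is an assumption added so that overlapping requests do not restore the heartbeat too early.

```python
import threading
from contextlib import contextmanager


class Heartbeat:
    """Hypothetical heartbeat link between a storage node and the monitoring node."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0  # number of client operations currently executing

    @property
    def active(self):
        """The heartbeat is kept only while no client operation is executing."""
        with self._lock:
            return self._in_flight == 0

    @contextmanager
    def suspended(self):
        """Disconnect the heartbeat until the wrapped operation finishes."""
        with self._lock:
            self._in_flight += 1
        try:
            yield
        finally:
            # Restore the heartbeat whether or not the operation succeeded.
            with self._lock:
                self._in_flight -= 1
```

A request handler would wrap the operation in `with heartbeat.suspended():`, and a heartbeat sender would check `heartbeat.active` before emitting each packet.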

Step S202: judging whether the network connection state is abnormal; if so, proceeding to step S203; otherwise, proceeding to step S204.

If reads and writes between the client and the storage node are abnormal and the read-write anomalies exceed a certain threshold, the storage node can be judged abnormal (power failure or network anomaly), and a storage node fault signal is actively reported to the monitoring node.
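The threshold test described above might be sketched as follows. The embodiment does not say how the threshold is measured; this sketch assumes it is a count of consecutive read-write anomalies, and the class name and default value are hypothetical.

```python
class ConnectionMonitor:
    """Counts client read-write anomalies for one storage node (S201 to S203)."""

    def __init__(self, threshold=3):
        # Assumption: the threshold is a count of consecutive anomalies.
        self.threshold = threshold
        self.anomalies = 0

    def record(self, ok):
        """Record one request outcome.

        Returns True once the anomaly count exceeds the threshold,
        i.e. the storage node is judged to have failed.
        """
        self.anomalies = 0 if ok else self.anomalies + 1
        return self.anomalies > self.threshold
```

Resetting the counter on every successful request is a design choice of this sketch: it keeps isolated transient errors from accumulating into a false node-fault judgment.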

Step S203: determining that the storage node has failed, and proceeding to step S207.

Step S204: monitoring the execution state of a process while the storage node invokes the process to perform a data operation on the disk.

Step S205: judging whether the execution state is abnormal; if so, proceeding to step S206.

That is, during normal reading and writing, if a disk experiences a read-write anomaly (disk removal, sector damage, etc.), a read-write anomaly signal is issued before the OSD process corresponding to that disk exits. By capturing this signal and actively reporting it to the monitoring node, the anomaly can be handled by the monitoring node.
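The ordering described here, in which the anomaly is reported before the OSD process exits, can be illustrated with the short Python sketch below. The function `osd_write`, its parameters and the report format are hypothetical and are not taken from any real OSD implementation.

```python
import sys


def osd_write(path, data, report, exit_process=sys.exit):
    """Write data; on an I/O error, report the anomaly, then exit the process.

    `report` and `exit_process` are injected so the ordering can be observed;
    by default the process really exits with status 1.
    """
    try:
        with open(path, "wb") as handle:
            handle.write(data)
    except OSError as exc:
        # Report first, so the monitoring node learns of the fault
        # before this (hypothetical) OSD process goes away.
        report({"type": "disk_fault", "path": path, "errno": exc.errno})
        exit_process(1)
```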

Step S206: determining that the disk has failed.

Step S207: sending fault prompt information to the monitoring node.

Step S208: updating the real-time state of the storage node or the real-time state of the disk recorded in the monitoring node, and reallocating the storage node or the disk that executes the operation request.

If a disk fault signal is received, the monitoring node can immediately remove the failed disk from the storage cluster, so that normal reading and writing of the cluster are ensured; if a node fault signal is received, the monitoring node can immediately remove the storage node from the cluster and actively switch the IP of the read-write service node to a healthy node.
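The monitoring node's handling of the two kinds of fault signal might look like the following sketch. The cluster map, the report fields and the rule for choosing the next service node are assumptions for illustration; the embodiment does not define them.

```python
class MonitoringNode:
    """Hypothetical cluster view held by the monitoring node (S207 and S208)."""

    def __init__(self, cluster):
        # cluster maps a node name to the disks it currently has in service.
        self.cluster = {node: set(disks) for node, disks in cluster.items()}
        # Assumption: the first node in the map serves client reads and writes.
        self.service_node = next(iter(self.cluster), None)

    def handle_report(self, report):
        """Update the cluster state and return the node now serving requests."""
        node = report["node"]
        if report["type"] == "disk_fault":
            # Remove only the failed disk; the node itself stays in the cluster.
            self.cluster.get(node, set()).discard(report["disk"])
        elif report["type"] == "node_fault":
            # Remove the whole node and move the service to a healthy node.
            self.cluster.pop(node, None)
            if self.service_node == node:
                self.service_node = next(iter(self.cluster), None)
        return self.service_node
```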

It can be seen that, in the fault detection method of this embodiment, a disk failure is reported to the monitoring node by signal capture, which speeds up fault handling. A node failure (power failure or network anomaly) is detected with the help of the connection between the client and the cluster: when the read-write anomalies exceed a certain threshold, the failure is reported to the monitoring node, which actively removes the failed node from the cluster. This shortens the handling time in node failure scenarios and improves the reliability and stability of the whole cluster.
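The decision order of steps S201 to S207 can be condensed into one function, shown here as a minimal sketch. The callables `connection_abnormal`, `run_operation` and `report` are placeholders assumed for this illustration; step S208 runs on the monitoring node and is therefore outside this function.

```python
def handle_request(connection_abnormal, run_operation, report):
    """Walk one client request through the checks of the second embodiment."""
    # S201 to S203: the network connection is checked before any disk access.
    if connection_abnormal():
        report("node_fault")  # S207
        return "node_fault"
    # S204 to S206: the execution state of the disk operation is checked next.
    try:
        run_operation()
    except OSError:
        report("disk_fault")  # S207
        return "disk_fault"
    return "ok"
```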

A fault detection apparatus for a distributed storage system according to an embodiment of the present invention is introduced below. The apparatus described below and the fault detection method described above may be read with reference to each other.

The distributed storage system to which this apparatus embodiment relates includes storage nodes, and the storage nodes include disks for storing data. As shown in Fig. 3, the apparatus embodiment includes:

an execution state monitoring module 301: configured to monitor the execution state of a process while the storage node invokes the process to perform a data operation on the disk.

An execution state judgment module 302: configured to judge whether the execution state is abnormal.

A disk failure determination module 303: configured to determine that the disk has failed if the execution state is abnormal.

As an optional implementation, the apparatus further comprises:

a network connection state monitoring module 304: configured to monitor, in response to an operation request sent by a client to the storage node, the network connection state between the client and the storage node.

A network connection state judgment module 305: configured to judge whether the network connection state is abnormal.

A node fault determination module 306: configured to determine that the storage node has failed if the network connection state is abnormal.

As an optional implementation, the apparatus further comprises:

a fault prompting module 307: configured to send fault prompt information to the monitoring node in the distributed storage system.

As can be seen, the apparatus embodiment corresponds to the foregoing method embodiments; for example, the execution state monitoring module 301, the execution state judgment module 302 and the disk failure determination module 303 implement steps S101, S102 and S103 of the fault detection method, respectively. For specific implementations, reference may therefore be made to the descriptions of the corresponding method embodiments, which are not repeated here.
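The correspondence between modules and steps might be pictured as a small composition of callables, as in the sketch below. The class `DiskFaultDetector` and its field names are hypothetical; only the module numbers in the comments come from Fig. 3.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiskFaultDetector:
    monitor_state: Callable[[], int]    # module 301: yields the execution state
    is_abnormal: Callable[[int], bool]  # module 302: judges the state
    prompt: Callable[[str], None]       # module 307 (optional): notifies the monitor

    def run(self) -> bool:
        """Module 303: determine a disk failure when the state is abnormal."""
        failed = self.is_abnormal(self.monitor_state())
        if failed:
            self.prompt("disk_fault")
        return failed
```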

In addition, since the fault detection apparatus of this embodiment implements the foregoing fault detection method, its function corresponds to that of the method, and details are not repeated here.

In addition, the present invention further provides an embodiment of a distributed storage system. The distributed storage system described below and the fault detection method described above may be read with reference to each other.

As shown in fig. 4, the distributed storage system includes: a plurality of storage nodes 401, a monitoring node 402, a client 403. The basic functions of each component are: the client 403 is configured to send an operation request to the storage node 401 through the public network, the storage node 401 is configured to invoke a process to perform corresponding data operation to a disk in response to the operation request, and the monitoring node 402 is configured to monitor a state of each storage node 401 through the public network.

In this embodiment, the storage node 401 is configured to monitor the execution state of a process while invoking the process to perform a data operation on a disk, determine that the disk has failed when the execution state is abnormal, and send a disk failure notification message to the monitoring node 402.

In addition, the storage node 401 is further configured to monitor the network connection state between the client 403 and the storage node 401 in response to an operation request sent by the client 403 to the storage node 401, determine that the storage node 401 has failed when the network connection state is abnormal, and report a node fault notification message to the monitoring node 402. Specifically, in response to an operation request sent by the client 403, the storage node 401 disconnects the heartbeat connection between the storage node 401 and the monitoring node 402, and monitors the network connection state between the client 403 and the storage node 401 until the storage node 401 finishes executing the operation corresponding to the operation request.

As an optional implementation manner, the monitoring node 402 is configured to, after receiving the failure notification message, update a real-time status of a storage node 401 or a real-time status of a disk in the monitoring node 402, and reallocate the storage node 401 or the disk that executes the operation request.

Since the distributed storage system of this embodiment implements the foregoing fault detection method, its specific implementation may be found in the description of the method embodiments above, and its function corresponds to that of the method; details are not repeated here.

The embodiments are described in a progressive manner: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to each other. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is brief, and the relevant points can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The method, the device and the distributed storage system for detecting the fault of the distributed storage system provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A method for fault detection in a distributed storage system, the distributed storage system comprising storage nodes, the storage nodes comprising disks for storing data, the method comprising:

monitoring the execution state of a process while a storage node invokes the process to perform a data operation on a disk;

judging whether the execution state is abnormal;

if the execution state is abnormal, determining that the disk has failed;

before monitoring the execution state of the process while the storage node invokes the process to perform the data operation on the disk, the method further comprises:

monitoring, in response to an operation request sent by a client to the storage node, the network connection state between the client and the storage node;

judging whether the network connection state is abnormal;

if the network connection state is abnormal, determining that the storage node has failed;

wherein monitoring the network connection state between the client and the storage node in response to the operation request sent by the client to the storage node specifically comprises:

in response to the operation request sent by the client to the storage node, disconnecting the heartbeat connection between the storage node and a monitoring node, and monitoring the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request;

after determining that the disk has failed because the execution state is abnormal, or after determining that the storage node has failed because the network connection state is abnormal, the method further comprises:

sending fault prompt information to the monitoring node in the distributed storage system.

2. The method of claim 1, wherein after sending the fault prompt information to the monitoring node in the distributed storage system, the method further comprises:

updating the real-time state of the storage node or the real-time state of the disk recorded in the monitoring node, and reallocating the storage node or the disk that executes the operation request.

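3. An apparatus for fault detection in a distributed storage system, the distributed storage system comprising storage nodes, the storage nodes comprising disks for storing data, the apparatus comprising:

an execution state monitoring module: configured to monitor the execution state of a process while a storage node invokes the process to perform a data operation on a disk;

an execution state judgment module: configured to judge whether the execution state is abnormal;

a disk failure determination module: configured to determine that the disk has failed if the execution state is abnormal;

the apparatus further comprises:

a network connection state monitoring module: configured to monitor, in response to an operation request sent by a client to the storage node, the network connection state between the client and the storage node;

a network connection state judgment module: configured to judge whether the network connection state is abnormal;

a node fault determination module: configured to determine that the storage node has failed if the network connection state is abnormal;

the network connection state monitoring module is specifically configured to:

in response to the operation request sent by the client to the storage node, disconnect the heartbeat connection between the storage node and a monitoring node, and monitor the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request;

the apparatus further comprises:

a fault prompting module: configured to send fault prompt information to the monitoring node in the distributed storage system.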

4. A distributed storage system, comprising storage nodes, wherein the distributed storage system is configured to: monitor, in response to an operation request sent by a client to a storage node, the network connection state between the client and the storage node; judge whether the network connection state is abnormal; and, if the network connection state is abnormal, determine that the storage node has failed; and wherein, if the network connection state is normal, the storage node is configured to invoke a process to perform a data operation on a disk, monitor the execution state of the process during execution, and determine that the disk has failed when the execution state is abnormal;

the distributed storage system further comprises a monitoring node, and the storage node is configured to send fault prompt information to the monitoring node when the disk is determined to have failed.