CN109274544B - Fault detection method and device for distributed storage system - Google Patents


Info

Publication number
CN109274544B
Authority
CN
China
Prior art keywords
storage node
monitoring
node
network connection
storage
Legal status
Active
Application number
CN201811511589.6A
Other languages
Chinese (zh)
Other versions
CN109274544A (en)
Inventor
许银龙
Current Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201811511589.6A
Publication of CN109274544A
Application granted
Publication of CN109274544B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault detection method for a distributed storage system. While a storage node invokes a process to perform a data operation on a disk, the method monitors the execution state of that process and determines that the disk has failed when the execution state is abnormal. Because the process performing the operation returns an execution state parameter that reflects whether the storage node completed the data operation on the disk successfully, a disk-failure signal can be captured responsively from that state without heartbeat detection. This saves computing resources on the storage node and avoids the difficulty of setting an accurate heartbeat interval. The invention further provides a fault detection device for the distributed storage system, and a distributed storage system, whose functions correspond to the method.

Description

Fault detection method and device for distributed storage system
Technical Field
The present invention relates to the field of storage, and in particular, to a method and an apparatus for detecting a failure of a distributed storage system, and a distributed storage system.
Background
With the rise and spread of cloud computing, distributed storage systems are attracting increasing attention from industry. A traditional network storage system stores all data on a centralized storage server, which becomes both the performance bottleneck of the system and the focal point of its reliability and security, and cannot meet the needs of large-scale storage applications. A distributed network storage system uses a scalable architecture in which multiple storage servers share the storage load; this improves the reliability, availability and access efficiency of the system and also makes it easy to expand. For distributed storage, the stability and reliability of the cluster are crucial.
When a disk or storage node of a distributed system fails, the stability and reliability of the cluster are often severely affected, and the storage system may become temporarily unavailable. Current distributed storage systems usually detect faults by heartbeat detection: when no heartbeat arrives within the timeout, a fault is judged to have occurred and is then handled.
In practice, most heartbeat packets are unnecessary, because what actually matters is only the storage node or disk that has failed; heartbeat detection therefore occupies unnecessary computing resources on both the storage nodes and the monitoring node. Setting the heartbeat timeout is also troublesome: if it is too short, transient delays are easily misjudged as faults, causing system oscillation; if it is too long, a real fault takes a long time to detect, and the storage system is affected for a long time.
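The timeout trade-off described above can be made concrete with a minimal Python sketch of a conventional timeout-based heartbeat detector, i.e. the prior-art approach rather than the method of the invention. The function name, the assumption that the node was last heard from at t = 0, and the heartbeat times in the example are hypothetical.

```python
from typing import List, Optional


def first_suspected_failure(heartbeats: List[float], timeout: float,
                            now: float) -> Optional[float]:
    """Time at which a timeout-based detector first declares the node failed.

    `heartbeats` are sorted heartbeat arrival times in seconds; the node is
    assumed to have been heard from at t = 0. Failure is declared as soon as
    `timeout` seconds elapse without a heartbeat. Returns None if no failure
    would be declared by time `now`.
    """
    last = 0.0
    for t in heartbeats:
        if t - last > timeout:
            return last + timeout  # the gap before this heartbeat was too long
        last = t
    if now - last > timeout:
        return last + timeout      # silence after the last heartbeat
    return None


# A live node whose 4th heartbeat is delayed, then a real crash after t = 7.5.
hb = [1.0, 2.0, 3.0, 5.5, 6.5, 7.5]
print(first_suspected_failure(hb, 2.0, 20.0))   # 5.0: false alarm on a live node
print(first_suspected_failure(hb, 10.0, 20.0))  # 17.5: crash noticed 10 s late
```

With a 2 s timeout, a single delayed heartbeat on a live node is misjudged as a failure; with a 10 s timeout the delay is tolerated, but the real crash is only declared 10 s after the last heartbeat.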
Disclosure of Invention
The object of the invention is to provide a fault detection method and device for a distributed storage system, and a distributed storage system, that solve two problems of the traditional heartbeat-based approach to detecting faults in a distributed storage system: it occupies unnecessary computing resources on the storage nodes and the monitoring node, and the heartbeat detection time is difficult to set accurately.
In order to solve the above technical problem, the present invention provides a method for detecting a failure of a distributed storage system, where the distributed storage system includes storage nodes, and the storage nodes include disks for storing data, and the method includes:
while a storage node invokes a process to perform a data operation on a disk, monitoring the execution state of the process;
determining whether the execution state is abnormal;
if the execution state is abnormal, determining that the disk has failed.
Optionally, before the execution state of the process is monitored while the storage node invokes the process to perform the data operation on the disk, the method further includes:
in response to an operation request sent by a client to a storage node, monitoring the network connection state between the client and the storage node;
determining whether the network connection state is abnormal;
if the network connection state is abnormal, determining that the storage node has failed.
Optionally, after it is determined that the disk has failed because the execution state is abnormal, or after it is determined that the storage node has failed because the network connection state is abnormal, the method further includes:
sending fault prompt information to the monitoring node in the distributed storage system.
Optionally, monitoring the network connection state between the client and the storage node in response to the operation request sent by the client to the storage node includes:
in response to the operation request sent by the client to the storage node, disconnecting the heartbeat connection between the storage node and the monitoring node, and monitoring the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request.
Optionally, after the fault prompt information is sent to the monitoring node in the distributed storage system, the method further includes:
updating, in the monitoring node, the real-time state of the storage node or of the disk, and reallocating the storage node or disk that executes the operation request.
In addition, the present invention also provides a failure detection apparatus for a distributed storage system, where the distributed storage system includes a storage node, the storage node includes a disk for storing data, and the apparatus includes:
an execution state monitoring module: configured to monitor the execution state of a process while a storage node invokes the process to perform a data operation on a disk;
an execution state judgment module: configured to determine whether the execution state is abnormal;
a disk failure determination module: configured to determine that the disk has failed if the execution state is abnormal.
Optionally, the apparatus further comprises:
a network connection state monitoring module: configured to monitor the network connection state between a client and a storage node in response to an operation request sent by the client to the storage node;
a network connection state judgment module: configured to determine whether the network connection state is abnormal;
a node failure determination module: configured to determine that the storage node has failed if the network connection state is abnormal.
Optionally, the apparatus further comprises:
a fault prompting module: configured to send fault prompt information to the monitoring node in the distributed storage system.
Finally, the invention also provides a distributed storage system comprising a storage node, the storage node being configured to invoke a process to perform a data operation on a disk, monitor the execution state of the process during execution, and determine that the disk has failed when the execution state is abnormal.
Optionally, the distributed storage system further includes a monitoring node, and the storage node is configured to send fault prompt information to the monitoring node when it determines that the disk has failed.
The invention provides a fault detection method applied to a distributed storage system that includes storage nodes, each storage node including disks for storing data. While the storage node invokes a process to perform a data operation on a disk, the method monitors the execution state of the process, determines whether the execution state is abnormal, and, if so, determines that the disk has failed. Because the process performing the operation returns an execution state parameter reflecting whether the data operation on the disk succeeded, a disk-failure signal can be captured responsively from that state without heartbeat detection, which saves computing resources on the storage node and avoids the difficulty of setting an accurate heartbeat interval.
In addition, the invention also provides a fault detection device for the distributed storage system, and a distributed storage system, whose functions correspond to the method; they are not described again here.
Drawings
To illustrate the embodiments or technical solutions of the present invention more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a first embodiment of the fault detection method for a distributed storage system according to the present invention;
Fig. 2 is a flowchart of a second embodiment of the fault detection method for a distributed storage system according to the present invention;
Fig. 3 is a functional block diagram of an embodiment of the fault detection apparatus for a distributed storage system according to the present invention;
Fig. 4 is a block diagram of a distributed storage system according to the present invention.
Detailed Description
The core of the invention is to provide a fault detection method and device for a distributed storage system, and a distributed storage system, which responsively capture a disk-fault signal from the execution state of the process while the storage node performs a data operation on the disk, thereby saving computing resources on the storage node and avoiding the difficulty of setting an accurate heartbeat interval.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, a first embodiment of the fault detection method for a distributed storage system provided by the present invention is described as follows:
Step S101: Monitor the execution state of the process while the storage node invokes the process to perform a data operation on the disk.
The distributed storage system in this embodiment includes a client, storage nodes and a monitoring node: the client sends operation requests to a storage node, the storage node performs the corresponding operation on data in its disk according to the request, and the monitoring node detects storage node faults. The data operation in the above step includes, but is not limited to, read and write operations; the process is the process that performs the read or write operation; and the execution state of the process is a status parameter reflecting whether the process has successfully performed the read or write operation on the disk.
Step S102: Determine whether the execution state is abnormal.
As described above, the execution state of the process in this embodiment is a status parameter reflecting whether the process has successfully performed the read or write operation on the disk. When the execution state meets a preset condition, the state of the process is considered abnormal, which indicates that a fault such as disk removal (the disk being pulled out) or sector damage has occurred on the disk.
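One way to read the "preset condition" of Step S102 is as a set of operating-system error codes returned by the failed read or write. The following Python sketch assumes exactly that; the class and function names and the chosen errno set are hypothetical illustrations, not part of the disclosed method.

```python
import errno
from dataclasses import dataclass
from typing import Optional

# Hypothetical "preset condition": errno values treated as evidence of a disk
# fault. EIO commonly accompanies unreadable sectors, ENODEV/ENXIO may indicate
# a device that has disappeared (e.g. a pulled disk), and EROFS may indicate a
# file system remounted read-only after errors. The source does not give a set.
DISK_FAULT_ERRNOS = frozenset({errno.EIO, errno.ENODEV, errno.ENXIO, errno.EROFS})


@dataclass(frozen=True)
class ExecutionState:
    """Status parameter returned by the process that performed the read/write."""
    ok: bool
    err: Optional[int] = None  # errno of the failed operation, if any


def is_abnormal(state: ExecutionState) -> bool:
    """Step S102: abnormal if the operation failed with a preset fault errno.

    Failures such as ENOENT (missing file) are ordinary application errors and
    are deliberately not treated as disk faults.
    """
    return not state.ok and state.err in DISK_FAULT_ERRNOS


print(is_abnormal(ExecutionState(ok=False, err=errno.EIO)))     # True
print(is_abnormal(ExecutionState(ok=False, err=errno.ENOENT)))  # False
```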
Step S103: If the execution state is abnormal, determine that the disk has failed.
After the disk is determined to have failed, a fault prompt signal can be generated and sent to the monitoring node, so that the disk fault is discovered by signal capture and the monitoring node can handle the failed disk. This avoids having the storage node broadcast its disk state to the monitoring node at regular intervals, or having the monitoring node probe the storage node at regular intervals, thereby saving computing resources on both the storage node and the monitoring node, detecting disk faults responsively, and improving fault-handling efficiency.
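Steps S101 to S103 can be sketched in Python as a wrapper around a single data operation. This is a minimal sketch under stated assumptions: the execution state is taken to be success or the errno carried by an `OSError`, the fault errno set is hypothetical, and `run_disk_op`, `write_block` and the `report_fault` callback (standing in for whatever channel delivers the fault prompt to the monitoring node) are illustrative names, not the patent's implementation.

```python
import errno
import os
from typing import Callable

# Hypothetical set of errno values treated as disk faults (Step S102).
DISK_FAULT_ERRNOS = frozenset({errno.EIO, errno.ENODEV, errno.ENXIO, errno.EROFS})


def run_disk_op(op: Callable[[], None], disk_id: str,
                report_fault: Callable[[str, int], None]) -> bool:
    """Steps S101-S103 for one data operation on one disk.

    S101: run the operation and capture its execution state.
    S102: check a failure's errno against the preset fault set.
    S103: on a disk fault, send a fault prompt via `report_fault`.
    Nothing is sent while operations succeed, so no periodic heartbeat is
    needed for disk faults. Non-OS exceptions propagate unchanged.
    Returns True if the operation succeeded.
    """
    try:
        op()
    except OSError as exc:
        if exc.errno in DISK_FAULT_ERRNOS:
            report_fault(disk_id, exc.errno)
        return False
    return True


def write_block(path: str, data: bytes) -> None:
    """Example data operation: write, then fsync so device errors surface here."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
```

A storage node might then call `run_disk_op(lambda: write_block(path, data), "sdb", send_to_monitor)`, where `send_to_monitor` is a hypothetical function that forwards the fault prompt to the monitoring node.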
This embodiment provides a fault detection method for a distributed storage system: while a storage node invokes a process to perform a data operation on a disk, the method monitors the execution state of the process, determines whether that state is abnormal, and determines that the disk has failed when it is. Because the process performing the operation returns an execution state parameter reflecting whether the data operation on the disk succeeded, a disk-failure signal can be captured responsively without heartbeat detection, saving computing resources on the storage node and avoiding the difficulty of setting an accurate heartbeat interval.
The second embodiment of the fault detection method for a distributed storage system provided by the invention builds on the first embodiment and extends it.
Specifically, the first embodiment detects only disk failures. In an actual application scenario, however, the storage node itself may also fail, for example through power loss or a network fault. The second embodiment addresses this problem with the following implementation which, referring to Fig. 2, includes:
Step S201: In response to an operation request sent by a client to a storage node, monitor the network connection state between the client and the storage node.
Specifically, in response to the operation request sent by the client to the storage node, the heartbeat connection between the storage node and the monitoring node is disconnected, and the network connection state between the client and the storage node is monitored until the storage node finishes executing the operation corresponding to the request. In other fault scenarios, the normal heartbeat connection is maintained, and fault handling is performed when the heartbeat connection becomes abnormal. The storage node may specifically be an Object Storage Device (OSD).
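The heartbeat suspension in Step S201 can be sketched as a small gate on the storage node that a heartbeat loop consults before each send. This Python sketch is hypothetical: the class name is invented, and counting in-flight operations, so that the heartbeat resumes only after every concurrent request has finished, is an assumption the source does not state.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class HeartbeatGate:
    """Hypothetical switch for a storage node's heartbeat to the monitoring node.

    While at least one client operation is in flight, heartbeats are suspended
    and the client connection acts as the liveness signal; once every
    operation has finished, the normal heartbeat resumes.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0

    @property
    def heartbeat_enabled(self) -> bool:
        with self._lock:
            return self._in_flight == 0

    @contextmanager
    def client_operation(self) -> Iterator[None]:
        with self._lock:
            self._in_flight += 1      # suspend the heartbeat (Step S201)
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1  # resume once the operation finishes


gate = HeartbeatGate()
with gate.client_operation():
    print(gate.heartbeat_enabled)  # False
print(gate.heartbeat_enabled)      # True
```

The `finally` clause ensures the heartbeat resumes even if the client operation raises an exception.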
Step S202: Determine whether the network connection state is abnormal; if so, proceed to Step S203; otherwise, proceed to Step S204.
If reads and writes between the client and the storage node are abnormal and the number of read/write anomalies exceeds a certain threshold, the storage node can be judged to be abnormal (power failure or network fault), and a storage node fault signal is actively reported to the monitoring node.
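The threshold rule above can be sketched as a per-node counter of client read/write anomalies. In this hypothetical Python sketch, the class name, the threshold value in the example, counting *consecutive* anomalies (a success resets the count), and reporting only once are all assumptions; the source says only that a fault is reported when anomalies exceed "a certain threshold".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClientIoMonitor:
    """Counts consecutive read/write anomalies seen on client connections to
    one storage node and reports a node fault once they exceed `threshold`."""
    node_id: str
    threshold: int
    report_node_fault: Callable[[str], None]
    consecutive_errors: int = 0
    reported: bool = False

    def record(self, io_ok: bool) -> None:
        if io_ok:
            self.consecutive_errors = 0  # a successful read/write clears the streak
            return
        self.consecutive_errors += 1
        if self.consecutive_errors > self.threshold and not self.reported:
            self.reported = True         # report once; avoid flooding the monitor
            self.report_node_fault(self.node_id)


faults = []
mon = ClientIoMonitor("osd.3", threshold=2, report_node_fault=faults.append)
for ok in [False, False, True, False, False, False]:
    mon.record(ok)
print(faults)  # ['osd.3']
```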
Step S203: Determine that the storage node has failed, and proceed to Step S207.
Step S204: Monitor the execution state of the process while the storage node invokes the process to perform a data operation on the disk.
Step S205: Determine whether the execution state is abnormal; if so, proceed to Step S206.
That is, during normal reading and writing, if read/write anomalies occur on a disk (for example, the disk is pulled out or has damaged sectors), a read/write anomaly signal is emitted before the OSD process corresponding to that disk exits. The storage node captures this signal and actively reports it to the monitoring node, which then handles the anomaly.
Step S206: Determine that the disk has failed.
Step S207: Send fault prompt information to the monitoring node.
Step S208: Update the real-time state of the storage node or disk in the monitoring node, and reallocate the storage node or disk that executes the operation request.
If a reported disk fault signal is received, the monitoring node can immediately remove the failed disk from the storage cluster, so that the cluster continues to read and write normally; if a node fault signal is received, the monitoring node can immediately remove the storage node from the cluster and actively trigger a switch of the read/write service node IP to a normal node.
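The monitoring-node reaction in Step S208 can be sketched as updates to a cluster view. This Python sketch is a hypothetical illustration: the class and field names are invented, and reassigning the failed node's service IPs round-robin over the remaining healthy nodes is an assumption, since the source says only that the IP is switched "to a normal node".

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ClusterView:
    """Hypothetical monitoring-node state: which disks each node contributes
    and which storage node currently serves each read/write service IP."""
    disks: Dict[str, Set[str]]
    service_ip_owner: Dict[str, str]
    down_nodes: Set[str] = field(default_factory=set)

    def on_disk_fault(self, node: str, disk: str) -> None:
        # Remove the failed disk from the cluster so no new I/O is placed on it.
        self.disks.get(node, set()).discard(disk)

    def on_node_fault(self, node: str) -> None:
        # Remove the node, then move each service IP it held to a healthy node
        # (round-robin here; the source only says "a normal node").
        self.down_nodes.add(node)
        healthy = sorted(n for n in self.disks if n not in self.down_nodes)
        if not healthy:
            return  # no normal node left to take over
        moved = 0
        for ip in sorted(self.service_ip_owner):
            if self.service_ip_owner[ip] == node:
                self.service_ip_owner[ip] = healthy[moved % len(healthy)]
                moved += 1


view = ClusterView(
    disks={"n1": {"sda", "sdb"}, "n2": {"sda"}, "n3": {"sda"}},
    service_ip_owner={"192.0.2.10": "n1", "192.0.2.11": "n1", "192.0.2.12": "n2"},
)
view.on_disk_fault("n1", "sdb")
view.on_node_fault("n1")
print(view.service_ip_owner)
# {'192.0.2.10': 'n2', '192.0.2.11': 'n3', '192.0.2.12': 'n2'}
```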
It can be seen that, with the fault detection method of this embodiment, a disk failure is reported to the monitoring node by signal capture, which speeds up fault handling. A node failure (power loss or network fault) is detected with the help of the connection between the client and the cluster: when read/write anomalies exceed a certain threshold, the monitoring node is notified and actively removes the failed node from the cluster. This shortens handling time in node-failure scenarios and improves the reliability and stability of the whole cluster.
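The branching of Steps S201 to S207 in Fig. 2 can be summarised in a short Python sketch. The hooks are hypothetical: `connection_ok` stands in for monitoring the client connection, which the method does for the whole duration of the request whereas the sketch checks it once, and `disk_op_ok` stands in for running the disk operation and checking its execution state.

```python
from enum import Enum
from typing import Callable, Optional


class Fault(Enum):
    NODE = "node"  # Step S203: storage node failed (power loss, network fault)
    DISK = "disk"  # Step S206: disk failed (pulled out, damaged sectors, ...)


def handle_request(connection_ok: Callable[[], bool],
                   disk_op_ok: Callable[[], bool],
                   notify_monitor: Callable[[Fault], None]) -> Optional[Fault]:
    """One pass through Steps S201-S207 for a single client request."""
    if not connection_ok():
        fault = Fault.NODE     # S202 abnormal -> S203
    elif not disk_op_ok():
        fault = Fault.DISK     # S205 abnormal -> S206
    else:
        return None            # no fault detected for this request
    notify_monitor(fault)      # S207: fault prompt to the monitoring node
    return fault
```

Note that the disk operation is only attempted when the connection check passes, mirroring the "otherwise, proceed to Step S204" branch.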
In the following, a fault detection apparatus of a distributed storage system according to an embodiment of the present invention is introduced, and a fault detection apparatus of a distributed storage system described below and a fault detection method of a distributed storage system described above may be referred to correspondingly.
The distributed storage system related to the embodiment of the apparatus includes storage nodes, and the storage nodes include disks for storing data, as shown in fig. 3, the embodiment of the apparatus includes:
The execution state monitoring module 301: configured to monitor the execution state of a process while the storage node invokes the process to perform a data operation on the disk.
The execution state judgment module 302: configured to determine whether the execution state is abnormal.
The disk failure determination module 303: configured to determine that the disk has failed if the execution state is abnormal.
As an optional implementation, the apparatus further comprises:
The network connection state monitoring module 304: configured to monitor the network connection state between a client and the storage node in response to an operation request sent by the client to the storage node.
The network connection state judgment module 305: configured to determine whether the network connection state is abnormal.
The node failure determination module 306: configured to determine that the storage node has failed if the network connection state is abnormal.
As an optional implementation, the apparatus further comprises:
The fault prompting module 307: configured to send fault prompt information to the monitoring node in the distributed storage system.
The apparatus of this embodiment is used to implement the fault detection method for a distributed storage system described above; for example, the execution state monitoring module 301, the execution state judgment module 302 and the disk failure determination module 303 implement Steps S101, S102 and S103 of that method, respectively. Their specific implementations can therefore be found in the description of the corresponding embodiments and are not repeated here.
In addition, since the fault detection apparatus of the distributed storage system of this embodiment is used to implement the fault detection method of the distributed storage system, the role of the fault detection apparatus corresponds to that of the method described above, and details are not described here.
In addition, the present invention further provides an embodiment of a distributed storage system, and a distributed storage system described below and a fault detection method of the distributed storage system described above may be referred to in correspondence.
As shown in fig. 4, the distributed storage system includes: a plurality of storage nodes 401, a monitoring node 402, a client 403. The basic functions of each component are: the client 403 is configured to send an operation request to the storage node 401 through the public network, the storage node 401 is configured to invoke a process to perform corresponding data operation to a disk in response to the operation request, and the monitoring node 402 is configured to monitor a state of each storage node 401 through the public network.
In this embodiment, the storage node 401 is configured to monitor an execution state of a process in a process of invoking the process to perform a data operation on a disk, determine that the disk fails when the execution state is abnormal, and send a disk failure notification message to the monitoring node 402.
In addition, the storage node 401 is further configured to monitor a network connection state between the client 403 and the storage node 401 in response to an operation request sent by the client 403 to the storage node 401, determine that the storage node 401 has a fault when the network connection state is abnormal, and report a node fault notification message to the monitoring node 402. Specifically, the storage node 401, in response to an operation request sent by a client 403 to the storage node 401, disconnects the heartbeat connection between the storage node 401 and the monitoring node 402, and monitors the network connection state between the client 403 and the storage node 401 until the storage node 401 finishes executing an operation corresponding to the operation request.
As an optional implementation manner, the monitoring node 402 is configured to, after receiving the failure notification message, update a real-time status of a storage node 401 or a real-time status of a disk in the monitoring node 402, and reallocate the storage node 401 or the disk that executes the operation request.
Since the distributed storage system of this embodiment implements the fault detection method described above, its specific implementation can be found in the foregoing description of the method embodiments, and its functions correspond to that method; details are not repeated here.
The embodiments are described in a progressive manner: each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The method, the device and the distributed storage system for detecting the fault of the distributed storage system provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (4)

1. A method for fault detection in a distributed storage system, the distributed storage system comprising storage nodes, the storage nodes comprising disks for storing data, the method comprising:
while a storage node invokes a process to perform a data operation on a disk, monitoring the execution state of the process;
determining whether the execution state is abnormal;
if the execution state is abnormal, determining that the disk has failed;
before the execution state of the process is monitored while the storage node invokes the process to perform the data operation on the disk, the method further comprises:
in response to an operation request sent by a client to a storage node, monitoring the network connection state between the client and the storage node;
determining whether the network connection state is abnormal;
if the network connection state is abnormal, determining that the storage node has failed;
monitoring the network connection state between the client and the storage node in response to the operation request sent by the client to the storage node specifically comprises:
responding to an operation request sent by a client to a storage node, disconnecting the heartbeat connection between the storage node and a monitoring node, and monitoring the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request;
after determining that the disk has failed because the execution state is abnormal, or after determining that the storage node has failed because the network connection state is abnormal, the method further comprises:
sending fault prompt information to the monitoring node in the distributed storage system.
2. The method of claim 1, further comprising, after sending the fault prompt information to the monitoring node in the distributed storage system:
updating, in the monitoring node, the real-time state of the storage node or of the disk, and reallocating the storage node or disk that executes the operation request.
3. An apparatus for fault detection in a distributed storage system, the distributed storage system comprising storage nodes including disks for storing data, the apparatus comprising:
an execution state monitoring module, configured to monitor the execution state of a process while a storage node calls the process to perform a data operation on a disk;
an execution state judgment module, configured to judge whether the execution state is abnormal;
a disk failure determination module, configured to determine that the disk has failed if the execution state is abnormal;
the apparatus further comprises:
a network connection state monitoring module, configured to monitor the network connection state between a client and a storage node in response to an operation request sent by the client to the storage node;
a network connection state judgment module, configured to judge whether the network connection state is abnormal;
a node fault determination module, configured to determine that the storage node has failed if the network connection state is abnormal;
the network connection state monitoring module is specifically configured to:
in response to the operation request sent by the client to the storage node, disconnect the heartbeat connection between the storage node and a monitoring node, and monitor the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request;
the apparatus further comprises:
a fault prompting module, configured to send fault prompt information to the monitoring node in the distributed storage system.
4. A distributed storage system, characterized by comprising storage nodes, wherein the distributed storage system is configured to: in response to an operation request sent by a client to a storage node, monitor the network connection state between the client and the storage node; judge whether the network connection state is abnormal; and if the network connection state is abnormal, determine that the storage node has failed; if the network connection state is normal, the storage node is configured to call a process to perform a data operation on a disk, monitor the execution state of the process during execution, and determine that the disk has failed when the execution state is abnormal;
wherein monitoring the network connection state between the client and the storage node in response to the operation request sent by the client to the storage node specifically comprises:
in response to the operation request sent by the client to the storage node, disconnecting the heartbeat connection between the storage node and a monitoring node, and monitoring the network connection state between the client and the storage node until the storage node finishes executing the operation corresponding to the operation request;
the distributed storage system further comprises the monitoring node, and the storage node is configured to send fault prompt information to the monitoring node when the disk is determined to have failed.
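Claim 1 does not define what makes a process's execution state "abnormal". The minimal Python sketch below assumes one plausible reading: the disk data operation runs in a worker thread, and its state counts as abnormal if it raises an error or fails to finish within a deadline (for example, a hung I/O call). The names `monitor_execution`, `detect_disk_fault`, and the `timeout_s` parameter are hypothetical and are not taken from the patent.

```python
import concurrent.futures
from typing import Callable, Optional

# Hypothetical state labels; the patent does not define concrete states.
NORMAL = "normal"
ABNORMAL = "abnormal"

# Shared worker pool. Python threads cannot be killed, so a hung
# operation keeps occupying one worker slot until it returns.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def monitor_execution(op: Callable[[], object], timeout_s: float) -> str:
    """Run a disk data operation and classify its execution state.

    Assumption: "abnormal" means the operation raised an exception or
    did not complete within timeout_s seconds.
    """
    future = _pool.submit(op)
    try:
        future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return ABNORMAL  # still running after the deadline: treated as hung
    except Exception:
        return ABNORMAL  # the operation itself failed (e.g. an I/O error)
    return NORMAL


def detect_disk_fault(disk_id: str, op: Callable[[], object],
                      timeout_s: float = 1.0) -> Optional[str]:
    """Return fault prompt text if the disk is judged faulty, else None."""
    if monitor_execution(op, timeout_s) == ABNORMAL:
        return f"disk {disk_id} fault"
    return None
```

A deadline-based check is only one choice; an implementation could equally inspect exit codes or kernel I/O error counters, which the claims also leave open.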
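Claims 1, 3 and 4 state that on receiving a client request the heartbeat connection to the monitoring node is disconnected, and the client connection is watched instead until the operation finishes. If that connection turns abnormal, the storage node is judged faulty and a fault prompt goes to the monitoring node. The Python sketch below simulates this flow under stated assumptions. The operation is modeled as a number of work steps, a caller-supplied `client_link_ok(i)` callback stands in for real connection probing, the check is placed on the storage node (the claims do not say which component performs it), and the heartbeat is restored afterwards (also not stated in the patent). All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MonitorNode:
    """Hypothetical monitoring node that collects fault prompts."""
    prompts: List[str] = field(default_factory=list)

    def receive_fault_prompt(self, msg: str) -> None:
        self.prompts.append(msg)


@dataclass
class StorageNode:
    node_id: str
    monitor: MonitorNode
    heartbeat_connected: bool = True

    def handle_request(self, steps: int,
                       client_link_ok: Callable[[int], bool]) -> bool:
        """Serve a client request while monitoring the client connection.

        client_link_ok(i) reports whether the client connection is healthy
        before work step i. Returns True if the operation completed
        without a detected fault.
        """
        # As in the claims: disconnect the heartbeat to the monitoring node
        # while the client connection is being monitored.
        self.heartbeat_connected = False
        try:
            for i in range(steps):
                if not client_link_ok(i):
                    # Abnormal network connection: judge this node faulty
                    # and send a fault prompt to the monitoring node.
                    self.monitor.receive_fault_prompt(
                        f"storage node {self.node_id} fault")
                    return False
                # ... one unit of the data operation would run here ...
            return True
        finally:
            # Assumption: heartbeat resumes once the operation ends.
            self.heartbeat_connected = True
```

Suspending the heartbeat during a request plausibly avoids a busy node being declared dead while it is demonstrably serving a live client; the client connection acts as the liveness signal for that interval.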
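Claim 2 adds that, after receiving a fault prompt, the monitoring node updates the real-time state of the storage node or disk and reallocates the target that executes the operation request. The Python sketch below shows one simple way to do this. It keeps a health flag per target and a map from request to target, and it moves each affected request to the first healthy target in insertion order. That placement policy, the `MonitoringNode` class, and its method names are assumptions for illustration; the patent does not specify how a new target is chosen.

```python
from typing import Dict, Iterable, Optional


class MonitoringNode:
    """Hypothetical monitoring node that tracks real-time target states."""

    def __init__(self, targets: Iterable[str]) -> None:
        # Real-time state per storage node or disk: True means healthy.
        self.state: Dict[str, bool] = {t: True for t in targets}
        # Which storage node or disk is executing each operation request.
        self.assignment: Dict[str, str] = {}

    def assign(self, request_id: str) -> Optional[str]:
        """Assign a request to the first healthy target (assumed policy)."""
        for target, healthy in self.state.items():
            if healthy:
                self.assignment[request_id] = target
                return target
        # No healthy target left: drop the stale assignment.
        self.assignment.pop(request_id, None)
        return None

    def on_fault_prompt(self, faulty: str) -> Dict[str, Optional[str]]:
        """Mark a target faulty and reallocate the requests it was executing.

        Returns a map from each moved request to its new target (or None).
        """
        self.state[faulty] = False
        moved: Dict[str, Optional[str]] = {}
        # Iterate over a snapshot because assign() mutates self.assignment.
        for request_id, target in list(self.assignment.items()):
            if target == faulty:
                moved[request_id] = self.assign(request_id)
        return moved
```

For example, with targets `n1`, `n2`, `n3` and two requests on `n1`, a fault prompt for `n1` moves both requests to `n2`. A production system would more likely weight placement by load or data locality.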
CN201811511589.6A 2018-12-11 2018-12-11 Fault detection method and device for distributed storage system Active CN109274544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811511589.6A CN109274544B (en) 2018-12-11 2018-12-11 Fault detection method and device for distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811511589.6A CN109274544B (en) 2018-12-11 2018-12-11 Fault detection method and device for distributed storage system

Publications (2)

Publication Number Publication Date
CN109274544A 2019-01-25
CN109274544B 2021-06-29

Family

ID=65186913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811511589.6A Active CN109274544B (en) 2018-12-11 2018-12-11 Fault detection method and device for distributed storage system

Country Status (1)

Country Link
CN (1) CN109274544B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110554839A (en) * 2019-07-30 2019-12-10 华为技术有限公司 Distributed storage system access method, client and computer program product
CN111176916B (en) * 2019-12-20 2023-04-07 国久大数据有限公司 Data storage fault diagnosis method and system
CN111600770B (en) * 2020-04-08 2022-12-02 贵州大方发电有限公司 DCS (distributed control system) annular network fault monitoring system, method and device
CN111756571B (en) 2020-05-28 2022-02-18 苏州浪潮智能科技有限公司 Cluster node fault processing method, device, equipment and readable medium
CN111817920A (en) * 2020-07-17 2020-10-23 济南浪潮数据技术有限公司 Method, device and system for optimizing load of distributed storage system and storage medium
CN112306781B (en) * 2020-11-20 2022-08-19 新华三大数据技术有限公司 Thread fault processing method, device, medium and equipment
CN112732494B (en) * 2020-12-29 2024-02-13 北京浪潮数据技术有限公司 Bad disk replacement method, device, equipment and medium of storage system
CN115629906B (en) * 2022-12-21 2023-03-21 北京铜牛信息科技股份有限公司 Method and system for recovering cloud distributed storage data fault

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103823708A (en) * 2014-02-27 2014-05-28 深圳市深信服电子科技有限公司 Virtual machine read-write request processing method and device
CN106970851A (en) * 2016-01-14 2017-07-21 阿里巴巴集团控股有限公司 Method and apparatus for disk detection process in distributed file system

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US7496348B2 (en) * 2005-06-07 2009-02-24 Motorola, Inc. Wireless communication network security method and system
US9021299B2 (en) * 2011-02-18 2015-04-28 Ab Initio Technology Llc Restarting processes
CN103298013B (en) * 2013-06-24 2016-08-10 京信通信系统(中国)有限公司 Method and device for carrying out service recovery

Also Published As

Publication number Publication date
CN109274544A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109274544B (en) Fault detection method and device for distributed storage system
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
US10095576B2 (en) Anomaly recovery method for virtual machine in distributed environment
US20180212819A1 (en) Troubleshooting Method and Apparatus
CN108173911B (en) Micro-service fault detection processing method and device
CN106533805B (en) Micro-service request processing method, micro-service controller and micro-service architecture
CN106933843B (en) Database heartbeat detection method and device
CN108737574B (en) Node offline judgment method, device, equipment and readable storage medium
CN107729185B (en) Fault processing method and device
CN110659159A (en) Service process operation monitoring method, device, equipment and storage medium
CN112769652B (en) Node service monitoring method, device, equipment and medium
CN110740167A (en) Distributed storage system and node monitoring method thereof
CN111212127A (en) Storage cluster, service data maintenance method, device and storage medium
CN112988433A (en) Method, apparatus and computer program product for fault management
US11930292B2 (en) Device state monitoring method and apparatus
CN114168071B (en) Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium
CN110740064A (en) Distributed cluster node fault processing method, device, equipment and storage medium
CN111342986B (en) Distributed node management method and device, distributed system and storage medium
CN111314443A (en) Node processing method, device and equipment based on distributed storage system and medium
CN114826962A (en) Link fault detection method, device, equipment and machine readable storage medium
JP6421516B2 (en) Server device, redundant server system, information takeover program, and information takeover method
CN112069032A (en) Availability detection method, system and related device for virtual machine
CN112068935A (en) Method, device and equipment for monitoring deployment of kubernets program
CN113254245A (en) Fault detection method and system for storage cluster
CN114296979A (en) Method and device for detecting abnormal state of Internet of things equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant