CN113254245A - Fault detection method and system for storage cluster - Google Patents
- Publication number
- CN113254245A CN202010090855.3A
- Authority
- CN
- China
- Prior art keywords
- detection
- node
- storage node
- storage
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0727—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3034—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a fault detection method and system for a storage cluster, and relates to the technical field of computers. One embodiment of the method comprises: performing cyclic detection on each storage node in the storage cluster through a plurality of detection nodes; and, for each storage node, judging whether the storage node has failed according to the detection results of the plurality of detection nodes. According to this embodiment, survival detection can be performed on the storage cluster nodes without using a Zookeeper server, which avoids misjudging a storage cluster node as failed because of transient network interruptions, disconnections, and the like, and avoids the defect that fault detection becomes unavailable as a whole.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a fault detection method and system for a storage cluster.
Background
At present, enterprises use storage clusters to provide services externally, for example a Redis (Remote Dictionary Server, a key-value database) cluster. To monitor node failures in a Redis cluster, an existing scheme uses Zookeeper (a distributed, open-source application coordination service) temporary (ephemeral) nodes to monitor node survival. That is, each Redis node connects to a Zookeeper server and creates a corresponding temporary node, and a detector adds a Watcher (a monitoring mechanism provided by Zookeeper) to these temporary nodes. When a Redis node fails, the corresponding temporary node on Zookeeper is automatically deleted; because the detector has added a Watcher to that temporary node, it receives a deletion notification and thereby determines that the Redis node has failed.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
When a node of the storage cluster is disconnected from the Zookeeper server because of a transient network interruption, the corresponding temporary node on the Zookeeper server is automatically deleted, and the Watcher receives the deletion notification and considers the storage cluster node to have failed, which causes misjudgment. In addition, when the number of Watchers and the number of connections on Zookeeper itself become too large, a performance bottleneck occurs, and fault detection for the storage cluster nodes becomes unavailable as a whole.
Disclosure of Invention
In view of this, embodiments of the present invention provide a fault detection method and system for a storage cluster, which can perform survival detection on storage cluster nodes without using a Zookeeper server, thereby avoiding misjudging a storage cluster node as failed because of transient network interruptions, disconnections, and the like, and avoiding the defect that fault detection becomes unavailable as a whole.
To achieve the above object, according to an aspect of the embodiments of the present invention, a method for detecting a failure of a storage cluster is provided.
A failure detection method of a storage cluster comprises the following steps: carrying out cyclic detection on each storage node in the storage cluster through a plurality of detection nodes; and for each storage node, judging whether the storage node fails according to the detection results of the plurality of detection nodes.
Optionally, performing cyclic detection on each storage node in the storage cluster through a plurality of detection nodes includes: the plurality of detection nodes cyclically send detection requests to each storage node in the storage cluster, wherein each detection node sends the detection request to each storage node once in each cycle; if a first detection node receives, in one cycle, a response to a first detection request returned by a first storage node, generating the detection result indicating that the first storage node is normal, wherein the first detection node is one of the plurality of detection nodes, the first storage node is one of the storage nodes, and the first detection request is the detection request sent by the first detection node to the first storage node; and if the first detection node does not receive a response to the first detection request returned by the first storage node in N consecutive cycles, wherein N is a preset value, generating the detection result indicating that the first storage node has failed.
Optionally, the determining, according to the detection results of the plurality of detection nodes, whether the storage node fails includes: traversing the detection results of the plurality of detection nodes, judging whether the detection results of at least a preset number of detection nodes indicate that the storage node is normal, and if so, judging that the storage node is normal; otherwise, the storage node is judged to be in failure.
Optionally, the plurality of detection nodes perform cyclic detection on each storage node in the storage cluster through different dedicated network lines.
According to another aspect of the embodiments of the present invention, a system for detecting a failure of a storage cluster is provided.
A fault detection system for a storage cluster, comprising: a detection module, configured to perform cyclic detection on each storage node in the storage cluster through a plurality of detection nodes; and a judging module, configured to judge, for each storage node, whether the storage node has failed according to the detection results of the plurality of detection nodes.
Optionally, the detection module is further configured to: cyclically send detection requests to each storage node in the storage cluster through the plurality of detection nodes, wherein each detection node sends the detection request to each storage node once in each cycle; if a first detection node receives, in one cycle, a response to a first detection request returned by a first storage node, generate the detection result indicating that the first storage node is normal, wherein the first detection node is one of the plurality of detection nodes, the first storage node is one of the storage nodes, and the first detection request is the detection request sent by the first detection node to the first storage node; and if the first detection node does not receive a response to the first detection request returned by the first storage node in N consecutive cycles, wherein N is a preset value, generate the detection result indicating that the first storage node has failed.
Optionally, the determining module is further configured to: traversing the detection results of the plurality of detection nodes, judging whether the detection results of at least a preset number of detection nodes indicate that the storage node is normal, and if so, judging that the storage node is normal; otherwise, the storage node is judged to be in failure.
Optionally, the plurality of detection nodes of the detection module perform cyclic detection on each storage node in the storage cluster through different dedicated network lines.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; a memory for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the failure detection method for a storage cluster provided by an embodiment of the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer readable medium, on which a computer program is stored, which when executed by a processor implements a method for fault detection of a storage cluster provided by an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: each storage node in the storage cluster is cyclically detected through a plurality of detection nodes, and, for each storage node, whether the storage node has failed is judged according to the detection results of the plurality of detection nodes. Survival detection can therefore be performed on the storage cluster nodes without using a Zookeeper server, which avoids misjudging a storage cluster node as failed because of transient network interruptions, disconnections, and the like, and avoids the defect that fault detection becomes unavailable as a whole.
Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a storage cluster fault detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a Redis cluster deployment according to one embodiment of the invention;
FIG. 3 is a schematic diagram of a storage node probing process according to one embodiment of the invention;
FIG. 4 is a schematic diagram illustrating a storage node failure determination process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the major modules of a failure detection system of a storage cluster, according to one embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a fault detection system of a storage cluster according to an embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of main steps of a storage cluster fault detection method according to an embodiment of the present invention.
As shown in fig. 1, a method for detecting a failure of a storage cluster according to an embodiment of the present invention mainly includes the following steps S101 to S102.
Step S101: perform cyclic detection on each storage node in the storage cluster through a plurality of detection nodes.
Step S102: for each storage node, judge whether the storage node has failed according to the detection results of the plurality of detection nodes.
Performing cyclic detection on each storage node in the storage cluster through a plurality of detection nodes may specifically include: the plurality of detection nodes cyclically send detection requests to the storage nodes in the storage cluster, wherein each detection node sends the detection request to each storage node once in each cycle; if the first detection node receives, in one cycle, a response to the first detection request returned by the first storage node, a detection result indicating that the first storage node is normal is generated, wherein the first detection node is one of the plurality of detection nodes, the first storage node is one of the storage nodes, and the first detection request is the detection request sent by the first detection node to the first storage node; and if the first detection node does not receive a response to the first detection request returned by the first storage node in N consecutive cycles, N being a preset value, a detection result indicating that the first storage node has failed is generated.
Preferably, in each cycle the first detection node sends the first detection request to the first storage node and waits for the response returned by the first storage node. If no response is received in a given cycle, the number of failures is recorded as 1 (that is, 1 is added to the initial value 0). If, in the next cycle, the response is again not received after the first detection request is sent, the number of failures is incremented by 1 and recorded as 2, and so on. If no response is received in N consecutive cycles, the number of failures is recorded as N, and a detection result indicating that the first storage node has failed is generated. If the response returned by the first storage node is received in the i-th cycle, where 1 ≤ i < N, the number of failures is reset to zero, and counting restarts from 0 the next time a response is not received.
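As a non-limiting illustration, the failure counting described above may be sketched in Python as follows. The class name `FailureCounter`, the method `record`, and the intermediate verdict `"pending"` are hypothetical names introduced only for this sketch and do not appear in the embodiment; the threshold corresponds to the preset value N.

```python
from dataclasses import dataclass


@dataclass
class FailureCounter:
    """Consecutive missed responses for one (detection node, storage node) pair.

    Hypothetical helper for illustration only.
    """
    threshold: int      # N, the preset number of consecutive cycles
    failures: int = 0   # number of failures, initial value 0

    def record(self, got_response: bool) -> str:
        """Update the count for one cycle and return the current verdict."""
        if got_response:
            self.failures = 0   # any response resets the count to zero
            return "normal"
        self.failures += 1
        if self.failures >= self.threshold:
            return "failed"     # no response in N consecutive cycles
        return "pending"        # fewer than N misses so far, no result yet


counter = FailureCounter(threshold=3)
# True = response received in that cycle, False = no response.
verdicts = [counter.record(r) for r in (False, False, True, False, False, False)]
print(verdicts)
# ['pending', 'pending', 'normal', 'pending', 'pending', 'failed']
```

In this sketch the response in the third cycle resets the count, so the failure result is produced only after three further consecutive misses.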
The storage cluster may be a Redis cluster, other key-value database clusters, or other types of storage clusters.
In one embodiment, the probe request sent by the probe node to the storage node in the storage cluster may be a Ping request, and the response returned by the storage node to the Ping request may be a Pong message.
Whether the storage node has failed may be judged by voting according to the detection results of the plurality of detection nodes, which specifically includes: traversing the detection results of the plurality of detection nodes and judging whether the detection results of at least a preset number of detection nodes indicate that the storage node is normal; if so, the storage node is judged to be normal; otherwise, the storage node is judged to have failed.
The preset number can be set as required and is preferably set to 1.
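As a non-limiting illustration, the voting rule above may be sketched in Python as follows. The function name `judge_storage_node`, the parameter `min_normal` (standing for the preset number), and the string values `"normal"` and `"failed"` are assumptions made only for this sketch.

```python
from typing import Iterable

NORMAL = "normal"
FAILED = "failed"


def judge_storage_node(results: Iterable[str], min_normal: int = 1) -> str:
    """Vote over the detection results reported for one storage node.

    The node is judged normal if at least `min_normal` detection nodes
    reported it as normal; otherwise it is judged to have failed.
    """
    normal_votes = sum(1 for r in results if r == NORMAL)
    return NORMAL if normal_votes >= min_normal else FAILED


# One detection node still reaches the storage node: judged normal.
print(judge_storage_node([FAILED, NORMAL, FAILED]))
# No detection node reaches it: judged failed.
print(judge_storage_node([FAILED, FAILED, FAILED]))
# A stricter preset number of 2 turns a single normal vote into a failure.
print(judge_storage_node([FAILED, NORMAL, FAILED], min_normal=2))
```

With the preferred preset number of 1, a storage node is judged to have failed only when every detection node reports a failure, which is what suppresses misjudgment caused by a network problem on a single detection path.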
The detection nodes are physical machines (physical computers, as opposed to virtual machines), and the detection nodes are deployed in different racks of different machine rooms.
Each detection node includes a detection program; when executed, the detection programs perform cyclic detection on each storage node in the storage cluster through different dedicated network lines.
The detection nodes report the detection results to a judging program, which may be located on a physical machine different from each detection node. The judging program judges by voting, according to the detection results of the detection nodes, whether a storage node has failed.
According to the fault detection method for a storage cluster provided by the embodiment of the invention, fault monitoring is performed on the storage nodes by deploying a plurality of independent detection programs, so that a strong dependence on Zookeeper is avoided. Meanwhile, whether a storage node has failed is judged comprehensively by collecting the detection results of all the detection nodes, so that misjudgment of a storage node failure due to a single cause, such as a network failure, is avoided. In addition, the detection nodes are deployed on different racks in multiple machine rooms, which further reduces failure misjudgment and improves detection reliability.
The fault detection method for a storage cluster according to an embodiment of the present invention is described in detail below by taking a Redis cluster as an example. A schematic diagram of a Redis cluster deployment in an embodiment of the present invention is shown in FIG. 2. The Redis cluster includes one master Redis node, referred to as the Master, and one or more slave Redis nodes, referred to as Slaves. To ensure high availability of the entire Redis cluster, in the embodiment of the present invention each Redis node in the Redis cluster is cyclically detected by multiple detection nodes, and for each Redis node, whether it has failed is determined by voting according to the detection results of the multiple detection nodes. When a failure of a Redis node is detected, the failed node is removed from the Redis cluster.
FIG. 3 is a schematic diagram of a storage node probing process according to one embodiment of the invention.
As shown in FIG. 3, the storage node probing process according to an embodiment of the present invention includes: the detection program connects to the Redis node through the IP address and port of the Redis node and cyclically sends Ping requests to it; the detection program judges whether a response from the Redis node is received; if so, it determines that the Redis node is normal, sets the number of failures to 0, and then reports the determination result; otherwise, it adds 1 to the number of failures and judges whether the number of failures exceeds the specified number; if so, it determines that the Redis node has failed and then reports the determination result, and if not, it returns to the step of cyclically sending the Ping request to the Redis node.
The detection program is located in the detection node. When the detection node cyclically sends Ping requests to the Redis nodes, it may send them according to a cycle period, where one cycle period is the period in which the detection node sends a Ping request to all Redis nodes. For example, if the detection node has not received a response from a Redis node after one cycle, the number of failures is incremented by 1 (its initial value is 0), and the Ping request is sent to the Redis node again in the next cycle. If the next cycle still receives no response from the Redis node, the above process is repeated until the number of failures exceeds the specified number, at which point the Redis node is determined to have failed and the detection program reports the determination result to the judging program. The storage node failure determination process executed by the judging program is described below.
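As a non-limiting illustration, one cycle of the probing process of FIG. 3 may be sketched in Python as follows. The function names `ping_redis` and `run_cycle`, the timeout value, and the node addresses are hypothetical. The sketch assumes a Redis node that accepts an unauthenticated, unencrypted inline `PING` command and replies `+PONG`; following the wording of FIG. 3, a node is reported as failed once its number of failures exceeds the specified number.

```python
import socket


def ping_redis(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if the node answers an inline PING with +PONG in time.

    Assumes no authentication and no TLS; any socket error or timeout
    counts as "no response".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        return False


def run_cycle(nodes, failures, threshold, ping=ping_redis):
    """One cycle: ping every node once and return the results determined.

    `failures` maps (host, port) to the current number of failures and is
    updated in place. Nodes with no determination yet are left out.
    """
    report = {}
    for node in nodes:
        if ping(*node):
            failures[node] = 0              # response received: reset
            report[node] = "normal"
        else:
            failures[node] = failures.get(node, 0) + 1
            if failures[node] > threshold:  # exceeds the specified number
                report[node] = "failed"
    return report


# Demonstration with a stand-in ping so that no real Redis node is needed.
nodes = [("10.0.0.1", 6379), ("10.0.0.2", 6379)]   # hypothetical addresses
failures = {}
fake_ping = lambda host, port: host == "10.0.0.1"  # second node never answers
for _ in range(3):
    report = run_cycle(nodes, failures, threshold=2, ping=fake_ping)
print(report)
```

The `ping` parameter is injected so that the counting logic can be exercised without a network; in a deployment along the lines of the embodiment, each detection node would call `run_cycle` once per cycle period and report the returned results to the judging program.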
Fig. 4 is a schematic diagram of a storage node failure determination process according to an embodiment of the present invention.
As shown in FIG. 4, a storage node failure determination process according to an embodiment of the present invention includes: receiving the detection results for the Redis node reported by all the detection nodes; and iterating over the detection results of the detection nodes, judging that the Redis node works normally if any one detection node considers the Redis node normal, and otherwise judging that the Redis node has failed.
If the Redis node is judged to have failed, a failover program is notified and migrates the failed Redis node, so that the failed Redis node is removed from the Redis cluster.
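As a non-limiting illustration, a judging program that collects reports and notifies a failover program may be sketched in Python as follows. The class `JudgeProgram`, its methods `report` and `evaluate`, the callback `on_failure`, and the identifiers `probe-a`, `redis-1`, and so on are hypothetical. The sketch assumes that a Redis node is judged only after every detection node has reported on it, and it uses a plain callback in place of the failover program.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class JudgeProgram:
    """Collects detection results and judges each Redis node by voting."""

    def __init__(self, detectors: List[str], on_failure: Callable[[str], None]):
        self.detectors = detectors      # identifiers of all detection nodes
        self.on_failure = on_failure    # stand-in for the failover program
        # latest[redis_node][detector_id] -> "normal" or "failed"
        self.latest: Dict[str, Dict[str, str]] = defaultdict(dict)

    def report(self, detector_id: str, redis_node: str, result: str) -> None:
        """Store the most recent result reported by one detection node."""
        self.latest[redis_node][detector_id] = result

    def evaluate(self) -> List[str]:
        """Return the Redis nodes judged failed and notify failover for each."""
        failed = []
        for node, votes in self.latest.items():
            all_reported = all(d in votes for d in self.detectors)
            # Normal as soon as one detection node considers the node normal.
            if all_reported and not any(v == "normal" for v in votes.values()):
                failed.append(node)
                self.on_failure(node)
        return failed


removed = []
judge = JudgeProgram(["probe-a", "probe-b"], on_failure=removed.append)
judge.report("probe-a", "redis-1", "failed")
judge.report("probe-b", "redis-1", "normal")   # one normal vote is enough
judge.report("probe-a", "redis-2", "failed")
judge.report("probe-b", "redis-2", "failed")
print(judge.evaluate())   # ['redis-2']
print(removed)            # ['redis-2']
```

Calling `evaluate` again would notify the callback again for a node that is still judged failed; a real judging program would need to track which nodes have already been handed to failover.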
According to the Redis cluster fault detection method provided by the embodiment of the invention, survival detection is performed on the Redis nodes by deploying independent detection programs, and whether a Redis node has failed is finally judged by voting, so that misjudging a Redis node as failed because of transient network interruptions, disconnections, and the like is avoided.
FIG. 5 is a schematic diagram of the main modules of a failure detection system of a storage cluster according to one embodiment of the present invention.
As shown in fig. 5, a system 500 for detecting a failure of a storage cluster according to an embodiment of the present invention mainly includes: a detection module 501 and a judgment module 502.
The detecting module 501 is configured to perform cyclic detection on each storage node in the storage cluster through a plurality of detection nodes.
The determining module 502 is configured to, for each storage node, determine whether the storage node fails according to the detection results of the multiple detection nodes.
The detection module 501 is specifically configured to: cyclically send detection requests to the storage nodes in the storage cluster through a plurality of detection nodes, wherein each detection node sends the detection request to each storage node once in each cycle; if the first detection node receives, in one cycle, a response to the first detection request returned by the first storage node, generate a detection result indicating that the first storage node is normal, wherein the first detection node is one of the plurality of detection nodes, the first storage node is one of the storage nodes of the storage cluster, and the first detection request is the detection request sent by the first detection node to the first storage node; and if the first detection node does not receive a response to the first detection request returned by the first storage node in N consecutive cycles, N being a preset value, generate a detection result indicating that the first storage node has failed.
The judging module 502 may be specifically configured to judge by voting whether the storage node has failed: specifically, it traverses the detection results of the plurality of detection nodes and judges whether the detection results of at least a preset number of detection nodes indicate that the storage node is normal; if so, the storage node is judged to be normal; otherwise, the storage node is judged to have failed.
The plurality of detection nodes of the detection module 501 perform cyclic detection on each storage node in the storage cluster through different dedicated network lines.
FIG. 6 is a schematic structural diagram of a system for detecting a failure of a storage cluster according to an embodiment of the present invention.
The fault detection system of the storage cluster in one embodiment of the invention is mainly divided into two modules, namely a detection module and a judging module. The detection module comprises a plurality of detection nodes, which are deployed on different racks of different machine rooms, so that misjudgment caused by a network outage within a single machine room can be avoided. The multiple detection nodes probe the nodes in the Redis cluster and report the detection results to the judging module, and the judging module comprehensively judges whether a Redis node has failed according to the collected detection results. The structure of the fault detection system of the storage cluster is shown in FIG. 6, wherein the detection module is not shown in the figure.
Each detection node of the detection module performs cyclic detection on the Redis node. If the Redis node fails to return a valid response more than a specified number of times, the Redis node is considered to have failed; otherwise, the Redis node is considered to be running normally. When probing the Redis node, the detection node sends a Ping request to the Redis node, and the Redis node returns a Pong message in response to the Ping request; receipt of the Pong message by the detection node indicates that the Redis node has returned a valid response.
The detection node reports the detection result to the judging module.
The judging module collects the detection results reported by the detection module and judges whether the Redis node has failed. The specific judging logic is that the Redis node is judged normal as long as one detection instance considers it normal.
The fault detection system of the storage cluster in the embodiment of the invention performs fault monitoring on the Redis nodes by deploying a plurality of independent detection programs, thereby avoiding a strong dependence on Zookeeper. Meanwhile, whether a Redis node has failed is judged comprehensively by collecting the detection results of all the detection nodes, so that misjudgment of a Redis node failure due to a single cause, such as a network failure, is avoided. In addition, the detection nodes are deployed on different racks in multiple machine rooms, which further reduces failure misjudgment, improves detection reliability, and ensures high availability of the fault detection system as a whole.
In addition, in the embodiment of the present invention, the detailed implementation content of the fault detection system of the storage cluster has been described in detail in the above fault detection method of the storage cluster, so that repeated content herein is not described again.
Fig. 7 shows an exemplary system architecture 700 of a failure detection method of a storage cluster or a failure detection system of a storage cluster to which an embodiment of the present invention may be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 701, 702, 703 to interact with a server 705 over a network 704, to receive or send messages or the like. The terminal devices 701, 702, 703 may have installed thereon various communication client applications, such as a shopping-like application, a web browser application, a search-like application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 701, 702, 703. The background management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information, by way of example only) to the terminal device.
It should be noted that the method for detecting a failure of a storage cluster provided in the embodiment of the present invention is generally executed by the server 705, and accordingly, a system for detecting a failure of a storage cluster is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to fig. 8, a block diagram of a computer system 800 suitable for implementing a terminal device or server of an embodiment of the present application is shown. The terminal device or server shown in fig. 8 is only an example and should not limit the functions or the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the system 800. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read from it is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present application when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium described in the present invention can be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including a detection module and a judging module. In some cases the names of these modules do not limit the modules themselves; for example, the detection module may also be described as a "module for performing cyclic detection on each storage node in a storage cluster through a plurality of detection nodes".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist separately without being incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: carrying out cyclic detection on each storage node in the storage cluster through a plurality of detection nodes; and, for each storage node, judging whether the storage node fails according to the detection results of the plurality of detection nodes.
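The cyclic detection step can be illustrated with a minimal Python sketch of a single detection node. It follows the rule stated in the claims: a response within a cycle yields a "normal" result, and N consecutive cycles without a response yield a "failure" result. The class name `ProbeNode`, the `ping` callback, and the choice to leave the previous result unchanged while the miss count is still below N are assumptions made for illustration; they are not taken from the patent text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class ProbeNode:
    """One detection node's view of the storage cluster (hypothetical helper)."""
    name: str
    max_misses: int  # N: preset number of consecutive unanswered cycles
    misses: Dict[str, int] = field(default_factory=dict)
    results: Dict[str, bool] = field(default_factory=dict)  # True = normal

    def run_cycle(self, storage_nodes: Iterable[str],
                  ping: Callable[[str], bool]) -> None:
        """Send one detection request to every storage node in this cycle."""
        for node in storage_nodes:
            if ping(node):
                # A response in this cycle resets the counter: node is normal.
                self.misses[node] = 0
                self.results[node] = True
            else:
                self.misses[node] = self.misses.get(node, 0) + 1
                # Only N consecutive misses produce a "failure" result;
                # below N the previous result is kept (an assumption).
                if self.misses[node] >= self.max_misses:
                    self.results[node] = False
```

In this sketch `ping` stands in for whatever request/response mechanism a deployment uses; a single lost response does not change the result, which is the property that tolerates brief network interruptions.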
According to the technical solution of the embodiment of the invention, the storage nodes in the storage cluster are cyclically detected by the plurality of detection nodes, and, for each storage node, whether the storage node has failed is judged by voting on the detection results of the plurality of detection nodes. Survival detection can thus be carried out on the storage cluster nodes without using a ZooKeeper server, misjudging a storage cluster node as failed because of momentary network interruptions, disconnections, and the like is avoided, and the defect of fault detection becoming unavailable as a whole is avoided.
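The voting step can be sketched as follows, assuming each detection node exposes its results as a dictionary mapping a storage node name to `True` (normal) or `False` (failure). The function name `judge_storage_nodes` and the parameter `min_normal_votes` (the "preset number" of the claims) are hypothetical names chosen for this example.

```python
from typing import Dict, List


def judge_storage_nodes(probe_results: List[Dict[str, bool]],
                        storage_nodes: List[str],
                        min_normal_votes: int) -> Dict[str, bool]:
    """Return True (normal) or False (failed) for each storage node.

    A storage node is judged normal when at least `min_normal_votes`
    detection nodes report it as normal; otherwise it is judged failed.
    """
    verdicts: Dict[str, bool] = {}
    for node in storage_nodes:
        # Traverse every detection node's result and count "normal" votes.
        normal_votes = sum(1 for result in probe_results
                           if result.get(node) is True)
        verdicts[node] = normal_votes >= min_normal_votes
    return verdicts


# Three detection nodes; the second one has lost its link to s1.
results = [
    {"s1": True, "s2": False},
    {"s1": False, "s2": False},
    {"s1": True, "s2": True},
]
verdicts = judge_storage_nodes(results, ["s1", "s2"], min_normal_votes=2)
```

With a threshold of two, `s1` is still judged normal even though one detection node cannot reach it, while `s2` is judged failed because only one detection node reports it as normal.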
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A failure detection method for a storage cluster is characterized by comprising the following steps:
carrying out cyclic detection on each storage node in the storage cluster through a plurality of detection nodes;
and for each storage node, judging whether the storage node fails according to the detection results of the plurality of detection nodes.
2. The method of claim 1, wherein performing cyclic probing on each storage node in the storage cluster through a plurality of probing nodes comprises:
the plurality of detection nodes send detection requests to each storage node in the storage cluster in a circulating mode, wherein each detection node sends the detection request to each storage node once in each circulation;
if a first detection node receives, in one cycle, a response to a first detection request returned by a first storage node, generating the detection result indicating that the first storage node is normal, wherein the first detection node is one of the plurality of detection nodes, the first storage node is one of the storage nodes, and the first detection request is the detection request sent by the first detection node to the first storage node;
and if the first detection node does not receive a response to the first detection request returned by the first storage node in N successive cycles, wherein N is a preset value, generating a detection result indicating the failure of the first storage node.
3. The method according to claim 1, wherein said determining whether the storage node is failed according to the probing results of the probing nodes comprises:
traversing the detection results of the plurality of detection nodes, judging whether the detection results of at least a preset number of detection nodes indicate that the storage node is normal, and if so, judging that the storage node is normal; otherwise, the storage node is judged to be in failure.
4. The method of claim 1, wherein the plurality of probing nodes perform cyclic probing of each storage node in the storage cluster through different dedicated network lines.
5. A system for failure detection of a storage cluster, comprising:
the detection module is used for circularly detecting each storage node in the storage cluster through a plurality of detection nodes;
and the judging module is used for judging, for each storage node, whether the storage node fails according to the detection results of the plurality of detection nodes.
6. The system of claim 5, wherein the detection module is further configured to:
sending a probe request to each storage node in the storage cluster in a circulating manner through the plurality of probe nodes, wherein each probe node sends the probe request to each storage node once in each circulation;
if a first detection node receives, in one cycle, a response to a first detection request returned by a first storage node, generating the detection result indicating that the first storage node is normal, wherein the first detection node is one of the plurality of detection nodes, the first storage node is one of the storage nodes, and the first detection request is the detection request sent by the first detection node to the first storage node;
and if the first detection node does not receive a response to the first detection request returned by the first storage node in N successive cycles, wherein N is a preset value, generating a detection result indicating the failure of the first storage node.
7. The system of claim 5, wherein the judging module is further configured to:
traversing the detection results of the plurality of detection nodes, judging whether the detection results of at least a preset number of detection nodes indicate that the storage node is normal, and if so, judging that the storage node is normal; otherwise, the storage node is judged to be in failure.
8. The system of claim 5, wherein the plurality of probing nodes of the detection module perform cyclic probing on each storage node in the storage cluster through different dedicated network lines.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010090855.3A CN113254245A (en) | 2020-02-13 | 2020-02-13 | Fault detection method and system for storage cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010090855.3A CN113254245A (en) | 2020-02-13 | 2020-02-13 | Fault detection method and system for storage cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113254245A (en) | 2021-08-13 |
Family
ID=77219897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010090855.3A Pending CN113254245A (en) | 2020-02-13 | 2020-02-13 | Fault detection method and system for storage cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113254245A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114531373A (en) * | 2022-02-25 | 2022-05-24 | 苏州浪潮智能科技有限公司 | Node state detection method, node state detection device, equipment and medium |
CN115499294A (en) * | 2022-09-21 | 2022-12-20 | 上海天玑科技股份有限公司 | Distributed storage environment network sub-health detection and fault automatic processing method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107153595A (en) * | 2016-03-04 | 2017-09-12 | 福建天晴数码有限公司 | The fault detection method and its system of distributed data base system |
CN109951331A (en) * | 2019-03-15 | 2019-06-28 | 北京百度网讯科技有限公司 | For sending the method, apparatus and computing cluster of information |
- 2020-02-13: CN application CN202010090855.3A (patent/CN113254245A/en), active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109714192B (en) | Monitoring method and system for monitoring cloud platform | |
US10048996B1 (en) | Predicting infrastructure failures in a data center for hosted service mitigation actions | |
CN107872402B (en) | Global flow scheduling method and device and electronic equipment | |
CN109257200B (en) | Method and device for monitoring big data platform | |
CN108696581B (en) | Distributed information caching method and device, computer equipment and storage medium | |
CN110830283B (en) | Fault detection method, device, equipment and system | |
JP2014522052A (en) | Reduce hardware failure | |
CN110896362B (en) | Fault detection method and device | |
CN108833205B (en) | Information processing method, information processing device, electronic equipment and storage medium | |
US20160036654A1 (en) | Cluster system | |
CN112217847A (en) | Micro service platform, implementation method thereof, electronic device and storage medium | |
WO2021213171A1 (en) | Server switching method and apparatus, management node and storage medium | |
CN111181765A (en) | Task processing method and device | |
CN113254245A (en) | Fault detection method and system for storage cluster | |
CN112751689B (en) | Network connectivity detection method, monitoring server and monitoring proxy device | |
CN103634167B (en) | Security configuration check method and system for target hosts in cloud environment | |
CN117492944A (en) | Task scheduling method and device, electronic equipment and readable storage medium | |
US8489721B1 (en) | Method and apparatus for providing high availabilty to service groups within a datacenter | |
CN116932505A (en) | Data query method, data writing method, related device and system | |
CN112860505A (en) | Method and device for regulating and controlling distributed clusters | |
CN111831503A (en) | Monitoring method based on monitoring agent and monitoring agent device | |
CN109510730B (en) | Distributed system, monitoring method and device thereof, electronic equipment and storage medium | |
CN111092754B (en) | Real-time access service system and implementation method thereof | |
CN112463514A (en) | Monitoring method and device for distributed cache cluster | |
CN115150253B (en) | Fault root cause determining method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||