CN116668335A - Cluster service processing method, server and system - Google Patents

Cluster service processing method, server and system

Publication number
CN116668335A
Authority
CN
China
Prior art keywords
node
detection
local
stream
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310598620.9A
Other languages
Chinese (zh)
Inventor
田苗
薛居征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd
Priority to CN202310598620.9A
Publication of CN116668335A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application provides a cluster service processing method, a server and a system. The method is applied to a cluster service system that comprises a first node and a second node, and comprises the following steps: when the heartbeat network of the first node and the heartbeat network of the second node are both normal, obtaining a detection result of local input/output (IO) stream detection; when the detection result of the local IO stream detection is abnormal, obtaining a detection result of opposite-end IO stream detection; and determining a fault cause based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and processing the cluster service according to the fault cause. The method addresses a problem in the prior art: heartbeat network detection cannot detect a RAID card fault inside a node, so the cluster service system cannot switch services from the node whose RAID card has failed to a normal node.

Description

Cluster service processing method, server and system
Technical Field
The present application relates to the field of database technologies, and in particular, to a method, a server, and a system for processing a cluster service.
Background
The rapid development of digitization and informatization has brought great convenience to production and daily life, and has also generated a large amount of data, which in the prior art is stored and analyzed in databases. To ensure service continuity, a database is usually deployed as a high-availability cluster. When one or more nodes in the high-availability cluster are detected to have failed, the cluster can switch the service from the failed node to a normally working node, thereby avoiding service interruption.
In the prior art, node switching in high-availability clusters typically relies on heartbeat network detection. The heartbeat network detects whether a node has failed by monitoring the heartbeat signals of the nodes in the cluster: when the heartbeat signal of a node is not observed within a specified time, the node is determined to have failed, and the service running on it is switched to a normal node. However, if the node itself has not failed but its redundant array of independent disks (RAID) card, which controls data storage, has failed, the IO streams of the disks managed by the RAID card cannot be read or written normally. The database service is then affected, and node switching is also required. Because the node has not failed, its heartbeat signal can still be observed, so the cluster service system does not switch the service from the node with the faulty RAID card to a normal node.
Disclosure of Invention
The embodiments of the application provide a cluster service processing method, a server and a system, which address the problem in the prior art that heartbeat network detection cannot detect a RAID card fault inside a node, so that the cluster service system cannot switch services from the node with the faulty RAID card to a normal node.
In a first aspect, an embodiment of the present application provides a cluster service processing method applied to a cluster service system that includes a first node and a second node. The method comprises the following steps: when the heartbeat network of the first node and the heartbeat network of the second node are both normal, obtaining a detection result of local input/output (IO) stream detection; when the detection result of the local IO stream detection is abnormal, obtaining a detection result of opposite-end IO stream detection; and determining a fault cause based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and processing the cluster service according to the fault cause. The local IO stream detection comprises a first local IO stream detection, performed by the first node on the IO stream between a first RAID card and a first disk in the first node, and a second local IO stream detection, performed by the second node on the IO stream between a second RAID card and a second disk in the second node. The opposite-end IO stream detection comprises a first opposite-end IO stream detection, performed by the first node on the IO stream between the second RAID card and the second disk in the second node, and/or a second opposite-end IO stream detection, performed by the second node on the IO stream between the first RAID card and the first disk in the first node.
In the embodiments of the application, an IO stream detection service is introduced: each node detects its local IO stream and, when the detection result is abnormal, detects the IO stream of the opposite end. An abnormal local IO stream detection result may be caused either by an abnormality of the local IO stream detection service itself or by a soft failure of the local RAID card. Detecting the opposite-end IO stream therefore makes it possible to judge further whether the local RAID card has a soft failure. This solves the problem that, when a disk cannot be read or written normally, the system hangs in a false-death state without the primary and standby nodes of the cluster service system being switched, and it improves the reliability of the cluster service system.
In a specific embodiment, obtaining the detection result of the local IO stream detection includes: the first node and the second node each initiate a first read data instruction to a disk of the local node, where the first read data instruction is used for reading first data in the disk of the local node; and obtaining the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO stream detection being abnormal includes: the first data returned by the disk of the local node is not obtained.
In the above embodiment, the first node and the second node each send a read data instruction to their local disks, and the disks return data based on that instruction. If the data in the local disk can be obtained, the local IO stream detection result is normal; if it cannot be obtained, the local IO stream detection result is abnormal. In this way, each node detects its own local IO stream.
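As a minimal sketch of the read probe described above, the following Python function treats a successful read as a normal result and a failed or empty read as abnormal. The function name `local_io_probe`, the `length` parameter and the use of an ordinary file path in place of a RAID-managed disk are illustrative assumptions, not part of the application.

```python
import os
import tempfile


def local_io_probe(path: str, length: int = 512) -> bool:
    """Return True (normal) if data can be read from `path`, else False (abnormal).

    `path` stands in for a disk behind the local RAID card (assumption).
    """
    try:
        # buffering=0 requests unbuffered binary IO at the Python level.
        with open(path, "rb", buffering=0) as disk:
            first_data = disk.read(length)
    except OSError:
        return False  # the read instruction failed: no first data obtained
    return len(first_data) > 0


if __name__ == "__main__":
    # Demonstrate with a temporary file instead of a real disk device.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"probe-block")
    print(local_io_probe(tmp.name))               # readable file
    print(local_io_probe(tmp.name + ".missing"))  # unreadable path
    os.remove(tmp.name)
```

A real probe would probably also need to bypass the operating system's page cache and to bound the waiting time, since a hung RAID card may block the read rather than fail it; neither is shown here.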
In a specific embodiment, obtaining the first data returned by the disk of the local node based on the first read data instruction includes: within a preset first time period, obtaining the first data returned by the disk of the local node based on the first read data instruction at preset time intervals. The detection result of the local IO stream detection being abnormal further includes: if, within the preset first time period, the total number of times the first data is returned by the disk of the local node at the preset time intervals is smaller than a first count threshold, determining that the detection result of the local IO stream detection is abnormal.
In the above embodiment, each node reads the data in the disk of the local node at intervals within the preset first time period. If the total number of times the first data is obtained within that period is greater than or equal to the first count threshold, the detection result of the local IO stream detection is determined to be normal; otherwise it is abnormal. Judging by the number of successful reads within a preset time period makes the detection result more accurate and reliable.
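The counting rule above can be sketched as follows. This is an illustration under assumptions: the names `detect_by_success_count`, `attempts`, `interval_s` and `first_count_threshold` are hypothetical, and the preset first time period is modeled as `attempts` probes spaced `interval_s` seconds apart.

```python
import time
from typing import Callable


def detect_by_success_count(
    probe: Callable[[], bool],
    attempts: int,
    interval_s: float,
    first_count_threshold: int,
) -> str:
    """Probe `attempts` times, `interval_s` apart, and compare the success count.

    Returns "normal" if the number of successful reads reaches the threshold,
    otherwise "abnormal".
    """
    successes = 0
    for _ in range(attempts):
        if probe():  # True means the first data was returned
            successes += 1
        time.sleep(interval_s)
    return "normal" if successes >= first_count_threshold else "abnormal"


if __name__ == "__main__":
    # Scripted probe results stand in for real disk reads.
    scripted = iter([True, False, True, True])
    print(detect_by_success_count(lambda: next(scripted), 4, 0.0, 3))
```

Requiring several successes within a window, rather than reacting to one failed read, trades detection latency for fewer false alarms on a transient IO hiccup.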
In one embodiment, obtaining the first data returned by the disk of the local node based on the first read data instruction includes: obtaining, within a preset second time period, the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO stream detection being abnormal further includes: if the difference between the time taken to obtain the first data returned by the disk of the local node and a preset second time is greater than a first time threshold, determining that the detection result of the local IO stream detection is abnormal. The preset second time is the maximum time normally required to obtain the first data returned by the disk of the local node.
In the above embodiment, the amount by which each node's acquisition of the first data exceeds the normal maximum time (the timeout) is determined within the preset second time period. If the timeout is greater than the first time threshold, the local IO stream detection result is determined to be abnormal; otherwise it is normal.
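One way to apply the timeout rule without blocking on a hung read is to run the read in a worker thread and wait for a bounded time. The sketch below assumes this approach; the names `detect_by_timeout`, `max_normal_s` (the preset second time) and `first_time_threshold_s` are hypothetical, and the application does not prescribe a threading model.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def detect_by_timeout(
    read_first_data: Callable[[], bytes],
    max_normal_s: float,
    first_time_threshold_s: float,
) -> str:
    """Return "abnormal" if the read overruns `max_normal_s` by more than the threshold."""
    # elapsed - max_normal > threshold  is equivalent to  elapsed > max_normal + threshold
    deadline_s = max_normal_s + first_time_threshold_s
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_first_data)
    try:
        future.result(timeout=deadline_s)
        return "normal"
    except FutureTimeout:
        return "abnormal"  # still waiting after the deadline
    except Exception:
        return "abnormal"  # the read itself failed
    finally:
        pool.shutdown(wait=False)  # do not block the caller on a hung read


if __name__ == "__main__":
    print(detect_by_timeout(lambda: b"data", 0.2, 0.2))
    print(detect_by_timeout(lambda: time.sleep(0.5) or b"late", 0.05, 0.05))
```

The worker thread of a truly hung read is not cancelled by this sketch; it only stops the caller from waiting on it.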
In a specific embodiment, obtaining the first data returned by the disk of the local node based on the first read data instruction includes: obtaining, within a preset first time period, the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO stream detection being abnormal further includes: within the preset first time period, the number of times that the difference between the time taken to obtain the first data returned by the disk of the local node and the preset second time is greater than the first time threshold exceeds a second count threshold. The preset second time is the maximum time normally required to obtain the first data returned by the disk of the local node.
In the above embodiment, the detection result of the local IO stream is determined from the number of timeouts that occur when each node obtains the first data within the first time period, which further improves the reliability and accuracy of the detection.
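The timeout-counting rule can be written as a small pure function over the read latencies measured during the first time period. The names `abnormal_by_timeout_count`, `latencies_s`, `max_normal_s`, `first_time_threshold_s` and `second_count_threshold` are hypothetical labels for the quantities named in the text.

```python
from typing import Sequence


def abnormal_by_timeout_count(
    latencies_s: Sequence[float],
    max_normal_s: float,
    first_time_threshold_s: float,
    second_count_threshold: int,
) -> bool:
    """Return True (abnormal) if too many reads overran the normal maximum time.

    A read counts as a timeout when its latency exceeds `max_normal_s`
    by more than `first_time_threshold_s`.
    """
    timeouts = sum(
        1 for t in latencies_s if t - max_normal_s > first_time_threshold_s
    )
    return timeouts > second_count_threshold


if __name__ == "__main__":
    # Latencies (in seconds) collected during one detection window.
    window = [0.01, 0.5, 0.6, 0.02]
    print(abnormal_by_timeout_count(window, 0.1, 0.1, 1))
```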
In a specific embodiment, the obtaining a detection result of the detection of the opposite-end IO stream includes: the first node or the second node initiates a second read data instruction to a disk of the opposite-end node; the second read data instruction is used for reading second data in a disk of the opposite end node; acquiring second data returned by the disk of the opposite end node based on the second read data instruction; if the second data returned by the disk of the opposite terminal node is obtained, determining that the detection result of the opposite terminal IO stream detection is normal; and if the second data returned by the disk of the opposite terminal node is not obtained, determining that the detection result of the opposite terminal IO stream detection is abnormal.
In the above embodiment, each node sends a read data instruction to a disk of the opposite node, and if the second data returned by the opposite node can be obtained, it is determined that the detection result of the opposite IO stream is normal, otherwise, it is abnormal. By increasing the detection of the opposite-end IO stream, the detection result of the local IO stream can be further verified, and the accuracy and reliability of fault judgment are enhanced.
In a specific embodiment, the determining the failure cause based on the detection result of the opposite-end IO flow detection and the detection result of the local IO flow detection includes: if the detection result of the first local IO stream detection is abnormal and the detection result of the first opposite-end IO stream detection is normal, determining that the RAID state of the first node is abnormal; if the detection result of the second local IO stream detection is abnormal and the detection result of the second opposite-end IO stream detection is normal, determining that the RAID state of the second node is abnormal; if the detection result of the first local IO stream detection is abnormal, the detection result of the first opposite end IO stream detection is abnormal, and the detection result of the second local IO stream detection is normal, determining that the IO stream detection service of the first node is abnormal; and if the detection result of the second local IO stream detection is abnormal, the detection result of the second opposite end IO stream detection is abnormal, and the detection result of the first local IO stream detection is normal, determining that the IO stream detection service of the second node is abnormal.
In the above embodiment, when the local IO stream detection result is abnormal, the detection result of the opposite-end IO stream is obtained and combined with the local IO stream detection result to determine the fault cause, so that faults of the cluster service system can be determined accurately.
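The four determination rules above form a small decision table, sketched here in Python. The `Cause` enum, the `diagnose` function and the node labels "first" and "second" are illustrative names; `True` stands for a normal detection result, `False` for an abnormal one, and `None` for a result that was not collected.

```python
from enum import Enum
from typing import Dict, Optional


class Cause(Enum):
    RAID_ABNORMAL = "RAID state abnormal"
    IO_SERVICE_ABNORMAL = "IO stream detection service abnormal"


def diagnose(
    first_local_ok: bool,
    second_local_ok: bool,
    first_peer_ok: Optional[bool] = None,
    second_peer_ok: Optional[bool] = None,
) -> Dict[str, Cause]:
    """Map each node with a determinable fault to its cause."""
    causes: Dict[str, Cause] = {}
    if not first_local_ok:
        if first_peer_ok is True:
            # The first node can read the peer's disk but not its own.
            causes["first"] = Cause.RAID_ABNORMAL
        elif first_peer_ok is False and second_local_ok:
            # All probes from the first node fail, yet the peer's disk is healthy.
            causes["first"] = Cause.IO_SERVICE_ABNORMAL
    if not second_local_ok:
        if second_peer_ok is True:
            causes["second"] = Cause.RAID_ABNORMAL
        elif second_peer_ok is False and first_local_ok:
            causes["second"] = Cause.IO_SERVICE_ABNORMAL
    return causes


if __name__ == "__main__":
    print(diagnose(False, True, first_peer_ok=True))
    print(diagnose(False, True, first_peer_ok=False))
```

Combinations that the rules do not cover, such as all four results being abnormal, yield an empty mapping in this sketch.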
In one embodiment, the method further comprises: when the heartbeat detection result of the first node is abnormal and the heartbeat detection result of the second node is normal, determining that the heartbeat network of the first node is abnormal; when the heartbeat detection result of the first node is normal and the heartbeat detection result of the second node is abnormal, determining that the heartbeat network of the second node is abnormal; and when the heartbeat detection results of the first node and the second node are abnormal, determining that the heartbeat networks of the first node and the second node are abnormal.
In the above embodiment, besides the false-death condition of a node, other faults may occur in the cluster service system. Detecting the heartbeat network of each node and judging from the detection result whether it is normal makes fault determination for the cluster service system more accurate.
In a specific embodiment, processing the cluster service according to the fault cause includes the following, where the cluster service system is in a hot standby scenario, the first node is the primary node and the second node is the standby node: when the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the cluster service from the first node to the second node and raising an alarm for the RAID card soft fault of the first node; when the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, raising an alarm for the RAID card soft fault of the second node; when the RAID card states of both the first node and the second node are abnormal, raising an alarm for the RAID card soft faults of the first node and the second node; and when the IO stream detection service of the first node and/or the second node is abnormal, raising an alarm for the IO stream detection fault of the first node and/or the second node.
In the above embodiment, for a system in the hot standby scenario, the cluster service system is handled according to each fault cause, which improves the operational reliability of the system.
In a specific embodiment, processing the cluster service according to the fault cause includes the following, where the cluster service system is in a dual-active scenario, the first node and the second node are standby nodes for each other, a first cluster service runs on the first node and a second cluster service runs on the second node: when the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the first cluster service from the first node to the second node and raising an alarm for the RAID card soft fault of the first node; when the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switching the second cluster service from the second node to the first node and raising an alarm for the RAID card soft fault of the second node; when the RAID card states of both the first node and the second node are abnormal, raising an alarm for the RAID card soft faults of the first node and the second node; and when the IO stream detection service of the first node and/or the second node is abnormal, raising an alarm for the IO stream detection fault of the first node and/or the second node.
In the above embodiment, for a system in the dual-active scenario, the cluster service system is handled according to each fault cause, which improves the operational reliability of the system.
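The handling rules for the hot standby and dual-active scenarios differ only in whether a service is switched away from the second node. The sketch below makes that explicit; the function `plan_actions`, the mode strings and the action strings are hypothetical and only mirror the rules stated above.

```python
from typing import Iterable, List


def plan_actions(
    mode: str,
    first_raid_ok: bool,
    second_raid_ok: bool,
    io_service_faults: Iterable[str] = (),
) -> List[str]:
    """List the switch and alarm actions for one diagnosis result.

    `mode` is "hot_standby" (first node primary) or "dual_active".
    `io_service_faults` names nodes whose IO stream detection service is abnormal.
    """
    if mode not in ("hot_standby", "dual_active"):
        raise ValueError(f"unknown mode: {mode}")
    actions: List[str] = []
    if not first_raid_ok and second_raid_ok:
        actions.append("switch service: first -> second")
        actions.append("alarm: RAID card soft fault on first")
    elif first_raid_ok and not second_raid_ok:
        if mode == "dual_active":
            # Only in dual-active does the second node run a service of its own.
            actions.append("switch service: second -> first")
        actions.append("alarm: RAID card soft fault on second")
    elif not first_raid_ok and not second_raid_ok:
        # No healthy target exists, so only alarms are raised.
        actions.append("alarm: RAID card soft fault on first and second")
    for node in sorted(io_service_faults):
        actions.append(f"alarm: IO stream detection fault on {node}")
    return actions


if __name__ == "__main__":
    print(plan_actions("hot_standby", False, True))
    print(plan_actions("dual_active", True, False))
```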
In a second aspect, an embodiment of the present application provides a server, including: a processor, a memory, a communication interface; the memory is used for storing executable instructions of the processor; wherein the processor is configured to perform the cluster service processing method of the first aspect via execution of the executable instructions.
In a third aspect, an embodiment of the present application provides a cluster service system, including: at least one first node and at least one second node, wherein the first node is a primary node and the second node is a standby node; the first node executes the cluster service processing method of the first aspect.
The embodiment of the application provides a cluster service processing method, a server and a system, wherein the method is applied to a cluster service system, and the cluster service system comprises a first node and a second node; the method comprises the following steps: under the condition that the heartbeat network of the first node and the heartbeat network of the second node are normal, acquiring a detection result of local input/output (IO) stream detection; under the condition that the detection result of the local IO stream detection is abnormal, obtaining the detection result of the opposite-end IO stream detection; determining a fault reason based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and processing the cluster service according to the fault reason; the local input/output IO stream detection is that the first node detects a first local IO stream of an IO stream between a first RAID card and a first disk in the first node, and the second node detects a second local IO stream of an IO stream between a second RAID card and a second disk in the second node; the opposite-end IO stream detection is a first opposite-end IO stream detection of the first node to the IO stream between the second RAID card and the second disk in the second node, and/or a second opposite-end IO stream detection of the second node to the IO stream between the first RAID card and the first disk in the first node. 
In the prior art, services are switched from a failed node to a normal node by relying on heartbeat network detection alone. In contrast, the present application determines the fault cause of the service cluster from the local IO stream detection and opposite-end IO stream detection of the first node and the second node, combined with the heartbeat detection results of the two nodes, and processes the cluster service according to that fault cause. When a service must be switched because of a RAID card soft fault, the service cluster can therefore switch it from the faulty node to a normal node in time. This solves the prior-art problem that heartbeat network detection cannot detect a RAID card fault inside a node, so that the cluster service system cannot switch services from the node with the faulty RAID card to a normal node.
Drawings
To illustrate the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of a cluster service system;
Fig. 2 is a schematic flow chart of a first embodiment of a cluster service processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of IO stream detection performed by a cluster service system;
Fig. 4 is a schematic flow chart of a second embodiment of a cluster service processing method according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of a third embodiment of a cluster service processing method according to an embodiment of the present application;
Fig. 6 is a schematic flow chart of a fourth embodiment of a cluster service processing method according to an embodiment of the present application;
Fig. 7 is a schematic flow chart of a fifth embodiment of a cluster service processing method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a server embodiment according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of another embodiment of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which are made by a person skilled in the art based on the embodiments of the application in light of the present disclosure, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, the terms involved in the present application will be explained:
RAID card: a redundant array of independent disks (RAID) combines multiple independent disks into one large-capacity disk group. A RAID card manages the disks that make up the disk array. To implement the database service, the operating system reads and writes data on the disks managed by the RAID card through the RAID card.
Fig. 1 is a schematic structural diagram of a cluster service system. The cluster service system comprises a first node 11 and a second node 12, where the first node 11 is the primary node and the second node 12 is the standby node. Heartbeat network detection is performed between the first node 11 and the second node 12 over a heartbeat network link 13. The heartbeat network detects whether a node in the cluster has failed mainly by monitoring the node's heartbeat signal. Specifically, the first node 11 and the second node 12 send heartbeat messages to each other at a fixed frequency over the heartbeat network link 13 and receive the heartbeat messages of the opposite-end node. If the second node 12 does not receive a heartbeat message from the first node 11 within a specified time, the first node 11 is determined to have failed, and the service running on the first node 11 is switched to the second node 12.
As shown in Fig. 1, the first node 11 and the second node 12 each include a processor running an operating system, a RAID card, and a plurality of disks managed by the RAID card. To implement the database service, the operating system reads and writes data on those disks through the RAID card. If the RAID card that controls data reading and writing in a node fails, for example because the RAID card's program runs incorrectly, the IO streams of the disks managed by the RAID card cannot be read or written normally. The database service is then affected, and node switching is also required. Because the node itself has not failed, its heartbeat signal can still be observed, so the cluster service system does not switch the service from the node with the faulty RAID card to a normal node.
Based on the above technical problem, the technical idea of the application is: how to detect a RAID card failure in a node so that the cluster service can be switched from the node with the failed RAID card to a normal node.
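The heartbeat rule described above (a node is considered failed when no heartbeat message arrives within a specified time) can be sketched as a small bookkeeping class. `HeartbeatMonitor`, `timeout_s` and the explicit `now` timestamps are illustrative assumptions; a real deployment would receive the messages over the heartbeat network link.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Track the last heartbeat time per node and report overdue nodes."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat message from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=3.0)
    monitor.record("first", now=0.0)
    monitor.record("second", now=0.0)
    monitor.record("second", now=2.0)  # only the second node keeps reporting
    print(monitor.failed_nodes(now=4.0))
```

The sketch also shows the limitation discussed in the text: a node whose RAID card has failed keeps sending heartbeats, so it never appears in `failed_nodes`.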
The technical scheme of the application is described in detail through specific embodiments. It should be noted that the following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a schematic flow chart of a first embodiment of a cluster service processing method according to an embodiment of the present application. The method is applied to a cluster service system, and the cluster service system comprises a first node and a second node. Referring to fig. 2, the cluster service processing method specifically includes the following steps:
step S201: and under the condition that the heartbeat network of the first node and the heartbeat network of the second node are normal, acquiring a detection result of the local input/output IO stream detection.
In one example, the cluster service system is in a hot standby scenario, and includes a first node and a second node, where the first node is a primary node and the second node is a standby node. The database service is run on a first node, and when the first node fails, the database service is switched from the first node to run on a second node. The cluster service system may further comprise a management node. The first node and the second node plug-in the storage device. Fig. 3 is a schematic diagram of IO flow detection performed by the trunking service system.
The local IO stream detection is a first local IO stream detection of the first node on the IO stream between the first RAID card and the first disk in the first node, and/or a second local IO stream detection of the second node on the IO stream between the second RAID card and the second disk in the second node. As shown in fig. 3, the first local IO flow detection is the detection of the IO flow between the first RAID card and the first disk in the first node by the first node; the second local IO stream detection is a detection of IO streams between a second RAID card and a second disk in the second node by the second node. The local IO stream detection is mainly used for detecting whether the local system disk has the problem of incapability of normal reading and writing.
In the database service cluster, the first node is the primary node and the second node is the standby node. The first node and the second node each include a RAID card and a plurality of disks managed by the RAID card, where a disk may be a system disk. To provide the database service, the operating system reads and writes data on the disks through the RAID card that manages them, and an IO stream is formed between the RAID card and the disk during reading and writing. Therefore, whether a soft fault such as a program running error has occurred in the RAID card can be determined by detecting the IO stream between the RAID card and the disk: when the detection result of that IO stream is abnormal, it can be determined that the RAID card has a soft fault.
In this embodiment, when the heartbeat network of the first node and the heartbeat network of the second node are normal, a detection result of local input/output (IO) stream detection is obtained. Specifically, the first node detects the IO stream between the first RAID card and the first disk in the first node to obtain the detection result of the first local IO stream detection, and the second node detects the IO stream between the second RAID card and the second disk in the second node to obtain the detection result of the second local IO stream detection. As the primary node, the first node obtains the detection results of the first local IO stream detection and the second local IO stream detection; the management node may also obtain them.
Step S202: When the detection result of the local IO stream detection is abnormal, obtain a detection result of opposite-end IO stream detection.
The opposite-end IO stream detection is a first opposite-end IO stream detection, performed by the first node on the IO stream between the second RAID card and the second disk in the second node, and/or a second opposite-end IO stream detection, performed by the second node on the IO stream between the first RAID card and the first disk in the first node, as shown in fig. 3. The opposite-end IO stream detection is used to cross-check the results, ensuring their consistency and reliability, and to check whether the IO stream detection service itself is abnormal.
In this embodiment, when the detection result of the local IO stream detection is abnormal, the detection result of the opposite-end IO stream detection is obtained. Specifically, the first node detects the IO stream between the second RAID card and the second disk in the second node to obtain the detection result of the first opposite-end IO stream detection, and the second node detects the IO stream between the first RAID card and the first disk in the first node to obtain the detection result of the second opposite-end IO stream detection. As the primary node, the first node obtains the detection results of the first opposite-end IO stream detection and the second opposite-end IO stream detection; the management node may also obtain them.
In this embodiment, the first node and the second node also perform heartbeat detection: they send heartbeat messages to each other at a fixed frequency and receive the heartbeat messages of the opposite node to obtain heartbeat detection results. As the primary node, the first node obtains the heartbeat detection results of the first node and the second node; the management node may also obtain them. Steps S201 to S202 are performed when the heartbeat network of the first node and the heartbeat network of the second node are normal.
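The staged order of steps S201 and S202 can be illustrated with a minimal Python sketch. It assumes each probe is supplied as a callable returning True for a normal result; the function name `staged_detection` and the use of None for "probe not run" are hypothetical conventions for this illustration, not part of the described method.

```python
from typing import Callable, Optional, Tuple


def staged_detection(
    heartbeat_ok: bool,
    local_probe: Callable[[], bool],
    peer_probe: Callable[[], bool],
) -> Tuple[Optional[bool], Optional[bool]]:
    """Return (local_ok, peer_ok); None means that probe was not run."""
    if not heartbeat_ok:
        # S201 and S202 are only performed while the heartbeat networks are normal.
        return None, None
    local_ok = local_probe()  # S201: local IO stream detection
    if local_ok:
        # Local result is normal, so the opposite-end detection is not triggered.
        return True, None
    peer_ok = peer_probe()  # S202: only after an abnormal local result
    return False, peer_ok
```

The opposite-end probe is called lazily, so it adds no load on the opposite node while the local result is normal.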
Step S203: Determine a fault cause based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and process the cluster service according to the fault cause.
In this embodiment, the first node or the management node determines the fault cause based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection.
In an exemplary embodiment, assume that the heartbeat detection results of the first node and the second node are normal. If the detection result of the first local IO stream detection is abnormal and the detection result of the first opposite-end IO stream detection is normal, it can be preliminarily determined that the RAID card state of the first node is abnormal. If, in addition, the detection result of the second local IO stream detection is normal, which indicates that the IO stream detection service of the second node is normal, and the detection result of the second opposite-end IO stream detection is abnormal, it can be further determined that the RAID card state of the first node is abnormal and that the RAID card state of the second node is normal.
Because the first node is the primary node and the second node is the standby node, and the RAID card state of the first node is abnormal while that of the second node is normal, the cluster service needs to be switched: the cluster service running on the first node is switched to run on the second node.
In one example, a cluster service system is in a dual-active scenario, including a first node and a second node, the first node and the second node being standby nodes with respect to each other. The first node runs database service A and the second node runs database service B. When the first node fails, the database service A is switched from the first node to the second node for operation; when the second node fails, the database service B is switched from the second node to operate on the first node. The cluster service system may further comprise a management node.
Under the condition that the heartbeat network of the first node and the heartbeat network of the second node are normal, the first node, the second node or the management node can acquire the detection result of the local input/output IO stream detection. And under the condition that the detection result of the local IO stream detection is abnormal, obtaining the detection result of the opposite-end IO stream detection. The first node, the second node or the management node determines a fault reason based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and processes the cluster service according to the fault reason.
In an example, when the database service is started, the detection results of the first local IO stream detection, the first opposite-end IO stream detection, the second local IO stream detection and the second opposite-end IO stream detection, together with the heartbeat detection results of the first node and the second node, may be obtained. The RAID card states of the first node and the second node are then determined from these results in order to decide whether to switch the cluster service.
In an example, after the database service is started, the same detection results and heartbeat detection results may be obtained again at preset time intervals, and the RAID card states of the first node and the second node are determined from them in order to decide whether to switch the cluster service.
In an example, the detection results of the first local IO stream detection, the first opposite-end IO stream detection, the second local IO stream detection and the second opposite-end IO stream detection, together with the heartbeat detection results of the first node and the second node, may be obtained multiple times within a preset time. Each time, the RAID card states of the first node and the second node are determined; the number of times each RAID card state is abnormal is then counted, and whether to switch the cluster service is decided according to these counts. For example, where the first node is the primary node and the second node is the standby node, the cluster service is switched when the number of times the RAID card state of the first node is abnormal exceeds a count threshold and the number of times the RAID card state of the second node is abnormal is 0, that is, when the RAID card of the second node is in a normal state in every detection.
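The count-based switching rule of this example can be sketched as follows. This is a minimal illustration that assumes each detection round has already been reduced to a pair of booleans (primary RAID card abnormal, standby RAID card abnormal); the names `should_switch` and `count_threshold` are hypothetical.

```python
from typing import Iterable, Tuple


def should_switch(rounds: Iterable[Tuple[bool, bool]], count_threshold: int) -> bool:
    """Decide whether to switch the cluster service to the standby node.

    Each round is (primary_raid_abnormal, standby_raid_abnormal).
    """
    primary_abnormal = 0
    standby_abnormal = 0
    for primary_bad, standby_bad in rounds:
        primary_abnormal += int(primary_bad)
        standby_abnormal += int(standby_bad)
    # Switch only if the primary exceeded the threshold and the standby
    # was normal in every round.
    return primary_abnormal > count_threshold and standby_abnormal == 0
```

For example, with three rounds in which only the primary is abnormal, `should_switch(rounds, 2)` is True, while a single abnormal standby round blocks the switch.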
In this embodiment, the cluster service system includes a first node and a second node. When the heartbeat network of the first node and the heartbeat network of the second node are normal, a detection result of local input/output (IO) stream detection is obtained; when the detection result of the local IO stream detection is abnormal, a detection result of opposite-end IO stream detection is obtained; a fault cause is determined based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and the cluster service is processed according to the fault cause. The local IO stream detection is a first local IO stream detection of the first node on the IO stream between the first RAID card and the first disk in the first node, and/or a second local IO stream detection of the second node on the IO stream between the second RAID card and the second disk in the second node; the opposite-end IO stream detection is a first opposite-end IO stream detection of the first node on the IO stream between the second RAID card and the second disk in the second node, and/or a second opposite-end IO stream detection of the second node on the IO stream between the first RAID card and the first disk in the first node.
In the prior art, a service is switched from a faulty node to a normal node only by means of heartbeat network detection. In contrast, in the embodiments of the present application, the local IO stream detection and the opposite-end IO stream detection of the first node and the second node are combined with the heartbeat detection results of the first node and the second node to determine the fault cause of the service cluster, and the cluster service is processed according to the fault cause. The service cluster can therefore switch the service from the faulty node to the normal node in time when a switch is required because of a RAID card soft fault. This solves the prior-art problem that, because heartbeat network detection cannot detect a soft fault of the RAID card in a node, the cluster service system cannot switch the service from the faulty node to the normal node.
Fig. 4 is a schematic flow chart of a second embodiment of the cluster service processing method according to an embodiment of the present application. Based on the embodiment shown in fig. 2, step S201 may include: the first node and the second node each initiate a first data reading instruction to a disk of the local node, where the first data reading instruction is used to read first data in the disk of the local node; and the first data returned by the disk of the local node based on the first data reading instruction is acquired. The detection result of the local IO stream detection being abnormal includes: the first data returned by the disk of the local node is not acquired.
Specifically, step S201 may include the steps of:
Step S401: The first node initiates a first data reading instruction to the first disk in the first node; the first data reading instruction is used to read first data in the first disk.
Step S402: the first node obtains first data returned by the first disk based on the first data reading instruction; if the first data returned by the first disk is obtained, determining that the detection result of the first local IO stream detection is normal; if the first data returned by the first disk is not obtained, determining that the detection result of the first local IO stream detection is abnormal.
In this embodiment, the RAID card manages a plurality of disks. To provide the database service, the operating system reads and writes data on the disks through the RAID card that manages them, and an IO stream is formed between the RAID card and the disk during reading and writing. Whether a soft fault has occurred in the RAID card can therefore be determined by detecting the IO stream between the RAID card and the disk. When the RAID card has a soft fault such as a program running error, the disk does not return data in response to a read instruction, so the IO stream between the RAID card and the system disk can be detected by issuing a data reading instruction to the disk, which in turn determines whether the RAID card has a soft fault. For example, an underlying IO test tool, such as the disk stress test tool FIO (Flexible Input Output tester) or the IO test software IOmeter (Input Output meter), may be invoked to read the disk.
Specifically, when the first node detects the first local IO stream, a first data reading instruction is initiated to a first disk in the first node, wherein the first data reading instruction is used for reading first data in the first disk managed by the first RAID card, and if the RAID card in the first node is in a normal state, the first disk returns the first data to an operating system of the first node based on the first data reading instruction; if the RAID card in the first node has a soft failure, the first disk will not return data.
The first node acquires first data returned by the first disk based on the first data reading instruction, and if the first data returned by the first disk is successfully acquired, the detection result of the first local IO stream detection is determined to be normal; if the first data returned by the first disk is not obtained, determining that the detection result of the first local IO stream detection is abnormal.
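Steps S401 to S402 can be illustrated with a minimal read probe in Python. This sketch does not call FIO or IOmeter; it assumes instead that a plain file read with a deadline stands in for the first data reading instruction. The function name `probe_read`, the default sizes and the timeout are hypothetical, and a production probe would likely need to bypass the operating system page cache so that the read actually reaches the RAID card.

```python
import threading


def probe_read(path: str, num_bytes: int = 4096, timeout_s: float = 5.0) -> bool:
    """Return True (normal) if data is read from `path` before the deadline.

    Returns False (abnormal) if the read fails, returns no data,
    or does not complete within `timeout_s` seconds.
    """
    result = {}

    def _read() -> None:
        try:
            with open(path, "rb") as handle:
                result["data"] = handle.read(num_bytes)
        except OSError:
            result["error"] = True

    # A daemon thread is used so that a hung read cannot block the caller
    # or keep the process alive after the deadline has passed.
    worker = threading.Thread(target=_read, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return "data" in result and len(result["data"]) > 0
```

The deadline matters because, per the description above, a RAID card soft fault shows up as a disk that never returns data rather than as an explicit error.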
Based on the embodiment shown in fig. 2, step S202 may include: the first node or the second node initiates a second read data instruction to a disk in the opposite-end node, where the second read data instruction is used to read second data in the disk in the opposite-end node; and the second data returned by the disk in the opposite-end node based on the second read data instruction is acquired. If the second data returned by the disk in the opposite-end node is acquired, the detection result of the opposite-end IO stream detection is determined to be normal; if the second data returned by the disk in the opposite-end node is not acquired, the detection result of the opposite-end IO stream detection is determined to be abnormal.
Specifically, step S202 may include the steps of:
Step S403: The first node initiates a second read data instruction to the second disk in the second node; the second read data instruction is used to read second data in the second disk.
Step S404: the first node obtains second data returned by the second disk based on the second read data instruction; if the second data returned by the second disk is obtained, determining that the detection result of the first opposite-end IO stream detection is normal; if the second data returned by the second disk is not obtained, determining that the detection result of the first opposite-end IO stream detection is abnormal.
Note that the execution order of steps S402 to S404 is not particularly limited here.
In this embodiment, the first node also performs IO stream detection on the opposite-end second node, that is, the first opposite-end IO stream detection. Specifically, a second read data instruction is initiated to the second disk in the second node, the second read data instruction being used to read second data in the second disk. If the RAID card in the second node is in a normal state, the second disk returns the second data to the operating system of the first node based on the second read data instruction; if the RAID card in the second node has a soft fault, the second disk does not return data.
The first node acquires second data returned by the second disk based on the second read data instruction; if the second data returned by the second disk is successfully obtained, determining that the detection result of the first opposite-end IO stream detection is normal; if the second data returned by the second disk is not obtained, determining that the detection result of the first opposite-end IO stream detection is abnormal.
In this embodiment, the attribute parameters of the IO stream detection include the interval time, the timeout count, and the like. For example, the first node may perform the local IO stream detection and the opposite-end IO stream detection at a preset interval time. Taking the first local IO stream detection as an example, the first node attempts to acquire the first data returned by the first disk at each time interval point within a preset first time period. If the first data returned by the first disk is acquired at every time interval point within the preset first time period, the detection result of the first local IO stream detection is determined to be normal; alternatively, if the number of time interval points at which the first data is acquired is greater than or equal to a first count threshold, the detection result is determined to be normal. If the first node cannot acquire the first data returned by the first disk within the preset first time period, the detection result of the first local IO stream detection is determined to be abnormal; alternatively, if the number of time interval points at which the first data is acquired is smaller than the first count threshold, the detection result is determined to be abnormal. The first count threshold is the minimum number of times that the first node should be able to acquire the first data returned by the first disk within the preset first time period. The interval time may be 1 s, 5 s, 1 min, 1 h, etc., and is not particularly limited herein.
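The two evaluation rules over a preset time period can be sketched as below. This is an illustration only: it assumes the outcome at each time interval point is already recorded as a boolean, and the name `window_is_normal` and the choice to treat an empty window as abnormal are assumptions of the sketch.

```python
from typing import Optional, Sequence


def window_is_normal(successes: Sequence[bool], min_successes: Optional[int] = None) -> bool:
    """Evaluate one detection window of per-interval read outcomes.

    If `min_successes` is None, every interval point must have succeeded.
    Otherwise the number of successful points must reach `min_successes`.
    """
    if not successes:
        # Assumption: no data acquired in the window counts as abnormal.
        return False
    if min_successes is None:
        return all(successes)
    return sum(1 for ok in successes if ok) >= min_successes
```

The threshold form tolerates an occasional missed read, which may reduce false alarms compared with requiring success at every point.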
In some implementations, if the difference between the time at which the first node acquires the first data returned by the first disk and a preset first time is smaller than a first time threshold, the detection result of the first local IO stream detection is determined to be normal; if the difference between the time at which the first node acquires the first data returned by the first disk and a preset second time is greater than the first time threshold, the detection result of the first local IO stream detection is determined to be abnormal, where the second time is the preset maximum time within which the first node can normally acquire the first data returned by the first disk.
In some implementations, within the preset first time period, when the timeout count of the first node acquiring the first data returned by the first disk is greater than a second count threshold, the detection result of the first local IO stream detection is determined to be abnormal; when the timeout count is smaller than or equal to the second count threshold, the detection result of the first local IO stream detection is determined to be normal. The timeout count is the number of times that the difference between the time at which the first node acquires the first data returned by the first disk and the preset second time is greater than the first time threshold.
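The timeout-count rule can be sketched as follows. The sketch assumes that the "acquisition time" is measured as an elapsed duration in seconds per read and that a read which never returns is recorded as infinity; the function and parameter names are hypothetical.

```python
from typing import Sequence


def count_timeouts(latencies_s: Sequence[float], reference_s: float, time_threshold_s: float) -> int:
    """Count reads whose latency exceeds the reference time by more than the threshold."""
    return sum(1 for latency in latencies_s if latency - reference_s > time_threshold_s)


def timeouts_indicate_abnormal(
    latencies_s: Sequence[float],
    reference_s: float,
    time_threshold_s: float,
    count_threshold: int,
) -> bool:
    """Abnormal when the timeout count is greater than the count threshold."""
    return count_timeouts(latencies_s, reference_s, time_threshold_s) > count_threshold
```

Here `reference_s` plays the role of the preset second time, `time_threshold_s` the first time threshold, and `count_threshold` the second count threshold.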
Similarly, the second node performs the second local IO stream detection and the second peer IO stream detection, which may be performed with reference to steps S401 to S404.
In this embodiment, the first node performs the first local IO stream detection and the first opposite-end IO stream detection by initiating data reading instructions to the first disk in the first node and to the second disk in the second node, respectively, and reading data from them, so as to detect the running states of the RAID card of the local node and the RAID card of the opposite-end node. The IO streams of the local node and the opposite-end node can thus be detected without affecting the database service, to determine whether a RAID card has a soft fault. This provides the precondition for switching the service from the faulty node to the normal node in time when the service cluster needs a switch because of a RAID card soft fault.
Fig. 5 is a schematic flow chart of a third embodiment of a cluster service processing method according to an embodiment of the present application, and based on the embodiments shown in fig. 2 to fig. 4, the step S203 specifically includes the following steps:
Step S501: Determine the RAID card states of the first node and the second node according to the detection results of the first local IO stream detection, the first opposite-end IO stream detection, the second local IO stream detection and the second opposite-end IO stream detection, and the heartbeat detection results of the first node and the second node.
One of the first node and the second node is the primary node, and the other is the standby node.
In this embodiment, the RAID card states of the first node and the second node are determined according to the detection results of the first local IO stream detection and the first opposite-end IO stream detection, the detection results of the second local IO stream detection and the second opposite-end IO stream detection, and the heartbeat detection results of the first node and the second node.
The local IO stream detection is mainly used for detecting whether a local system disk has the problem of incapability of normal reading and writing; the opposite-end IO stream detection is used for checking the local IO stream detection result of the opposite-end node, ensuring the consistency and reliability of the result and checking whether the IO stream detection service has abnormality or not.
Specifically, when the heartbeat detection results of the first node and the second node are normal, the detection result of the first local IO stream detection is abnormal, and the detection result of the first opposite-end IO stream detection is normal, the RAID card state of the first node is determined to be abnormal. Normal heartbeat detection results indicate that the network between the first node and the second node is normal. An abnormal first local IO stream detection result indicates that the local disk of the first node may be unable to read and write normally, or that the IO stream detection service of the first node may be abnormal. To distinguish the two cases, the detection result of the first opposite-end IO stream detection is checked: if it is normal, an abnormality of the IO stream detection service of the first node can be ruled out, so the RAID card state of the first node can be determined to be abnormal.
When the heartbeat detection results of the first node and the second node are normal and the detection result of the first local IO stream detection is normal, it can be preliminarily determined that the RAID card of the first node and the IO stream detection service of the first node are normal. To verify this conclusion, the detection result of the second opposite-end IO stream detection is checked: if it is normal, the RAID card state of the first node is determined to be normal.
Similarly, when the heartbeat detection results of the first node and the second node are normal, the detection result of the first local IO stream detection is normal, and the detection result of the second opposite-end IO stream detection is normal, an abnormal detection result of the first opposite-end IO stream detection gives a preliminary indication that the RAID card of the second node is abnormal; if the detection result of the second local IO stream detection is also abnormal, the RAID card state of the second node is further determined to be abnormal. The reasoning is as follows. Normal heartbeat detection results indicate that the network between the first node and the second node is normal. Normal results of the first local IO stream detection and the second opposite-end IO stream detection indicate that the IO stream detection services of the first node and the second node, respectively, are normal. In that case, an abnormal second local IO stream detection result gives a preliminary indication that the local disk of the second node cannot be read and written normally. This is verified with the first opposite-end IO stream detection: if that result is also abnormal, the RAID card state of the second node is determined to be abnormal.
When the detection result of the second local IO stream detection is normal, it can be preliminarily determined that the local disk of the second node can be read and written normally. To verify this conclusion, the detection result of the first opposite-end IO stream detection is checked: if it is normal, the RAID card state of the second node is determined to be normal.
Specifically, when the heartbeat detection results of the first node and the second node are normal, the detection result of the first local IO stream detection is abnormal, the detection result of the first opposite-end IO stream detection is abnormal, the detection result of the second local IO stream detection is normal, and the detection result of the second opposite-end IO stream detection is normal, the IO stream detection service of the first node is determined to be abnormal. Normal heartbeat detection results indicate that the network between the first node and the second node is normal. An abnormal first local IO stream detection result indicates that the local disk of the first node may be unable to read and write normally, or that the IO stream detection service of the first node may be abnormal. However, the normal result of the second opposite-end IO stream detection indicates that the local system disk of the first node can be read and written normally, so an abnormal RAID card state of the first node can be ruled out, and the abnormal first local IO stream detection result can be attributed to an abnormality of the IO stream detection service of the first node. The normal second local IO stream detection result indicates that the local disk of the second node can be read and written normally, so the abnormal first opposite-end IO stream detection result further confirms that the IO stream detection service of the first node is abnormal.
Similarly, when the heartbeat detection results of the first node and the second node are normal, the detection result of the first local IO stream detection is normal, the detection result of the first opposite end IO stream detection is normal, the detection result of the second local IO stream detection is abnormal, and the detection result of the second opposite end IO stream detection is abnormal, the IO stream detection service of the second node is determined to be abnormal.
In some implementations, the network conditions of the first node and the second node may also affect the detection result of the peer IO flow detection. When the heartbeat detection result of the first node and/or the second node is abnormal, the detection result of the opposite-end IO stream detection is also abnormal.
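The combinations of detection results described above can be collected into a small lookup table. The Python sketch below covers only the combinations stated in the text; any other combination is returned as "undetermined", which is an assumption of this illustration rather than something the description specifies. The names `PATTERNS` and `diagnose` are hypothetical.

```python
from typing import Dict, Tuple

N, A = True, False  # True = detection result normal, False = abnormal

# Key order: (first local, first opposite-end, second local, second opposite-end)
PATTERNS: Dict[Tuple[bool, bool, bool, bool], str] = {
    (N, N, N, N): "both RAID cards normal",
    (A, N, N, A): "first node RAID card abnormal",
    (N, A, A, N): "second node RAID card abnormal",
    (A, A, N, N): "first node IO stream detection service abnormal",
    (N, N, A, A): "second node IO stream detection service abnormal",
}


def diagnose(heartbeat_ok: bool, l1: bool, p1: bool, l2: bool, p2: bool) -> str:
    """Map the four IO stream detection results to a fault cause."""
    if not heartbeat_ok:
        # With an abnormal heartbeat the opposite-end results are also abnormal,
        # so they cannot be used to separate the causes.
        return "undetermined: heartbeat abnormal"
    return PATTERNS.get((l1, p1, l2, p2), "undetermined")
```

For instance, `diagnose(True, False, True, True, False)` returns "first node RAID card abnormal", matching the exemplary embodiment in which the first node's own read fails and the second node's read of the first disk fails as well.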
Step S502: When the RAID card state of the primary node is abnormal and the RAID card state of the standby node is normal, switch the cluster service from the primary node to the standby node, and raise an alarm for the RAID card soft fault of the primary node.
Step S503: When the cluster service system is in a hot standby scenario and it is determined that the RAID card state of the standby node is abnormal and the RAID card state of the primary node is normal, raise an alarm for the RAID card soft fault of the standby node;
when the cluster service system is in a dual-active scenario and it is determined that the RAID card state of the standby node is abnormal and the RAID card state of the primary node is normal, switch the cluster service from the standby node to the primary node, and raise an alarm for the RAID card soft fault of the standby node.
Step S504: When the RAID card state of the primary node is abnormal and the RAID card state of the standby node is abnormal, raise an alarm for the RAID card soft faults of the primary node and the standby node.
Step S505: When it is determined that the IO stream detection of the primary node and/or the standby node is abnormal, raise an alarm for the IO stream detection fault of the primary node and/or the standby node.
Specifically, an abnormal IO stream detection indicates that the IO test tool, such as FIO or IOmeter, has a fault; an alarm needs to be raised for the IO stream detection fault so that an administrator can handle it.
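Steps S502 to S505 can be summarized as a small decision function. This is a minimal sketch that assumes the RAID card states have already been determined in step S501; the scenario strings, the action strings and the name `plan_actions` are hypothetical labels chosen for illustration.

```python
from typing import List, Sequence


def plan_actions(
    scenario: str,
    primary_raid_ok: bool,
    standby_raid_ok: bool,
    detection_fault_nodes: Sequence[str] = (),
) -> List[str]:
    """Return the ordered list of actions for one evaluation round."""
    if scenario not in ("hot_standby", "dual_active"):
        raise ValueError(f"unknown scenario: {scenario}")
    actions: List[str] = []
    if not primary_raid_ok and standby_raid_ok:
        # S502: fail over to the standby node and raise an alarm.
        actions += ["switch primary->standby", "alarm: primary RAID card soft fault"]
    elif primary_raid_ok and not standby_raid_ok:
        # S503: only the dual-active scenario has a service to move off the standby.
        if scenario == "dual_active":
            actions.append("switch standby->primary")
        actions.append("alarm: standby RAID card soft fault")
    elif not primary_raid_ok and not standby_raid_ok:
        # S504: no healthy target remains, so only an alarm is raised.
        actions.append("alarm: primary and standby RAID card soft fault")
    for node in detection_fault_nodes:
        # S505: a broken IO stream detection service is reported separately.
        actions.append(f"alarm: {node} IO stream detection fault")
    return actions
```

Keeping the switch decision separate from the alarm list makes it straightforward to log or audit each round before any service is actually moved.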
In one example, the cluster service system is in a hot standby scenario and includes a first node and a second node, where the first node is the primary node and the second node is the standby node. The database service runs on the first node; when the first node fails, the database service is switched from the first node to the second node. The cluster service system may also include a management node. The first node or the management node determines the RAID card states of the first node and the second node according to the detection results of the first local IO stream detection, the first opposite-end IO stream detection, the second local IO stream detection and the second opposite-end IO stream detection and the heartbeat detection results of the first node and the second node, and processes the cluster service according to the RAID card states of the first node and the second node. The details are shown in Table 1 below:
Table 1. IO stream detection results and processing schemes of the first node and the second node in the hot standby scenario
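The hot standby handling rules stated in this description (switch only when the main node's RAID card is faulty and the standby node's is healthy, otherwise alarm only) can be sketched as follows. This is a minimal Python sketch and not the content of Table 1: the `NodeState` type, the function name `handle_hot_standby`, and the action strings are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeState:
    """Hypothetical per-node diagnosis; field names are illustrative."""
    raid_abnormal: bool = False       # RAID card soft fault suspected
    io_probe_abnormal: bool = False   # the IO stream detection service itself failed


def handle_hot_standby(main: NodeState, standby: NodeState) -> List[str]:
    """Return the actions for a hot standby pair (service runs on the main node)."""
    actions: List[str] = []
    # Switch only when the main node is faulty and the standby node is healthy.
    if main.raid_abnormal and not standby.raid_abnormal:
        actions.append("switch:main->standby")
    # Every RAID card soft fault is alarmed, whether or not a switch happens.
    if main.raid_abnormal:
        actions.append("alarm:raid_soft_fault:main")
    if standby.raid_abnormal:
        actions.append("alarm:raid_soft_fault:standby")
    # A failed detection service is alarmed as an IO stream detection fault.
    if main.io_probe_abnormal:
        actions.append("alarm:io_detection_fault:main")
    if standby.io_probe_abnormal:
        actions.append("alarm:io_detection_fault:standby")
    return actions
```

Returning a list of action names, rather than performing the switch directly, keeps the decision logic separate from whatever mechanism actually moves the service.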
In one example, a cluster service system is in a dual-active scenario, including a first node and a second node, the first node and the second node being standby nodes with respect to each other. The first cluster service is operated on the first node, and the second cluster service is operated on the second node. When the first node fails, the first cluster service is switched from the first node to the second node to run; when the second node fails, the second cluster service is switched from the second node to the first node for operation. The cluster service system may further comprise a management node.
The first node, the second node or the management node acquires detection results of the first local IO stream detection and the first opposite-end IO stream detection, detection results of the second local IO stream detection and the second opposite-end IO stream detection, and heartbeat detection results of the first node and the second node. And the first node, the second node or the management node respectively determines RAID card states of the first node and the second node according to the detection results of the first local IO stream detection, the first opposite end IO stream detection, the second local IO stream detection and the second opposite end IO stream detection and the heartbeat detection results of the first node and the second node, and processes the cluster service according to the RAID card states of the first node and the second node. The specific modes are shown in the following table 2:
Table 2. IO stream detection results and processing schemes of the first node and the second node in the dual-active scenario
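The dual-active handling differs from hot standby in that either node's service may be moved to the other node. A minimal Python sketch of the rules stated in this description follows; it is not the content of Table 2, and the function name `handle_dual_active` and the action strings are assumptions introduced for illustration.

```python
from typing import List


def handle_dual_active(first_raid_abnormal: bool,
                       second_raid_abnormal: bool,
                       first_probe_abnormal: bool = False,
                       second_probe_abnormal: bool = False) -> List[str]:
    """Return the actions for a dual-active pair (each node runs its own service)."""
    actions: List[str] = []
    # Exactly one faulty RAID card: move that node's service to the healthy node.
    if first_raid_abnormal and not second_raid_abnormal:
        actions.append("switch:first_service:first->second")
    elif second_raid_abnormal and not first_raid_abnormal:
        actions.append("switch:second_service:second->first")
    # RAID card soft faults are always alarmed; with both faulty, only alarms remain.
    if first_raid_abnormal:
        actions.append("alarm:raid_soft_fault:first")
    if second_raid_abnormal:
        actions.append("alarm:raid_soft_fault:second")
    # A failed detection service is alarmed as an IO stream detection fault.
    if first_probe_abnormal:
        actions.append("alarm:io_detection_fault:first")
    if second_probe_abnormal:
        actions.append("alarm:io_detection_fault:second")
    return actions
```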
Take as an example the dual-active scenario in which the RAID state of the first node is abnormal and the RAID state of the second node is normal. Fig. 6 is a schematic flow chart of a fourth embodiment of the cluster service processing method according to an embodiment of the present application. As shown in Fig. 6, the first node and the second node are standby nodes for each other; the first node runs database service A and the second node runs database service B. The heartbeat detection results of the first node and the second node are normal, the detection result of the first local IO stream detection is abnormal, the detection result of the first opposite-end IO stream detection is normal, the detection result of the second local IO stream detection is normal, and the detection result of the second opposite-end IO stream detection is abnormal. It is therefore determined that reads of the first node's system disk are abnormal and that the system is in a false-death (hung) state, so database service A needs to be switched from the first node to the second node.
In addition, because a RAID card soft fault may disrupt the operating system and thereby prevent the cluster service from being switched normally, it is necessary to determine whether the cluster service has been switched successfully. If the switch succeeds, the switching flow ends; if the switch fails, the first node can restart the second node through the out-of-band management channel, so that the cluster service is switched from the second node to the first node. Illustratively, the out-of-band management channel may be an Intelligent Platform Management Interface (IPMI) channel. Fig. 7 is a schematic flow chart of a fifth embodiment of the cluster service processing method according to an embodiment of the present application.
As shown in Fig. 7, the embodiment of the present application performs IO stream detection by calling an underlying IO test tool, such as FIO or IOmeter, to periodically read the system disk. Illustratively, if a result is returned, the IO stream is normal; if no response is received before the timeout, it is determined that reads of the system disk are abnormal and that the system is in a false-death state. This result must be synchronized to the node where the cluster service is located so that the cluster service can be switched to the normal node, thereby ensuring service continuity. If the node where the cluster service is located cannot complete the switch while its system is in the false-death state, a network channel, such as an IPMI channel, can be established between the out-of-band management of the normal node and that of the faulty node to restart the faulty node and thereby complete the switch of the cluster service.
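The read-with-timeout check described above can be sketched as follows. The description relies on an IO test tool such as FIO or IOmeter; this minimal Python sketch swaps in a plain file read as a stand-in, so `read_probe`, `io_stream_normal`, and their parameters are assumptions rather than the tool invocation used in the embodiment. Treating a read error as abnormal is also an assumption, since the text only distinguishes a returned result from a timeout.

```python
import concurrent.futures


def read_probe(path: str, nbytes: int = 4096) -> bytes:
    """Read the first nbytes of path; stands in for an FIO/IOmeter read job."""
    with open(path, "rb") as f:
        return f.read(nbytes)


def io_stream_normal(path: str, timeout_s: float = 5.0, reader=read_probe) -> bool:
    """Return True if the read returns before timeout_s, False otherwise."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(reader, path)
    try:
        future.result(timeout=timeout_s)   # a returned result means "normal"
        return True
    except concurrent.futures.TimeoutError:
        return False                       # no response before the timeout
    except OSError:
        return False                       # assumption: a read error also counts as abnormal
    finally:
        # Do not block on a read that may never return.
        pool.shutdown(wait=False)
```

A read that is truly stuck in the kernel cannot be cancelled from a Python thread, so a real probe would more likely run in a separate process. A plain file read may also be served from the page cache without touching the RAID card, which is one reason a dedicated IO tool with direct IO is the more realistic choice.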
In this embodiment, the first node determines the RAID card states of the first node and the second node according to the detection results of the first local IO stream detection, the first opposite-end IO stream detection, the second local IO stream detection and the second opposite-end IO stream detection, together with the heartbeat detection results of the first node and the second node, and processes the cluster service according to those RAID card states. As a result, when a RAID card soft fault requires the service to be switched, the service can be switched from the faulty node to the normal node in time, and an alarm is raised for the abnormal condition. This solves the prior-art problem that a RAID card soft fault in a node cannot be detected through heartbeat network detection, so the service cannot be switched from the node with the faulty RAID card to the normal node.
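The switch-then-restart fallback described for this embodiment can be sketched as below. This is a minimal Python sketch under stated assumptions: `switch_service` and `restart_faulty_node_oob` are hypothetical callables supplied by the caller (the second would wrap an out-of-band restart, for example over an IPMI channel), and retrying the switch once after the restart is an assumption about how the restart leads to a completed switch.

```python
from typing import Callable


def switch_with_fallback(switch_service: Callable[[], bool],
                         restart_faulty_node_oob: Callable[[], None]) -> bool:
    """Try to switch the cluster service; on failure restart the faulty node and retry.

    switch_service: hypothetical callable, returns True if the switch succeeded.
    restart_faulty_node_oob: hypothetical callable that restarts the faulty node
        through the out-of-band management channel.
    """
    if switch_service():
        return True                 # switch succeeded, the flow ends here
    # The hung operating system blocked the switch: restart the faulty node
    # out of band, then attempt the switch once more (assumed retry policy).
    restart_faulty_node_oob()
    return switch_service()
```

Passing the two operations in as callables keeps the sketch independent of any particular cluster manager or management-controller tooling.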
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Fig. 8 is a schematic structural diagram of a server embodiment according to an embodiment of the present application; as shown in fig. 8, the server 60 includes: the acquisition module 61 and the processing module 62. The acquiring module 61 is configured to acquire a detection result of local input/output IO stream detection under a condition that a heartbeat network of the first node and a heartbeat network of the second node are normal; the obtaining module 61 is further configured to obtain a detection result of the opposite-end IO flow detection when the detection result of the local IO flow detection is abnormal; the processing module 62 is configured to determine a failure cause based on a detection result of the opposite-end IO flow detection and a detection result of the local IO flow detection, and process the trunking service according to the failure cause; the local IO stream detection is a first local IO stream detection of the first node on the IO stream between the first RAID card and the first disk in the first node, and/or a second local IO stream detection of the second node on the IO stream between the second RAID card and the second disk in the second node; the opposite-end IO stream detection is a first opposite-end IO stream detection of the first node to the IO stream between the second RAID card and the second disk in the second node, and/or a second opposite-end IO stream detection of the second node to the IO stream between the first RAID card and the first disk in the first node.
The server provided by the embodiment of the application can execute the technical scheme shown in the embodiment of the method, and the implementation principle and the beneficial effects are similar, and are not repeated here.
In a possible implementation manner, the acquiring module 61 is specifically configured to initiate a first data reading instruction to a disk of the local node by the first node and the second node respectively; the first data reading instruction is used for reading first data in a disk in the local node; acquiring first data returned by a disk of the local node based on the first data reading instruction; the detection result of the local IO stream detection is abnormal, including: the first data returned by the disk of the local node is not acquired.
In a possible implementation manner, the obtaining module 61 is specifically configured to obtain, at preset time intervals within a preset first time period, the first data returned by the disk of the local node based on the first data reading instruction. The detection result of the local IO stream detection being abnormal further includes: if, over the preset time intervals within the preset first time period, the total number of times the disk of the local node returns the first data is less than a first count threshold, determining that the detection result of the local IO stream detection is abnormal.
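The count-based criterion above (too few successful returns within the first time period) can be sketched as follows. This is a minimal Python sketch: the function name, the parameter names, and the injectable `probe` and `sleep` callables are assumptions made so the logic can be exercised without a real disk.

```python
import time
from typing import Callable


def abnormal_by_return_count(probe: Callable[[], bool],
                             window_s: float,
                             interval_s: float,
                             min_returns: int,
                             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Probe at fixed intervals within a window and compare the success count.

    probe: hypothetical callable, True when the disk returned the first data.
    window_s: the preset first time period; interval_s: the preset time interval.
    min_returns: the count threshold below which detection is abnormal.
    """
    attempts = int(window_s // interval_s)
    returns = 0
    for i in range(attempts):
        if probe():
            returns += 1
        if i < attempts - 1:
            sleep(interval_s)       # wait for the next interval point
    return returns < min_returns
```

Counting over a window rather than reacting to a single missed read makes the check tolerant of one-off delays.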
In a possible implementation manner, the obtaining module 61 is specifically configured to obtain, in a preset second time period, the first data returned by the disk of the local node based on the first data reading instruction. The detection result of the local IO stream detection being abnormal further includes: if the difference between the time taken to acquire the first data returned by the disk of the local node and the preset second time is greater than a first time threshold, determining that the detection result of the local IO stream detection is abnormal. The preset second time is the maximum time for normally acquiring the first data returned by the disk of the local node.
In a possible implementation manner, the obtaining module 61 is specifically configured to obtain, in a preset first time period, the first data returned by the disk of the local node based on the first data reading instruction. The detection result of the local IO stream detection being abnormal further includes: within the preset first time period, the number of times that the difference between the time taken to acquire the first data returned by the disk of the local node and the preset second time is greater than the first time threshold exceeds a second count threshold. The preset second time is the maximum time for normally acquiring the first data returned by the disk of the local node.
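The two latency-based criteria above can be sketched together: a single read is slow when it exceeds the normal maximum by more than the time threshold, and the detection is abnormal when slow reads occur more often than the count threshold within the period. This is a minimal Python sketch; the function and parameter names are assumptions, and the latencies are passed in as a list instead of being measured.

```python
from typing import Iterable


def is_slow(latency_s: float, max_normal_s: float, time_threshold_s: float) -> bool:
    """True when the read took longer than the normal maximum by more than the threshold."""
    return (latency_s - max_normal_s) > time_threshold_s


def abnormal_by_slow_count(latencies_s: Iterable[float],
                           max_normal_s: float,
                           time_threshold_s: float,
                           count_threshold: int) -> bool:
    """True when the number of slow reads in the period exceeds count_threshold."""
    slow = sum(1 for t in latencies_s if is_slow(t, max_normal_s, time_threshold_s))
    return slow > count_threshold
```

For instance, with a normal maximum of 0.5 s and a time threshold of 0.3 s, reads taking 0.9 s, 1.2 s and 1.5 s count as slow while reads taking 0.1 s and 0.2 s do not.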
The server provided by the embodiment of the application can execute the technical scheme shown in the embodiment of the method, and the implementation principle and the beneficial effects are similar, and are not repeated here.
In one possible implementation, the acquiring module 61 is specifically configured to initiate, by the first node or the second node, a second read data instruction to a disk in the peer node; the second read data instruction is used for reading second data in a disk in the opposite end node; acquiring second data returned by the disk in the opposite end node based on the second read data instruction; if the second data returned by the disk in the opposite terminal node is obtained, determining that the detection result of the opposite terminal IO stream detection is normal; and if the second data returned by the disk in the opposite terminal node is not obtained, determining that the detection result of the opposite terminal IO stream detection is abnormal.
The server provided by the embodiment of the application can execute the technical scheme shown in the embodiment of the method, and the implementation principle and the beneficial effects are similar, and are not repeated here.
In a possible implementation manner, the processing module 62 is specifically configured to: determine that the RAID state of the first node is abnormal when the detection result of the first local IO stream detection is abnormal and the detection result of the first opposite-end IO stream detection is normal; determine that the RAID state of the second node is abnormal when the detection result of the second local IO stream detection is abnormal and the detection result of the second opposite-end IO stream detection is normal; determine that the IO stream detection service of the first node is abnormal when the detection result of the first local IO stream detection is abnormal, the detection result of the first opposite-end IO stream detection is abnormal, and the detection result of the second local IO stream detection is normal; and determine that the IO stream detection service of the second node is abnormal when the detection result of the second local IO stream detection is abnormal, the detection result of the second opposite-end IO stream detection is abnormal, and the detection result of the first local IO stream detection is normal.
The server provided by the embodiment of the application can execute the technical scheme shown in the embodiment of the method, and the implementation principle and the beneficial effects are similar, and are not repeated here.
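The fault-cause determination can be sketched as a small classifier over the four detection results, following the four rules as they are stated in claim 7. This is a minimal Python sketch that assumes the heartbeat network is already known to be normal; the `Probes` type, the function name `classify`, and the cause strings are assumptions introduced for illustration.

```python
from typing import NamedTuple, Set


class Probes(NamedTuple):
    """Detection results; True means the detection result is normal."""
    first_local_ok: bool    # first node reading its own disk
    first_peer_ok: bool     # first node reading the second node's disk
    second_local_ok: bool   # second node reading its own disk
    second_peer_ok: bool    # second node reading the first node's disk


def classify(p: Probes) -> Set[str]:
    """Map the four detection results to fault causes (heartbeat assumed normal)."""
    causes: Set[str] = set()
    # Local read fails but the same node can still read the peer: local RAID fault.
    if not p.first_local_ok and p.first_peer_ok:
        causes.add("first_raid_abnormal")
    if not p.second_local_ok and p.second_peer_ok:
        causes.add("second_raid_abnormal")
    # Both of a node's reads fail while the other node's local read is fine:
    # the node's IO stream detection service itself is at fault.
    if not p.first_local_ok and not p.first_peer_ok and p.second_local_ok:
        causes.add("first_io_detection_abnormal")
    if not p.second_local_ok and not p.second_peer_ok and p.first_local_ok:
        causes.add("second_io_detection_abnormal")
    return causes
```

With the results from the Fig. 6 example (first local abnormal, first opposite-end normal, second local normal, second opposite-end abnormal), `classify(Probes(False, True, True, False))` yields only the first node's RAID fault.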
In a possible implementation manner, the cluster service system is in a hot standby scenario, wherein a first node is a main node, and a second node is a standby node; the processing module 62 is specifically configured to switch the cluster service from the first node to operate on the second node when it is determined that the RAID card status of the first node is abnormal and the RAID card status of the second node is normal, and perform alarm processing on a soft failure of the RAID card of the first node; when the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, carrying out alarm processing on the soft fault of the RAID card of the second node; when the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, carrying out alarm processing on the RAID card soft faults of the first node and the second node; and when the IO stream detection service of the first node and/or the second node is abnormal, carrying out alarm processing of the IO stream detection fault of the first node and/or the second node.
In a possible implementation manner, the cluster service system is in a dual-active scenario, the first node and the second node are standby nodes for each other, the first node runs a first cluster service, and the second node runs a second cluster service; the processing module 62 is specifically configured to switch the first cluster service from the first node to the second node for operation when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, and perform alarm processing for the RAID card soft fault of the first node; when the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switch the second cluster service from the second node to the first node for operation, and perform alarm processing for the RAID card soft fault of the second node; when the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, perform alarm processing for the RAID card soft faults of the first node and the second node; and when the IO stream detection service of the first node and/or the second node is abnormal, perform alarm processing for the IO stream detection fault of the first node and/or the second node.
The server provided by the embodiment of the application can execute the technical scheme shown in the embodiment of the method, and the implementation principle and the beneficial effects are similar, and are not repeated here.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 9, the server 70 includes: a processor 71, a memory 72, and a communication interface 73; wherein the memory 72 is for storing executable instructions of the processor 71; the processor 71 is configured to perform the technical solutions of any of the method embodiments described above via execution of executable instructions.
Alternatively, the memory 72 may be separate or integrated with the processor 71.
Alternatively, when the memory 72 is a device separate from the processor 71, the server 70 may further include: bus 74 for connecting the above devices.
The server is used for executing the technical scheme in any of the method embodiments, and the implementation principle and the technical effect are similar, and are not repeated here.
The embodiment of the application also provides a cluster service system. The cluster service system comprises at least one first node and at least one second node, wherein the first node is a main node, and the second node is a standby node; wherein the first node executes the technical scheme in any of the foregoing method embodiments.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features can be replaced equivalently; such modifications and substitutions do not depart from the spirit of the application.

Claims (10)

1. A cluster service processing method, wherein the method is applied to a cluster service system, and the cluster service system comprises a first node and a second node; the method comprises the following steps:
acquiring a detection result of local input/output (IO) stream detection when the heartbeat network of the first node and the heartbeat network of the second node are normal;
acquiring a detection result of opposite-end IO stream detection when the detection result of the local IO stream detection is abnormal;
determining a fault cause based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and processing the cluster service according to the fault cause;
wherein the local IO stream detection is a first local IO stream detection performed by the first node on the IO stream between a first RAID card and a first disk in the first node, and a second local IO stream detection performed by the second node on the IO stream between a second RAID card and a second disk in the second node;
the opposite-end IO stream detection is a first opposite-end IO stream detection performed by the first node on the IO stream between the second RAID card and the second disk in the second node, and/or a second opposite-end IO stream detection performed by the second node on the IO stream between the first RAID card and the first disk in the first node.
2. The cluster service processing method according to claim 1, wherein acquiring the detection result of the local IO stream detection comprises:
the first node and the second node each initiating a first data reading instruction to the disk of the local node, the first data reading instruction being used for reading first data in the disk of the local node;
acquiring the first data returned by the disk of the local node based on the first data reading instruction;
wherein the detection result of the local IO stream detection being abnormal comprises: the first data returned by the disk of the local node is not acquired.
3. The cluster service processing method according to claim 2, wherein acquiring the first data returned by the disk of the local node based on the first data reading instruction comprises:
acquiring, at preset time intervals within a preset first time period, the first data returned by the disk of the local node based on the first data reading instruction;
wherein the detection result of the local IO stream detection being abnormal further comprises: if, over the preset time intervals within the preset first time period, the total number of times the disk of the local node returns the first data is less than a first count threshold, determining that the detection result of the local IO stream detection is abnormal.
4. The cluster service processing method according to claim 2, wherein acquiring the first data returned by the disk of the local node based on the first data reading instruction comprises:
acquiring, in a preset second time period, the first data returned by the disk of the local node based on the first data reading instruction;
wherein the detection result of the local IO stream detection being abnormal further comprises: if the difference between the time taken to acquire the first data returned by the disk of the local node and the preset second time is greater than a first time threshold, determining that the detection result of the local IO stream detection is abnormal;
the preset second time is the maximum time for normally acquiring the first data returned by the disk of the local node.
5. The cluster service processing method according to claim 2, wherein acquiring the first data returned by the disk of the local node based on the first data reading instruction comprises:
acquiring, in a preset first time period, the first data returned by the disk of the local node based on the first data reading instruction;
wherein the detection result of the local IO stream detection being abnormal further comprises: within the preset first time period, the number of times that the difference between the time taken to acquire the first data returned by the disk of the local node and the preset second time is greater than a first time threshold exceeds a second count threshold; the preset second time is the maximum time for normally acquiring the first data returned by the disk of the local node.
6. The cluster service processing method according to claim 1, wherein acquiring the detection result of the opposite-end IO stream detection comprises:
the first node or the second node initiating a second data reading instruction to the disk of the opposite-end node, the second data reading instruction being used for reading second data in the disk of the opposite-end node;
acquiring the second data returned by the disk of the opposite-end node based on the second data reading instruction; if the second data returned by the disk of the opposite-end node is acquired, determining that the detection result of the opposite-end IO stream detection is normal; and if the second data returned by the disk of the opposite-end node is not acquired, determining that the detection result of the opposite-end IO stream detection is abnormal.
7. The cluster service processing method according to claim 1, wherein determining the fault cause based on the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection comprises:
if the detection result of the first local IO stream detection is abnormal and the detection result of the first opposite-end IO stream detection is normal, determining that the RAID state of the first node is abnormal;
if the detection result of the second local IO stream detection is abnormal and the detection result of the second opposite-end IO stream detection is normal, determining that the RAID state of the second node is abnormal;
if the detection result of the first local IO stream detection is abnormal, the detection result of the first opposite-end IO stream detection is abnormal, and the detection result of the second local IO stream detection is normal, determining that the IO stream detection service of the first node is abnormal;
and if the detection result of the second local IO stream detection is abnormal, the detection result of the second opposite-end IO stream detection is abnormal, and the detection result of the first local IO stream detection is normal, determining that the IO stream detection service of the second node is abnormal.
8. The cluster service processing method according to claim 7, wherein processing the cluster service according to the fault cause comprises:
the cluster service system is in a hot standby scenario, wherein the first node is a main node, and the second node is a standby node;
when the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the cluster service from the first node to the second node for operation, and performing alarm processing for the RAID card soft fault of the first node;
when the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, performing alarm processing for the RAID card soft fault of the second node;
when the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, performing alarm processing for the RAID card soft faults of the first node and the second node;
and when the IO stream detection service of the first node and/or the second node is abnormal, performing alarm processing for the IO stream detection fault of the first node and/or the second node.
9. The cluster service processing method according to claim 7, wherein processing the cluster service according to the fault cause comprises:
the cluster service system is in a dual-active scenario, the first node and the second node are standby nodes for each other, a first cluster service runs on the first node, and a second cluster service runs on the second node;
when the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the first cluster service from the first node to the second node for operation, and performing alarm processing for the RAID card soft fault of the first node;
when the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switching the second cluster service from the second node to the first node for operation, and performing alarm processing for the RAID card soft fault of the second node;
when the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, performing alarm processing for the RAID card soft faults of the first node and the second node;
and when the IO stream detection service of the first node and/or the second node is abnormal, performing alarm processing for the IO stream detection fault of the first node and/or the second node.
10. A cluster service system, comprising:
at least one first node and at least one second node, wherein the first node is a main node, and the second node is a standby node;
wherein the first node performs the cluster service processing method of any one of claims 1 to 9.
CN202310598620.9A 2023-05-19 2023-05-19 Cluster service processing method, server and system Pending CN116668335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310598620.9A CN116668335A (en) 2023-05-19 2023-05-19 Cluster service processing method, server and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310598620.9A CN116668335A (en) 2023-05-19 2023-05-19 Cluster service processing method, server and system

Publications (1)

Publication Number Publication Date
CN116668335A true CN116668335A (en) 2023-08-29

Family

ID=87713022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310598620.9A Pending CN116668335A (en) 2023-05-19 2023-05-19 Cluster service processing method, server and system

Country Status (1)

Country Link
CN (1) CN116668335A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination