CN109088794A - A kind of fault monitoring method and device of node - Google Patents

Publication number
CN109088794A
Authority
CN
China
Prior art keywords
node
heartbeat
heartbeat message
message
inspecting module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810950141.8A
Other languages
Chinese (zh)
Inventor
丁瑞锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810950141.8A
Publication of CN109088794A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/10: Active monitoring, e.g. heartbeat, ping or trace-route
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0677: Localisation of faults

Abstract

An embodiment of the invention discloses a node fault monitoring method and device. The method includes: a heartbeat monitoring module of a first node receives a first heartbeat message and a second heartbeat message sent by a second node, the first heartbeat message and the second heartbeat message being generated by two adjacent heartbeats of the second node; the heartbeat monitoring module then calculates a first time interval between the first heartbeat message and the second heartbeat message; if the first time interval is greater than a preset time threshold, the heartbeat monitoring module determines that the second node has failed. In this way, an independent heartbeat monitoring module on each node handles the heartbeat messages sent by other nodes and monitors their state. This overcomes the problem that, when heartbeat messages and service messages share one processing unit, interference from service message processing causes node faults to be misjudged. The accuracy of node fault monitoring is improved, which in turn improves the reliability and security of the distributed storage system.

Description

Node fault monitoring method and device
Technical field
The present invention relates to the technical field of distributed storage, and in particular to a node fault monitoring method and device.
Background
With the arrival of the information age, and given concerns about the security and reliability of information, traditional centralized storage systems (that is, storage systems that keep all data together) can no longer meet demand. As a result, distributed storage systems, which spread data across multiple independent storage servers, are used more and more widely. A distributed storage system uses multiple storage servers to share the storage load and uses a location server to locate stored information. This improves the reliability, availability and access efficiency of the storage system and makes it easy to scale.
The nodes in a distributed storage system monitor one another for faults by sending and receiving heartbeat messages. That is, each node continuously sends out heartbeat messages, and the other nodes receive them and judge whether the node has failed: if no heartbeat message is received from a node for longer than a specific duration, that node can be determined to have failed. However, a node is generally configured with only one event loop processing center, which handles not only heartbeat messages but also various service messages (for example, database recovery, data queries and node election). If any one service message takes too long to process, the sending of heartbeat messages may be blocked, so the node's state is misjudged: the node is wrongly considered to have failed, which in turn disrupts the normal operation of the distributed storage system.
Therefore, to improve the accuracy and reliability of distributed storage systems, there is an urgent need for a node fault monitoring method that avoids heartbeat misjudgment, so as to improve the reliability of the distributed storage system.
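The blocking effect described in the background can be illustrated with a small simulation. This is a sketch under assumed conditions and is not part of the patent: the function name `heartbeat_send_times`, the task list and the one-second heartbeat period are all hypothetical. It models a single event loop that can only send a heartbeat between tasks, so one long service task stretches the gap between heartbeats.

```python
from typing import List, Tuple


def heartbeat_send_times(tasks: List[Tuple[str, float]], period: float) -> List[float]:
    """Simulate one event loop that handles service tasks and heartbeats.

    A heartbeat is due every `period` seconds, but the loop can only send it
    between tasks. Returns the simulated times at which heartbeats go out.
    """
    now = 0.0        # simulated clock, in seconds
    next_due = 0.0   # time at which the next heartbeat becomes due
    sent: List[float] = []
    for _name, duration in tasks:
        if now >= next_due:
            sent.append(now)
            next_due = now + period
        now += duration  # the loop is busy for the whole task
    if now >= next_due:
        sent.append(now)
    return sent


# Hypothetical workload: one long database recovery among short queries.
tasks = [("query", 0.5), ("query", 0.5), ("db_recovery", 6.0), ("query", 0.5)]
times = heartbeat_send_times(tasks, period=1.0)
gaps = [b - a for a, b in zip(times, times[1:])]
```

With this assumed workload the heartbeats leave at 0.0, 1.0 and 7.0, so the second gap is 6 seconds. A peer using a threshold of, say, 3 seconds would wrongly conclude that the node has failed, even though it was only busy.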
Summary of the invention
The technical problem to be solved by this application is to provide a node fault monitoring method and device, so that a failure of any node in a distributed storage system, whenever it occurs, can be detected through heartbeat messages, while fault misjudgments caused by overly long service message processing are avoided, thereby improving the reliability and security of the distributed storage system.
In a first aspect, a node fault monitoring method is provided, comprising:
A heartbeat monitoring module of a first node receives a first heartbeat message and a second heartbeat message sent by a second node; the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
The heartbeat monitoring module calculates a first time interval between the first heartbeat message and the second heartbeat message;
If the first time interval is greater than a preset time threshold, the heartbeat monitoring module determines that the second node has failed.
Optionally, the method further comprises:
When the second node is determined to have failed, the heartbeat monitoring module sends a service interruption instruction, which instructs that services with the second node be interrupted; the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, after the second node is determined to have failed, the method further comprises:
The heartbeat monitoring module of the first node receives a third heartbeat message and a fourth heartbeat message sent by the second node; the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
The heartbeat monitoring module calculates a second time interval between receiving the third heartbeat message and the fourth heartbeat message;
If the second time interval is not greater than the preset time threshold, the heartbeat monitoring module determines that the second node has restored its connection.
Optionally, the method further comprises:
When the second node is determined to have restored its connection, the heartbeat monitoring module sends a service recovery instruction, which instructs that services with the second node be restored; the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, if the second node is the master node, the service interruption instruction or the service recovery instruction is also used to trigger the election of a new master node.
Optionally, the method further comprises:
When a third node is added to or deleted from the distributed storage system in which the first node is located,
The heartbeat monitoring module of the first node receives an add/delete monitoring instruction;
According to the add/delete monitoring instruction, the heartbeat monitoring module adds or removes fault monitoring of the third node's heartbeat messages.
In a second aspect, a node fault monitoring device is also provided, comprising:
A first receiving unit, configured for the heartbeat monitoring module of a first node to receive a first heartbeat message and a second heartbeat message sent by a second node; the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
A first calculating unit, configured for the heartbeat monitoring module to calculate a first time interval between the first heartbeat message and the second heartbeat message;
A first determining unit, configured for the heartbeat monitoring module to determine that the second node has failed if the first time interval is greater than a preset time threshold.
Optionally, the device further comprises:
A first sending unit, configured for the heartbeat monitoring module to send a service interruption instruction when the second node is determined to have failed, the instruction indicating that services with the second node be interrupted; the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, for use after the second node is determined to have failed, the device further comprises:
A second receiving unit, configured for the heartbeat monitoring module of the first node to receive a third heartbeat message and a fourth heartbeat message sent by the second node; the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
A second calculating unit, configured for the heartbeat monitoring module to calculate a second time interval between receiving the third heartbeat message and the fourth heartbeat message;
A second determining unit, configured for the heartbeat monitoring module to determine that the second node has restored its connection if the second time interval is not greater than the preset time threshold.
Optionally, the device further comprises:
A second sending unit, configured for the heartbeat monitoring module to send a service recovery instruction when the second node is determined to have restored its connection, the instruction indicating that services with the second node be restored; the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, if the second node is the master node, the service interruption instruction or the service recovery instruction is also used to trigger the election of a new master node.
Optionally, the device further comprises:
A third receiving unit, configured for the heartbeat monitoring module of the first node to receive an add/delete monitoring instruction when a third node is added to or deleted from the distributed storage system in which the first node is located;
A monitoring unit, configured for the heartbeat monitoring module to add or remove fault monitoring of the third node's heartbeat messages according to the add/delete monitoring instruction.
In the embodiments of the present application, the heartbeat monitoring module of the first node first receives a first heartbeat message and a second heartbeat message sent by the second node, the two messages being generated by two adjacent heartbeats of the second node. The heartbeat monitoring module then calculates a first time interval between the first heartbeat message and the second heartbeat message. If the first time interval is greater than a preset time threshold, the heartbeat monitoring module can determine that the second node has failed. In this way, an independent heartbeat monitoring module on each node handles the heartbeat messages sent by other nodes and monitors whether those nodes have failed. This overcomes the problem that, when heartbeat messages and service messages share one processing unit, interference from service message processing causes node faults to be misjudged. The accuracy of node fault monitoring is improved, which in turn improves the reliability and security of the distributed storage system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some of the embodiments described in this application, and a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of the network system framework involved in an application scenario in an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a node fault monitoring method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of the heartbeat monitoring module provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of another node fault monitoring method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of yet another node fault monitoring method provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a node fault monitoring device provided by an embodiment of the present application.
Detailed description of the embodiments
A distributed storage system stores data across multiple independent storage servers, uses those servers to share the storage load, and uses a location server to locate stored information. This improves the reliability, availability and access efficiency of the storage system and makes it easy to scale, so such systems are widely favored by users. Fault monitoring of the nodes in a distributed storage system is carried out by sending and receiving heartbeat messages. However, because a node is configured with only one event loop processing center, which handles both heartbeat messages and various service messages, any delay in processing a service message may block the sending of heartbeat messages. The node's state is then misjudged, that is, the node is wrongly considered to have failed, which in turn disrupts the normal operation of the distributed storage system.
On this basis, to improve the accuracy and reliability of the distributed storage system, an embodiment of the invention provides a node fault monitoring method that avoids heartbeat misjudgment. First, the heartbeat monitoring module of the first node receives a first heartbeat message and a second heartbeat message sent by the second node, the two messages being generated by two adjacent heartbeats of the second node. The heartbeat monitoring module then calculates a first time interval between the first heartbeat message and the second heartbeat message. If the first time interval is greater than a preset time threshold, the heartbeat monitoring module can determine that the second node has failed. In this way, an independent heartbeat monitoring module on each node handles the heartbeat messages sent by other nodes and monitors whether those nodes have failed. This overcomes the problem that, when heartbeat messages and service messages share one processing unit, interference from service message processing causes node faults to be misjudged. The accuracy of node fault monitoring is improved, which in turn improves the reliability and security of the distributed storage system.
For example, one scenario for the embodiments of the present invention is the scenario shown in Fig. 1. The distributed storage system in this scenario may be CTDB. CTDB is a clustered TDB database that can be used by Samba or other applications to store data. If an application already uses TDB to store temporary data, it can very easily be extended to cluster mode by using CTDB. CTDB provides the same function interface as TDB and is built as a cluster across multiple physical machines. In this distributed storage system, each node can run the CTDB service and provide virtual IPs externally, ensuring the orderly storage process of the distributed storage system.
The distributed storage system 100 may include node 110, node 120, ..., node 1N0 (where N is greater than or equal to 2). Node 110 includes server 111, heartbeat monitoring module 112 and CTDB main thread 113; node 120 includes server 121, heartbeat monitoring module 122 and CTDB main thread 123; similarly, node 1N0 includes server 1N1, heartbeat monitoring module 1N2 and CTDB main thread 1N3.
As an example, the heartbeat monitoring module 112 of node 110 can receive the heartbeat messages sent by the other nodes (including node 120 and node 1N0) and process them. Specifically, it judges whether the interval between two adjacent heartbeat messages received from the same node 1N0 is within the preset time threshold; if the interval exceeds the time threshold, node 110 can determine that node 1N0 has failed.
It can be understood that the above scenario is only one example scenario provided by the embodiments of the present invention, and the embodiments of the present invention are not limited to it.
The specific implementation of the node fault monitoring method and device in the embodiments of the present invention is described in detail below through embodiments, with reference to the accompanying drawings.
Referring to Fig. 2, a node fault monitoring method provided by an embodiment of the present application is shown. The method may include, for example:
Step 201: the heartbeat monitoring module of the first node receives a first heartbeat message and a second heartbeat message sent by the second node; the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node.
It can be understood that, in a distributed storage system, each node can send heartbeat messages through its heartbeat monitoring module and can also receive, through that module, the heartbeat messages sent by the heartbeat monitoring modules of other nodes. In this way, each node can monitor the state of the other nodes based on the regularity of their heartbeat messages and identify failed nodes, ensuring reliable operation of the distributed storage system. This embodiment is described using the example of the first node monitoring the state of the second node through its heartbeat monitoring module. In practice, the second node can likewise monitor the state of the first node through its own heartbeat monitoring module; for how other nodes monitor one another using the method of this embodiment, refer to the description here, which is not repeated.
The first node may be any node in the distributed storage system and includes a heartbeat monitoring module for receiving heartbeat messages sent by other nodes. A heartbeat message is a message that a node in normal working condition periodically sends out to reflect its state. The second node may be any node in the distributed storage system other than the first node, and it can continuously send heartbeat messages through its own heartbeat monitoring module.
Fig. 3 is a schematic structural diagram of the heartbeat monitoring module provided by an embodiment of the present invention. The heartbeat monitoring module 300 includes an external communication module 301 and an internal communication module 302.
The external communication module 301 is responsible for sending heartbeat messages and receiving the heartbeat messages of other nodes; the internal communication module 302 is responsible for communicating with the node's internal main process (such as the CTDB main process) and informing that main process of the current heartbeat monitoring result.
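The split between the external and internal communication sides can be sketched as follows. This is only an illustration under assumptions: the class `HeartbeatModule`, its method names and the use of an in-process queue toward the main process are hypothetical, since the source names the two submodules but does not specify their interfaces.

```python
import queue
from typing import Dict, Optional, Tuple


class HeartbeatModule:
    """Sketch of a heartbeat monitoring module with two sides (names assumed)."""

    def __init__(self) -> None:
        # Internal side: monitoring results queued for the node's main process.
        self._to_main: queue.Queue = queue.Queue()
        # External side: last receive time recorded for each peer node.
        self._last_seen: Dict[str, float] = {}

    def external_receive(self, node_id: str, received_at: float) -> None:
        """External communication: accept a heartbeat from another node."""
        self._last_seen[node_id] = received_at

    def last_seen(self, node_id: str) -> Optional[float]:
        """Return the last receive time for a peer, or None if never seen."""
        return self._last_seen.get(node_id)

    def internal_report(self, node_id: str, status: str) -> None:
        """Internal communication: pass a monitoring result to the main process."""
        self._to_main.put((node_id, status))

    def main_process_poll(self) -> Optional[Tuple[str, str]]:
        """Called by the main process to fetch the next result, if any."""
        try:
            return self._to_main.get_nowait()
        except queue.Empty:
            return None
```

Keeping the two sides separate means the main process only ever sees monitoring results, never raw heartbeats, which is the point of moving heartbeat handling out of the shared event loop.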
In a specific implementation, the second node may send the first heartbeat message to the external communication module in the heartbeat monitoring module of the first node at a first moment, and then send the second heartbeat message to the same external communication module at a second moment. The second node sends no other heartbeat message between the first moment and the second moment; that is, the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node. The external communication module in the heartbeat monitoring module of the first node thus receives the first heartbeat message and the second heartbeat message in succession.
In addition, the heartbeat monitoring module of the first node can obtain the send time or the receive time of each of the two heartbeat messages. In one case, the heartbeat monitoring module of the first node parses the received first heartbeat message and obtains the first send time carried in it, that is, the moment at which the second node sent the first heartbeat message; similarly, it parses the received second heartbeat message and obtains the second send time carried in it, that is, the moment at which the second node sent the second heartbeat message. In another case, the heartbeat monitoring module of the first node records the moment at which it receives the first heartbeat message and takes that moment as the first receive time; similarly, it records the moment at which it receives the second heartbeat message and takes that moment as the second receive time.
It can be understood that obtaining the send times or receive times of the first and second heartbeat messages prepares for the calculation of the first time interval in step 202.
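The two ways of obtaining a timestamp can be sketched as below. The message layout is an assumption made for illustration: the source does not define a wire format, so the JSON encoding and the `sent_at` field are hypothetical, as are the function names.

```python
import json
import time
from typing import Callable


def send_time_of(raw: bytes) -> float:
    """Case 1: parse a heartbeat and return the send time it carries.

    Assumes a JSON body with a numeric "sent_at" field (hypothetical format).
    """
    return float(json.loads(raw.decode("utf-8"))["sent_at"])


def receive_time_of(clock: Callable[[], float] = time.monotonic) -> float:
    """Case 2: record the local moment at which a heartbeat is received.

    The clock is injectable so the function can be tested deterministically.
    """
    return clock()


# A made-up heartbeat from a peer called "node-2".
message = json.dumps({"node": "node-2", "sent_at": 12.5}).encode("utf-8")
first_send_time = send_time_of(message)
```

A monotonic clock is used for receive times in this sketch so that intervals are not distorted if the wall clock is adjusted; that choice is a suggestion, not something stated in the source.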
Step 202: the heartbeat monitoring module calculates the first time interval between receiving the first heartbeat message and the second heartbeat message.
In a specific implementation, the external communication module or the internal communication module of the heartbeat monitoring module may use the obtained send times or receive times of the first and second heartbeat messages to calculate the time interval between sending, or between receiving, the first heartbeat message and the second heartbeat message. This is recorded as the first time interval and can represent one heartbeat period.
As an example, when the heartbeat monitoring module obtains the first send time T_S1 of the first heartbeat message and the second send time T_S2 of the second heartbeat message, it can subtract the first send time T_S1 from the second send time T_S2 to obtain the duration T_S that elapsed from the first send time to the second send time, which is recorded as the first time interval.
As another example, when the heartbeat monitoring module obtains the first receive time T_R1 of the first heartbeat message and the second receive time T_R2 of the second heartbeat message, it can subtract the first receive time T_R1 from the second receive time T_R2 to obtain the duration T_R that elapsed from the first receive time to the second receive time, which is recorded as the first time interval.
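The two variants of the first time interval can be written compactly as follows. The numeric values in the comment are made up for illustration and do not come from the source.

```latex
\begin{align*}
T_S &= T_{S2} - T_{S1} && \text{(send times carried in the heartbeat messages)} \\
T_R &= T_{R2} - T_{R1} && \text{(receive times recorded by the first node)}
\end{align*}
% Illustrative values (assumed):
% T_{R1} = 10.0\,\mathrm{s},\quad T_{R2} = 11.2\,\mathrm{s}
% \;\Rightarrow\; T_R = 11.2\,\mathrm{s} - 10.0\,\mathrm{s} = 1.2\,\mathrm{s}
```

In each variant both timestamps are read from a single clock (the second node's for T_S, the first node's for T_R), so the difference does not depend on the two nodes' clocks being synchronized.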
Step 203: if the first time interval is greater than the preset time threshold, the heartbeat monitoring module determines that the second node has failed.
It can be understood that the time threshold may be a preset maximum allowed value of the heartbeat period. When the first time interval between the first heartbeat message and the second heartbeat message is greater than the time threshold, the second node's heartbeat messages are being sent abnormally, so the second node can be determined to have failed. Otherwise, when the first time interval between the first heartbeat message and the second heartbeat message is not greater than the time threshold, the second node's heartbeat messages are being sent normally, and fault monitoring continues with the next heartbeat period.
In a specific implementation, the external communication module or the internal communication module in the heartbeat monitoring module of the first node compares the time threshold with the calculated first time interval. When the heartbeat monitoring module determines that the first time interval is greater than the time threshold, it determines that the second node has failed; when it determines that the first time interval is less than or equal to the time threshold, it determines that the second node is normal.
It should be noted that the above describes only the case where, after the second node fails, the sending of the second heartbeat message is delayed for too long. If the second node no longer generates a second heartbeat message after failing, that is, the heartbeat monitoring module of the first node never receives the second heartbeat message, then the first time interval tends to infinity, and fault monitoring can still be carried out with this embodiment to determine that the second node has failed.
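The threshold decision, including the case where the second heartbeat never arrives, can be sketched as follows. The function name `is_faulty` and its parameters are hypothetical. Treating the time elapsed so far as a lower bound on the interval is one assumed way to realize the "interval tends to infinity" case, not something the source prescribes.

```python
from typing import Optional


def is_faulty(last_heartbeat: float, now: float, threshold: float,
              next_heartbeat: Optional[float] = None) -> bool:
    """Decide whether a peer has failed, given times in seconds.

    If the next heartbeat has arrived, the interval is the gap between the
    two heartbeats. If it has not arrived yet, the time elapsed since the
    last heartbeat is used, because the true interval can only be larger.
    """
    end = next_heartbeat if next_heartbeat is not None else now
    interval = end - last_heartbeat
    return interval > threshold


# Assumed threshold of 3 seconds.
on_time = is_faulty(last_heartbeat=0.0, now=1.0, threshold=3.0, next_heartbeat=1.0)
late = is_faulty(last_heartbeat=0.0, now=5.0, threshold=3.0, next_heartbeat=4.0)
missing = is_faulty(last_heartbeat=0.0, now=5.0, threshold=3.0)
```

Here `on_time` is False while `late` and `missing` are both True. An interval exactly equal to the threshold counts as normal, matching the "less than or equal to" rule in the text.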
As a possible example, when the second node is determined to have failed, the heartbeat monitoring module sends a service interruption instruction, which instructs that services with the second node be interrupted; the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
The service interruption instruction may specifically be sent by the internal communication module of the first node's heartbeat monitoring module to the internal main process. The instruction may include not only the identifier of the failed node whose services are to be interrupted, that is, the identifier of the second node, but also the operation command for the IP drift operation and the operation command for the database update operation. Therefore, on receiving the service interruption instruction, the main process knows which target node has failed and needs its related services interrupted, and execution of the operation commands can be triggered to perform the corresponding IP drift and database update.
It should be noted that if the second node is the master node, the service interruption instruction may also include an operation command for electing a new master node, which triggers the election of a new master node when the second node fails.
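One possible shape for the service interruption instruction is sketched below. The source lists what the instruction may contain but not how it is encoded, so the `ServiceInstruction` class, its field names and the operation strings are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceInstruction:
    """Hypothetical message from the heartbeat module to the main process."""
    kind: str                      # "interrupt" in this sketch
    node_id: str                   # identifier of the failed node
    operations: List[str] = field(default_factory=list)


def build_interrupt_instruction(node_id: str, is_master: bool) -> ServiceInstruction:
    """Build an interruption instruction for a node judged to have failed."""
    # IP drift and database update are always triggered.
    operations = ["ip_drift", "update_database"]
    # A failed master additionally triggers election of a new master.
    if is_master:
        operations.append("elect_master")
    return ServiceInstruction("interrupt", node_id, operations)


instruction = build_interrupt_instruction("node-2", is_master=True)
```

Carrying the operations inside the instruction lets the main process act on a single message without having to look up what a failure of that particular node implies.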
After step 203, that is, after the second node is determined to have failed, the second node may be replaced or repaired. To restore such a node flexibly and improve resource utilization, the node fault monitoring method provided by this embodiment of the invention may, as shown in Fig. 4, further include:
Step 404: the heartbeat monitoring module of the first node receives a third heartbeat message and a fourth heartbeat message sent by the second node; the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
Step 405: the heartbeat monitoring module calculates a second time interval between receiving the third heartbeat message and the fourth heartbeat message;
Step 406: if the second time interval is not greater than the preset time threshold, the heartbeat monitoring module determines that the second node has restored its connection.
It can be understood that, for the third heartbeat message and the fourth heartbeat message, reference may be made to the description of the first heartbeat message and the second heartbeat message, and the second time interval is calculated in the same way as the first time interval; details are not repeated here.
It should be noted that, after the second node is determined to have failed, if the external communication module in the heartbeat monitoring module of the first node again receives, within a certain period, two heartbeat messages generated by two adjacent heartbeats of the second node, that is, the third heartbeat message and the fourth heartbeat message, then the heartbeat monitoring module of the first node can calculate the second time interval in step 405 in the manner of step 202.
When the second time interval is not greater than the preset time threshold, the second node's heartbeat messages have returned to normal, and the heartbeat monitoring module of the first node can determine that the second node has restored its connection. Otherwise, when the second time interval is still greater than the preset time threshold, the second node's heartbeat messages are still faulty.
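The fault and recovery decisions together behave like a two-state machine per monitored node, which can be sketched as below. The class `PeerMonitor` and the event strings are hypothetical, and for simplicity this sketch only evaluates the interval when a heartbeat arrives; detecting a heartbeat that never arrives would need an additional timer.

```python
from typing import Optional


class PeerMonitor:
    """Tracks one peer and reports fault and recovery transitions."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.last: Optional[float] = None  # time of the previous heartbeat
        self.faulty = False

    def on_heartbeat(self, t: float) -> Optional[str]:
        """Process a heartbeat at time t; return "fault", "recovered" or None."""
        event = None
        if self.last is not None:
            interval = t - self.last
            if interval > self.threshold and not self.faulty:
                self.faulty = True
                event = "fault"
            elif interval <= self.threshold and self.faulty:
                self.faulty = False
                event = "recovered"
        self.last = t
        return event


# Assumed threshold of 3 seconds; heartbeats at 0, 1, 10 and 11 seconds.
monitor = PeerMonitor(threshold=3.0)
events = [monitor.on_heartbeat(t) for t in (0.0, 1.0, 10.0, 11.0)]
# events == [None, None, "fault", "recovered"]
```

In this run the 9-second gap before the heartbeat at 10.0 marks the peer as failed, and the following pair (10.0 and 11.0) plays the role of the third and fourth heartbeat messages, whose 1-second interval marks it as recovered.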
As a possible example, when the second node is determined to have restored its connection, the heartbeat monitoring module sends a service recovery instruction, which instructs that services with the second node be restored; the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
The service recovery instruction may specifically be sent by the internal communication module of the first node's heartbeat monitoring module to the internal main process. The instruction may include not only the identifier of the failed node whose services are to be restored, that is, the identifier of the second node, but also the operation command for the IP drift operation and the operation command for the database update operation. Therefore, on receiving the service recovery instruction, the main process knows which target node had its services interrupted and needs them restored, and execution of the operation commands can be triggered to perform the corresponding IP drift and database update.
It should be noted that if the second node was the master node before it failed, the service recovery instruction may also include an operation command for electing a new master node, which triggers the election of a new master node when the second node restores its connection.
In addition, to support node fault monitoring in a distributed storage system whose structure changes, newly added or deleted nodes can also be monitored, and these nodes can be updated in the database in real time, so that fault monitoring is more accurate and reliable. In a specific implementation, as shown in Fig. 5, the embodiment of the present invention may further include:
Step 507: when a third node is added to or deleted from the distributed storage system where the first node is located, the heartbeat monitoring module of the first node receives an add/delete monitoring instruction;
Step 508: according to the add/delete monitoring instruction, the heartbeat monitoring module adds or deletes fault monitoring of the heartbeat messages of the third node.
It can be understood that when a third node is added to or deleted from the distributed storage system, the main thread can trigger the starting or stopping of the heartbeat monitoring module of that node, and can send an add/delete monitoring instruction to the heartbeat monitoring modules of the other nodes in the distributed storage system, instructing those nodes to add or delete heartbeat monitoring of the third node.
As an example, when a third node has been added to the distributed storage system, the main thread can trigger the starting of the third node's heartbeat monitoring module and send an add monitoring instruction to the heartbeat monitoring module of the first node. The instruction may carry the identifier of the added node (that is, the identifier of the third node) and notifies the heartbeat monitoring module of the first node to add heartbeat monitoring of the third node. Thereafter, the heartbeat monitoring module of the third node periodically sends heartbeat messages, and the heartbeat monitoring module of the first node can receive those heartbeat messages and monitor in real time whether the third node has failed.
As another example, when a third node is deleted from the distributed storage system, the main thread can trigger the stopping of the third node's heartbeat monitoring module and send a delete monitoring instruction to the heartbeat monitoring module of the first node. The instruction may carry the identifier of the deleted node (that is, the identifier of the third node) and notifies the heartbeat monitoring module of the first node to delete heartbeat monitoring of the third node. Thereafter, the heartbeat monitoring module of the third node no longer sends heartbeat messages, and the heartbeat monitoring module of the first node no longer needs to monitor in real time whether the third node has failed. It should be noted that when the third node is deleted, the operation of stopping the third node's heartbeat monitoring module may also be skipped: even if the heartbeat monitoring module of the third node still sends heartbeat messages periodically, the first node can shield the heartbeat messages of the third node according to the delete monitoring instruction and thus stop monitoring the state of the third node.
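The add/delete handling in steps 507 and 508 can be sketched as a small table of monitored nodes. This is a minimal sketch, assuming a hypothetical instruction format of an action string (`"add"` or `"delete"`) plus a node identifier; none of these names come from the patent.

```python
from typing import Dict, List, Optional


class MonitorTable:
    """Tracks which nodes the local heartbeat module is currently monitoring."""

    def __init__(self) -> None:
        # node id -> arrival time of the most recent heartbeat (None until one arrives)
        self._last_seen: Dict[str, Optional[float]] = {}

    def apply_instruction(self, action: str, node_id: str) -> None:
        """Handle an add/delete monitoring instruction (hypothetical format)."""
        if action == "add":
            self._last_seen.setdefault(node_id, None)
        elif action == "delete":
            self._last_seen.pop(node_id, None)
        else:
            raise ValueError(f"unknown monitoring action: {action!r}")

    def on_heartbeat(self, node_id: str, arrival: float) -> bool:
        """Record a heartbeat; return False if the sender is not monitored."""
        if node_id not in self._last_seen:
            # Heartbeats from deleted (or never added) nodes are shielded.
            return False
        self._last_seen[node_id] = arrival
        return True

    def monitored(self) -> List[str]:
        return sorted(self._last_seen)
```

Because unknown senders are simply ignored, the sketch also covers the case mentioned above in which the deleted node keeps sending heartbeats.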
It can be seen that, in the embodiment of the present application, the heartbeat monitoring module of the first node first receives the first heartbeat message and the second heartbeat message sent by the second node, where the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node. The heartbeat monitoring module then calculates the first time interval between the first heartbeat message and the second heartbeat message. If the first time interval is greater than the preset time threshold, the heartbeat monitoring module can determine that the second node has failed. In this way, an independent heartbeat monitoring module on the node processes the heartbeat messages sent by other nodes in order to monitor whether those nodes have failed. This overcomes the problem that arises when heartbeat messages and service messages share the same processing unit, namely that interference from service message processing causes node failures to be misjudged. It improves the accuracy of node fault monitoring and thereby the reliability and security of the distributed storage system.
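The separation of heartbeat handling from service handling can be sketched as follows. This is an illustrative sketch only: the message layout (a dict with `type`, `sender`, and `timestamp` keys), the class name, and the use of two in-memory queues are assumptions, not details given in the patent. The timestamp is assumed to be recorded when the message arrives, so a backlog of service messages does not distort the measured interval.

```python
import queue
from typing import Dict, Set


class NodeInbox:
    """Routes heartbeats and service messages to separate queues."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.heartbeat_queue: queue.Queue = queue.Queue()
        self.service_queue: queue.Queue = queue.Queue()
        self._last_seen: Dict[str, float] = {}
        self._faulty: Set[str] = set()

    def receive(self, message: dict) -> None:
        # Heartbeats never wait behind service messages.
        if message["type"] == "heartbeat":
            self.heartbeat_queue.put(message)
        else:
            self.service_queue.put(message)

    def check_heartbeats(self) -> Set[str]:
        """Drain queued heartbeats and return the nodes currently judged faulty."""
        while True:
            try:
                message = self.heartbeat_queue.get_nowait()
            except queue.Empty:
                break
            sender, arrival = message["sender"], message["timestamp"]
            previous = self._last_seen.get(sender)
            if previous is not None:
                # Interval between two adjacent heartbeats from the same node.
                if arrival - previous > self.threshold:
                    self._faulty.add(sender)
                else:
                    self._faulty.discard(sender)
            self._last_seen[sender] = arrival
        return set(self._faulty)
```

In a real deployment the heartbeat queue would presumably be drained by its own thread or process; the single-threaded form here only shows the routing and the interval test.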
Correspondingly, referring to Fig. 6, the embodiment of the present application further provides a fault monitoring device for a node. The device may specifically include:
A first receiving unit 601, configured for the heartbeat monitoring module of the first node to receive the first heartbeat message and the second heartbeat message sent by the second node, where the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
A first computing unit 602, configured for the heartbeat monitoring module to calculate the first time interval between the first heartbeat message and the second heartbeat message;
A first determination unit 603, configured for the heartbeat monitoring module to determine that the second node has failed if the first time interval is greater than the preset time threshold.
Optionally, the device further includes:
A first sending unit, configured for the heartbeat monitoring module to send a service interruption instruction when the second node is determined to have failed, where the service interruption instruction instructs that the service with the second node be interrupted and is also used to trigger an IP drift operation and a database update operation.
Optionally, after the second node is determined to have failed, the device further includes:
A second receiving unit, configured for the heartbeat monitoring module of the first node to receive the third heartbeat message and the fourth heartbeat message sent by the second node, where the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
A second computing unit, configured for the heartbeat monitoring module to calculate the second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
A second determination unit, configured for the heartbeat monitoring module to determine that the second node has restored its connection if the second time interval is not greater than the preset time threshold.
Optionally, the device further includes:
A second sending unit, configured for the heartbeat monitoring module to send a service recovery instruction when the second node is determined to have restored its connection, where the service recovery instruction instructs that the service with the second node be restored and is also used to trigger an IP drift operation and a database update operation.
Optionally, if the second node is the master node, the service interruption instruction or service recovery instruction is also used to trigger the operation of electing a new master node.
Optionally, the device further includes:
A third receiving unit, configured for the heartbeat monitoring module of the first node to receive an add/delete monitoring instruction when a third node is added to or deleted from the distributed storage system where the first node is located;
A monitoring unit, configured for the heartbeat monitoring module to add or delete fault monitoring of the heartbeat messages of the third node according to the add/delete monitoring instruction.
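One way to picture how the receiving, computing, determination, and sending units fit together is the Python sketch below, in which each unit becomes a method and an instruction is emitted only when a node's state changes. The class and method names, the `"interrupt"`/`"recover"` labels, and the choice to emit only on state transitions are assumptions made for illustration; the patent describes the units functionally and does not specify this structure.

```python
from typing import Dict, List, Tuple


class FaultMonitoringDevice:
    """Hypothetical grouping of the units described above into one class."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._previous: Dict[str, float] = {}   # node id -> last heartbeat arrival
        self._faulty: Dict[str, bool] = {}      # node id -> current judgement
        self.sent: List[Tuple[str, str]] = []   # (instruction kind, node id)

    def receive(self, node_id: str, arrival: float) -> None:
        """Receiving unit: accept a heartbeat and pass adjacent pairs on."""
        previous = self._previous.get(node_id)
        self._previous[node_id] = arrival
        if previous is not None:
            self._determine(node_id, self._compute(previous, arrival))

    @staticmethod
    def _compute(previous: float, arrival: float) -> float:
        """Computing unit: interval between two adjacent heartbeats."""
        return arrival - previous

    def _determine(self, node_id: str, interval: float) -> None:
        """Determination unit: compare the interval with the preset threshold."""
        faulty_now = interval > self.threshold
        if faulty_now != self._faulty.get(node_id, False):
            self._faulty[node_id] = faulty_now
            self._send("interrupt" if faulty_now else "recover", node_id)

    def _send(self, kind: str, node_id: str) -> None:
        """Sending unit: record the service interruption/recovery instruction."""
        self.sent.append((kind, node_id))
```

Emitting only on transitions avoids repeating the same instruction for every heartbeat pair while a node stays in one state; whether the patented device does this is not stated in the text.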
The foregoing describes the fault monitoring device for a node. For its specific implementation and the effects achieved, refer to the description of the fault monitoring method embodiment shown in Fig. 2, which is not repeated here.
The word "first" in names such as "first heartbeat message" and "first node" mentioned in the embodiments of the present invention is used only as a naming label and does not indicate first in order. The same rule applies to "second" and so on.
From the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software together with a general-purpose hardware platform. Based on this understanding, the technical solution of the present invention can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, refer to the description of the method embodiment. The method and device embodiments described above are only illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the present invention, and these improvements and refinements shall also fall within the scope of protection of the present invention.

Claims (10)

1. A fault monitoring method for a node, characterized by comprising:
a heartbeat monitoring module of a first node receiving a first heartbeat message and a second heartbeat message sent by a second node, wherein the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
the heartbeat monitoring module calculating a first time interval between the first heartbeat message and the second heartbeat message;
if the first time interval is greater than a preset time threshold, the heartbeat monitoring module determining that the second node has failed.
2. The method according to claim 1, characterized by further comprising:
when the second node is determined to have failed, the heartbeat monitoring module sending a service interruption instruction, which instructs that the service with the second node be interrupted; wherein the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
3. The method according to claim 1, characterized in that, after the second node is determined to have failed, the method further comprises:
the heartbeat monitoring module of the first node receiving a third heartbeat message and a fourth heartbeat message sent by the second node, wherein the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
the heartbeat monitoring module calculating a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
if the second time interval is not greater than the preset time threshold, the heartbeat monitoring module determining that the second node has restored its connection.
4. The method according to claim 3, characterized by further comprising:
when the second node is determined to have restored its connection, the heartbeat monitoring module sending a service recovery instruction, which instructs that the service with the second node be restored; wherein the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
5. The method according to claim 2 or 4, characterized in that, if the second node is the master node, the service interruption instruction or service recovery instruction is also used to trigger the operation of electing a new master node.
6. The method according to any one of claims 1 to 5, characterized by further comprising:
when a third node is added to or deleted from the distributed storage system where the first node is located,
the heartbeat monitoring module of the first node receiving an add/delete monitoring instruction;
the heartbeat monitoring module adding or deleting fault monitoring of the heartbeat messages of the third node according to the add/delete monitoring instruction.
7. A fault monitoring device for a node, characterized by comprising:
a first receiving unit, configured for a heartbeat monitoring module of a first node to receive a first heartbeat message and a second heartbeat message sent by a second node, wherein the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
a first computing unit, configured for the heartbeat monitoring module to calculate a first time interval between the first heartbeat message and the second heartbeat message;
a first determination unit, configured for the heartbeat monitoring module to determine that the second node has failed if the first time interval is greater than a preset time threshold.
8. The device according to claim 7, characterized by further comprising:
a first sending unit, configured for the heartbeat monitoring module to send a service interruption instruction when the second node is determined to have failed, which instructs that the service with the second node be interrupted; wherein the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
9. The device according to claim 7, characterized in that, after the second node is determined to have failed, the device further comprises:
a second receiving unit, configured for the heartbeat monitoring module of the first node to receive a third heartbeat message and a fourth heartbeat message sent by the second node, wherein the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
a second computing unit, configured for the heartbeat monitoring module to calculate a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
a second determination unit, configured for the heartbeat monitoring module to determine that the second node has restored its connection if the second time interval is not greater than the preset time threshold.
10. The device according to claim 9, characterized by further comprising:
a second sending unit, configured for the heartbeat monitoring module to send a service recovery instruction when the second node is determined to have restored its connection, which instructs that the service with the second node be restored; wherein the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
CN201810950141.8A 2018-08-20 2018-08-20 A kind of fault monitoring method and device of node Pending CN109088794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810950141.8A CN109088794A (en) 2018-08-20 2018-08-20 A kind of fault monitoring method and device of node

Publications (1)

Publication Number Publication Date
CN109088794A true CN109088794A (en) 2018-12-25

Family

ID=64793893

Country Status (1)

Country Link
CN (1) CN109088794A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181225

RJ01 Rejection of invention patent application after publication