CN109088794A - Fault monitoring method and device for a node
- Publication number
- CN109088794A CN201810950141.8A
- Authority
- CN
- China
- Prior art keywords
- node
- heartbeat
- heartbeat message
- message
- inspecting module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
Abstract
Embodiments of the present invention disclose a fault monitoring method and device for a node. The method comprises: a heartbeat monitoring module of a first node receives a first heartbeat message and a second heartbeat message sent by a second node, the first heartbeat message and the second heartbeat message being generated by two adjacent heartbeats of the second node; the heartbeat monitoring module then calculates a first time interval between the first heartbeat message and the second heartbeat message; if the first time interval is greater than a preset time threshold, the heartbeat monitoring module determines that the second node has failed. In this way, an independent heartbeat monitoring module on the node handles the heartbeat messages sent by other nodes and monitors their state. This overcomes the problem that, when heartbeat messages and service messages share the same processing unit, interference from service message processing causes node faults to be misjudged. It improves the accuracy of node fault monitoring and thereby the reliability and security of the distributed storage system.
Description
Technical field
The present invention relates to the technical field of distributed storage, and in particular to a fault monitoring method and device for a node.
Background art
With the arrival of the information age, and given the requirements on the security and reliability of information, traditional centralized storage systems (that is, storage systems that keep all data in one place) can no longer meet demand. As a result, distributed storage systems, which store data separately on multiple independent storage servers, are being applied more and more widely. A distributed storage system shares the storage load across multiple storage servers and uses a location server to locate stored information, which improves the reliability, availability and access efficiency of the storage system and also makes it easy to scale.
The nodes in a distributed storage system monitor each other for faults by sending and receiving heartbeat messages. That is, each node continuously sends out heartbeat messages, and the other nodes receive them and judge whether the sending node has failed: if no heartbeat message has been received from a node for longer than a specific duration, that node can be determined to have failed. However, a node is generally configured with only one event-loop processing center, which handles not only heartbeat messages but also various service messages (for example, database recovery, data queries and node election). If any one service message takes too long to process, it may block the sending of heartbeat messages and cause the node's state to be misjudged, that is, the node is wrongly considered to have failed, which in turn affects the normal operation of the distributed storage system.
Therefore, in order to improve the accuracy and reliability of distributed storage systems, there is an urgent need for a node fault monitoring method that avoids heartbeat misjudgment.
Summary of the invention
The technical problem to be solved by the present application is to provide a fault monitoring method and device for a node, so that whenever a node in the distributed storage system fails, the failure can be detected through heartbeat messages, while fault misjudgments caused by overly long service message processing are avoided, thereby improving the reliability and security of the distributed storage system.
In a first aspect, a fault monitoring method for a node is provided, comprising:
A heartbeat monitoring module of a first node receives a first heartbeat message and a second heartbeat message sent by a second node, the first heartbeat message and the second heartbeat message being generated by two adjacent heartbeats of the second node;
The heartbeat monitoring module calculates a first time interval between the first heartbeat message and the second heartbeat message;
If the first time interval is greater than a preset time threshold, the heartbeat monitoring module determines that the second node has failed.
Optionally, the method further comprises:
When the second node is determined to have failed, the heartbeat monitoring module sends a service interruption instruction, which instructs that services with the second node be interrupted; the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, after the second node is determined to have failed, the method further comprises:
The heartbeat monitoring module of the first node receives a third heartbeat message and a fourth heartbeat message sent by the second node, the third heartbeat message and the fourth heartbeat message being generated by two adjacent heartbeats of the second node;
The heartbeat monitoring module calculates a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
If the second time interval is not greater than the preset time threshold, the heartbeat monitoring module determines that the second node has recovered its connection.
Optionally, the method further comprises:
When the second node is determined to have recovered its connection, the heartbeat monitoring module sends a service recovery instruction, which instructs that services with the second node be restored; the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, if the second node is a master node, the service interruption instruction or the service recovery instruction is also used to trigger an operation of electing a new master node.
Optionally, the method further comprises:
When a third node is added to or deleted from the distributed storage system in which the first node is located, the heartbeat monitoring module of the first node receives an add-monitoring or delete-monitoring instruction;
According to the add-monitoring or delete-monitoring instruction, the heartbeat monitoring module adds or deletes fault monitoring of the heartbeat messages of the third node.
In a second aspect, a fault monitoring device for a node is also provided, comprising:
A first receiving unit, configured for the heartbeat monitoring module of a first node to receive a first heartbeat message and a second heartbeat message sent by a second node, the first heartbeat message and the second heartbeat message being generated by two adjacent heartbeats of the second node;
A first calculating unit, configured for the heartbeat monitoring module to calculate a first time interval between the first heartbeat message and the second heartbeat message;
A first determining unit, configured for the heartbeat monitoring module to determine that the second node has failed if the first time interval is greater than a preset time threshold.
Optionally, the device further comprises:
A first sending unit, configured for the heartbeat monitoring module to send a service interruption instruction when the second node is determined to have failed, the instruction being used to instruct that services with the second node be interrupted; the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, the device further comprises the following units, which operate after the second node is determined to have failed:
A second receiving unit, configured for the heartbeat monitoring module of the first node to receive a third heartbeat message and a fourth heartbeat message sent by the second node, the third heartbeat message and the fourth heartbeat message being generated by two adjacent heartbeats of the second node;
A second calculating unit, configured for the heartbeat monitoring module to calculate a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
A second determining unit, configured for the heartbeat monitoring module to determine that the second node has recovered its connection if the second time interval is not greater than the preset time threshold.
Optionally, the device further comprises:
A second sending unit, configured for the heartbeat monitoring module to send a service recovery instruction when the second node is determined to have recovered its connection, the instruction being used to instruct that services with the second node be restored; the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
Optionally, if the second node is a master node, the service interruption instruction or the service recovery instruction is also used to trigger an operation of electing a new master node.
Optionally, the device further comprises:
A third receiving unit, configured for the heartbeat monitoring module of the first node to receive an add-monitoring or delete-monitoring instruction when a third node is added to or deleted from the distributed storage system in which the first node is located;
A monitoring unit, configured for the heartbeat monitoring module to add or delete, according to the add-monitoring or delete-monitoring instruction, fault monitoring of the heartbeat messages of the third node.
In the embodiments of the present application, the heartbeat monitoring module of the first node first receives a first heartbeat message and a second heartbeat message sent by the second node, the first heartbeat message and the second heartbeat message being generated by two adjacent heartbeats of the second node. The heartbeat monitoring module then calculates the first time interval between the first heartbeat message and the second heartbeat message. If the first time interval is greater than the preset time threshold, the heartbeat monitoring module can determine that the second node has failed. In this way, an independent heartbeat monitoring module on the node handles the heartbeat messages sent by other nodes in order to monitor whether those nodes have failed. This overcomes the problem that, when heartbeat messages and service messages share the same processing unit, interference from service message processing causes node faults to be misjudged. It improves the accuracy of node fault monitoring and thereby the reliability and security of the distributed storage system.
Brief description of the drawings
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some of the embodiments described in the present application, and a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of the network system framework involved in an application scenario of an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a fault monitoring method for a node provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a heartbeat monitoring module provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of another fault monitoring method for a node provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of yet another fault monitoring method for a node provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a fault monitoring device for a node provided by an embodiment of the present application.
Detailed description of the embodiments
A distributed storage system stores data separately on multiple independent storage servers, shares the storage load across those servers, and uses a location server to locate stored information. This improves the reliability, availability and access efficiency of the storage system and makes it easy to scale, so such systems are widely favored by users. Fault monitoring of the nodes in a distributed storage system is carried out by sending and receiving heartbeat messages. However, because a node is configured with only one event-loop processing center that handles both heartbeat messages and various service messages, any delay in processing a service message may block the sending of heartbeat messages and cause the node's state to be misjudged, that is, the node is wrongly considered to have failed, which in turn affects the normal operation of the distributed storage system.
On this basis, in order to improve the accuracy and reliability of distributed storage systems, embodiments of the present invention provide a node fault monitoring method that avoids heartbeat misjudgment. First, the heartbeat monitoring module of the first node receives a first heartbeat message and a second heartbeat message sent by the second node, the first heartbeat message and the second heartbeat message being generated by two adjacent heartbeats of the second node. The heartbeat monitoring module then calculates the first time interval between the first heartbeat message and the second heartbeat message. If the first time interval is greater than the preset time threshold, the heartbeat monitoring module can determine that the second node has failed. In this way, an independent heartbeat monitoring module on the node handles the heartbeat messages sent by other nodes in order to monitor whether those nodes have failed. This overcomes the problem that, when heartbeat messages and service messages share the same processing unit, interference from service message processing causes node faults to be misjudged. It improves the accuracy of node fault monitoring and thereby the reliability and security of the distributed storage system.
For example, one scenario to which the embodiments of the present invention can be applied is shown in Fig. 1. The distributed storage system in this scenario may be CTDB, a clustered version of the TDB database that Samba or other applications can use to store data. If an application already uses TDB to store temporary data, it can easily be extended to cluster mode by using CTDB. CTDB provides the same function interface as TDB and is built as a cluster across multiple physical machines. In this distributed storage system, each node can run the CTDB service and expose a virtual IP externally, ensuring an orderly storage process for the distributed storage system.
The distributed storage system 100 may include node 110, node 120, ..., node 1N0 (where N is greater than or equal to 2). Node 110 includes a server 111, a heartbeat monitoring module 112 and a CTDB main thread 113; node 120 includes a server 121, a heartbeat monitoring module 122 and a CTDB main thread 123; similarly, node 1N0 includes a server 1N1, a heartbeat monitoring module 1N2 and a CTDB main thread 1N3.
As an example, the heartbeat monitoring module 112 of node 110 can receive the heartbeat messages sent by other nodes (including node 120 and node 1N0) and process them. Specifically, it judges whether the interval between two adjacent heartbeat messages received from the same node, for example node 1N0, is within the preset time threshold; if the interval exceeds the time threshold, node 110 can determine that node 1N0 has failed.
It can be understood that the above scenario is only one example scenario provided by the embodiments of the present invention, and the embodiments of the present invention are not limited to it.
The specific implementation of the fault monitoring method and device for a node in the embodiments of the present invention is described in detail below through embodiments, with reference to the drawings.
Referring to Fig. 2, a fault monitoring method for a node provided by an embodiment of the present application is shown. The method may include, for example:
Step 201: the heartbeat monitoring module of the first node receives a first heartbeat message and a second heartbeat message sent by the second node; the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node.
It can be understood that, in a distributed storage system, each node can send heartbeat messages through its own heartbeat monitoring module and can also receive, through that module, the heartbeat messages sent by the heartbeat monitoring modules of other nodes. In this way, each node can monitor the state of the other nodes according to the regularity of their heartbeat messages and identify failed nodes, ensuring reliable operation of the distributed storage system. This embodiment is described using the example of the first node monitoring the state of the second node through its heartbeat monitoring module. In practice, the second node can likewise monitor the state of the first node through its heartbeat monitoring module; for how other nodes monitor each other for faults using the embodiments of the present invention, refer to the description of this embodiment, which is not repeated here.
The first node may be any node in the distributed storage system and includes a heartbeat monitoring module for receiving the heartbeat messages sent by other nodes. A heartbeat message is a message that a node in a normal working state periodically sends out to reflect its state. The second node may be any node in the distributed storage system other than the first node, and it can continuously send out heartbeat messages through its own heartbeat monitoring module.
Fig. 3 is a schematic structural diagram of the heartbeat monitoring module provided by an embodiment of the present invention. The heartbeat monitoring module 300 includes an external communication module 301 and an internal communication module 302.
The external communication module 301 is responsible for sending heartbeat messages and for receiving the heartbeat messages of other nodes. The internal communication module 302 is responsible for communicating with the node's own main process (for example, the CTDB main process) and informing that main process of the current heartbeat monitoring result.
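The two-part structure of Fig. 3 can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the patented implementation: the class name `HeartbeatMonitoringModule`, the two queues, and the `(node_id, received_at)` tuple format are all hypothetical, and only the receiving side of the external communication module is modeled. The point it shows is that the module runs in its own thread, separate from whatever handles service messages.

```python
import queue
import threading


class HeartbeatMonitoringModule:
    """Hypothetical sketch of Fig. 3: an external side that takes in
    heartbeats and an internal side that reports to the main process."""

    def __init__(self, time_threshold):
        self.time_threshold = time_threshold
        self.inbound = queue.Queue()   # external communication: heartbeats from other nodes
        self.results = queue.Queue()   # internal communication: results for the main process
        self._last_seen = {}           # node_id -> receive time of the previous heartbeat

    def run(self):
        # Runs in its own thread, so slow service-message handling
        # elsewhere on the node does not delay heartbeat processing.
        while True:
            item = self.inbound.get()
            if item is None:           # sentinel used here to stop the loop
                break
            node_id, received_at = item
            previous = self._last_seen.get(node_id)
            self._last_seen[node_id] = received_at
            if previous is not None:
                faulty = (received_at - previous) > self.time_threshold
                self.results.put((node_id, faulty))


module = HeartbeatMonitoringModule(time_threshold=2.0)
worker = threading.Thread(target=module.run)
worker.start()
for heartbeat in [("node-2", 0.0), ("node-2", 1.0), ("node-2", 5.0), None]:
    module.inbound.put(heartbeat)
worker.join()

reports = []
while not module.results.empty():
    reports.append(module.results.get())
```

In this sketch the main process would read from `results` at its own pace; the heartbeat thread never waits on it.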
In a specific implementation, the second node may send the first heartbeat message to the external communication module in the heartbeat monitoring module of the first node at a first moment, and then send the second heartbeat message to that external communication module at a second moment. The second node sends no other heartbeat message between the first moment and the second moment, that is, the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node. The external communication module in the heartbeat monitoring module of the first node thus receives the first heartbeat message and the second heartbeat message in succession.
In addition, the heartbeat monitoring module of the first node can obtain the sending moment or the receiving moment of each of the first and second heartbeat messages. In one case, the heartbeat monitoring module of the first node parses the received first heartbeat message to obtain the first sending moment carried in it, that is, the moment at which the second node sent the first heartbeat message; similarly, it parses the received second heartbeat message to obtain the second sending moment carried in it, that is, the moment at which the second node sent the second heartbeat message. In another case, the heartbeat monitoring module of the first node records the moment at which it receives the first heartbeat message and takes it as the first receiving moment; similarly, it records the moment at which it receives the second heartbeat message and takes it as the second receiving moment.
It can be understood that obtaining the sending moments or receiving moments of the first and second heartbeat messages prepares for calculating the first time interval in the subsequent step 202.
Step 202: the heartbeat monitoring module calculates the first time interval between receiving the first heartbeat message and receiving the second heartbeat message.
In a specific implementation, the external communication module or the internal communication module of the heartbeat monitoring module can use the obtained sending moments or receiving moments of the first and second heartbeat messages to calculate the time interval between sending, or between receiving, the first heartbeat message and the second heartbeat message. This is recorded as the first time interval and can represent one heartbeat period.
As an example, when the heartbeat monitoring module has obtained the first sending moment T_S1 of the first heartbeat message and the second sending moment T_S2 of the second heartbeat message, it can subtract T_S1 from T_S2 to obtain the duration T_S that elapsed from the first sending moment to the second sending moment, which is recorded as the first time interval.
As another example, when the heartbeat monitoring module has obtained the first receiving moment T_R1 of the first heartbeat message and the second receiving moment T_R2 of the second heartbeat message, it can subtract T_R1 from T_R2 to obtain the duration T_R that elapsed from the first receiving moment to the second receiving moment, which is recorded as the first time interval.
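The two ways of computing the first time interval, T_S = T_S2 - T_S1 and T_R = T_R2 - T_R1, can be sketched as follows. The `Heartbeat` record and its field names are assumptions made for illustration; the text does not specify a message format.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    node_id: str
    sent_at: float      # sending moment carried in the message (assumed field)
    received_at: float  # receiving moment recorded locally by the monitor


def send_interval(first: Heartbeat, second: Heartbeat) -> float:
    """T_S = T_S2 - T_S1, based on the sender's timestamps."""
    return second.sent_at - first.sent_at


def receive_interval(first: Heartbeat, second: Heartbeat) -> float:
    """T_R = T_R2 - T_R1, based on the receiver's timestamps."""
    return second.received_at - first.received_at


hb1 = Heartbeat("node-2", sent_at=10.0, received_at=10.5)
hb2 = Heartbeat("node-2", sent_at=12.0, received_at=12.5)
t_s = send_interval(hb1, hb2)
t_r = receive_interval(hb1, hb2)
```

One practical difference, offered as an observation rather than something the text states: the sending-moment variant depends on the sender's clock only, while the receiving-moment variant depends on the receiver's clock only, so neither requires the two nodes' clocks to be synchronized.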
Step 203: if the first time interval is greater than the preset time threshold, the heartbeat monitoring module determines that the second node has failed.
It can be understood that the time threshold may be a preset maximum permitted value of the heartbeat period. When the first time interval between the first heartbeat message and the second heartbeat message is greater than the time threshold, the second node's heartbeat messages are being sent abnormally, so the second node can be determined to have failed. Otherwise, when the first time interval between the first heartbeat message and the second heartbeat message is not greater than the time threshold, the second node's heartbeat messages are being sent normally, and fault monitoring continues with the next heartbeat period.
In a specific implementation, the external communication module or the internal communication module in the heartbeat monitoring module of the first node compares the time threshold with the calculated first time interval. When the heartbeat monitoring module determines that the first time interval is greater than the time threshold, it determines that the second node has failed; when it determines that the first time interval is less than or equal to the time threshold, it determines that the second node is normal.
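The comparison in step 203 is a strict "greater than" test, so an interval exactly equal to the threshold still counts as normal. A minimal sketch, with the `NodeState` enum and the function name `judge` introduced here purely for illustration:

```python
from enum import Enum


class NodeState(Enum):
    NORMAL = "normal"
    FAULTY = "faulty"


def judge(first_time_interval: float, time_threshold: float) -> NodeState:
    """Step 203: a node is faulty only if the interval strictly exceeds the threshold."""
    if first_time_interval > time_threshold:
        return NodeState.FAULTY
    return NodeState.NORMAL


over = judge(3.0, 2.0)       # interval above the threshold
boundary = judge(2.0, 2.0)   # interval equal to the threshold
```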
It should be noted that the description above only covers the case where, after the second node fails, sending the second heartbeat message incurs an excessive delay. If the second node no longer generates the second heartbeat message after failing, that is, the heartbeat monitoring module of the first node never receives the second heartbeat message, the first time interval tends toward infinity, so fault monitoring can still be carried out by this embodiment and the second node determined to have failed.
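The text does not say how a monitor notices that the second heartbeat never arrives. One common approach, shown here as an assumption rather than as the patent's method, is to check periodically how long it has been since the last heartbeat; that elapsed time is a lower bound on the eventual interval, so once it exceeds the threshold the node can already be judged faulty. The function name and the explicit `now` parameter are hypothetical; passing the current time in makes the check easy to test.

```python
def timed_out(last_received_at: float, now: float, time_threshold: float) -> bool:
    """Return True once the time since the last heartbeat exceeds the threshold.

    If no further heartbeat arrives, (now - last_received_at) keeps growing,
    which mirrors the interval "tending toward infinity" in the text.
    """
    return (now - last_received_at) > time_threshold


still_waiting = timed_out(last_received_at=100.0, now=101.0, time_threshold=2.0)
gave_up = timed_out(last_received_at=100.0, now=103.5, time_threshold=2.0)
```

In a running system such a check would be driven by a timer inside the heartbeat monitoring module, with `now` taken from a monotonic clock.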
As a possible example, when the second node is determined to have failed, the heartbeat monitoring module sends a service interruption instruction, which instructs that services with the second node be interrupted; the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
The service interruption instruction may specifically be sent by the internal communication module of the heartbeat monitoring module of the first node to the node's internal main process. The service interruption instruction may include not only the identifier of the failed node whose services are to be interrupted, that is, the identifier of the second node, but also the operation command for the IP drift operation and the operation command for the database update operation. Therefore, after receiving the service interruption instruction, the main process knows which target node has failed and needs its services interrupted, and can be triggered to execute the operation commands for IP drift and database update.
It should be noted that, if the second node is a master node, the service interruption instruction may also include the operation command for electing a new master node, which triggers the election of a new master node when the second node fails.
After step 203, that is, after the second node has been determined to have failed, the second node may be replaced or repaired. To allow such a node to be restored flexibly and to improve resource utilization, the fault monitoring method provided by the embodiments of the present invention may, as shown in Fig. 4, further include:
Step 404: the heartbeat monitoring module of the first node receives a third heartbeat message and a fourth heartbeat message sent by the second node; the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
Step 405: the heartbeat monitoring module calculates the second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
Step 406: if the second time interval is not greater than the preset time threshold, the heartbeat monitoring module determines that the second node has recovered its connection.
It can be understood that, for the third heartbeat message and the fourth heartbeat message, reference may be made to the description of the first heartbeat message and the second heartbeat message, and the second time interval is calculated in the same way as the first time interval, so the details are not repeated here.
It should be noted that, after the second node has been determined to have failed, if the external communication module in the heartbeat monitoring module of the first node again receives, within a certain period, two heartbeat messages generated by two adjacent heartbeats of the second node, that is, the third heartbeat message and the fourth heartbeat message, the heartbeat monitoring module of the first node can calculate the second time interval of step 405 in the manner described for step 202.
When the second time interval is not greater than the preset time threshold, the second node's heartbeat messages have returned to normal, and the heartbeat monitoring module of the first node can determine that the second node has recovered its connection. Otherwise, when the second time interval is still greater than the preset time threshold, the second node's heartbeat messages are still abnormal.
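Fault detection (steps 201 to 203) and recovery detection (steps 404 to 406) can be combined into one small state tracker. The sketch below assumes receive-time intervals and a hypothetical `NodeMonitor` class; it returns an event name only when the judged state of the node changes.

```python
from typing import Optional


class NodeMonitor:
    """Hypothetical per-node tracker built on receive-time intervals."""

    def __init__(self, time_threshold: float):
        self.time_threshold = time_threshold
        self.last_received_at: Optional[float] = None
        self.faulty = False

    def on_heartbeat(self, received_at: float) -> Optional[str]:
        """Process one heartbeat; return "fault", "recovered", or None."""
        event = None
        if self.last_received_at is not None:
            interval = received_at - self.last_received_at
            if interval > self.time_threshold and not self.faulty:
                self.faulty = True
                event = "fault"        # steps 201-203
            elif interval <= self.time_threshold and self.faulty:
                self.faulty = False
                event = "recovered"    # steps 404-406
        self.last_received_at = received_at
        return event


monitor = NodeMonitor(time_threshold=2.0)
events = [monitor.on_heartbeat(t) for t in (0.0, 1.0, 10.0, 11.0)]
```

Here the heartbeats at 10.0 and 11.0 play the role of the third and fourth heartbeat messages: the long gap before 10.0 marks the fault, and the short gap between 10.0 and 11.0 marks the recovery.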
As a possible example, when the second node is determined to have recovered its connection, the heartbeat monitoring module sends a service recovery instruction, which instructs that services with the second node be restored; the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
The service recovery instruction may specifically be sent by the internal communication module of the heartbeat monitoring module of the first node to the node's internal main process. The service recovery instruction may include not only the identifier of the previously failed node whose services are to be restored, that is, the identifier of the second node, but also the operation command for the IP drift operation and the operation command for the database update operation. Therefore, after receiving the service recovery instruction, the main process knows which target node had its services interrupted and needs them restored, and can be triggered to execute the operation commands for IP drift and database update.
It should be noted that, if the second node was a master node before it failed, the service recovery instruction may also include the operation command for electing a new master node, which triggers the election of a new master node when the second node recovers its connection.
In addition, to support node fault monitoring in a distributed storage system whose structure changes, newly added or deleted nodes can also be handled, and the database can be updated in real time with the added or deleted nodes, so that fault monitoring is more precise and reliable. In a specific implementation, as shown in Fig. 5, the embodiments of the present invention may further include:
Step 507: when a third node is added to or deleted from the distributed storage system in which the first node is located, the heartbeat monitoring module of the first node receives an add-monitoring or delete-monitoring instruction;
Step 508: according to the add-monitoring or delete-monitoring instruction, the heartbeat monitoring module adds or deletes fault monitoring of the heartbeat messages of the third node.
It can be understood that, when a third node is added to or deleted from the distributed storage system, the main thread may trigger the starting or shutting down of the heartbeat monitoring module of the corresponding node, and may send an add-monitoring or delete-monitoring instruction to the heartbeat monitoring modules of the other nodes in the distributed storage system, instructing those nodes to add or delete heartbeat monitoring of the third node.
As an example, when a third node is added to the distributed storage system, the main thread can trigger the starting of the heartbeat monitoring module of the third node and send an add monitoring instruction to the heartbeat monitoring module of the first node. The instruction may carry the identifier of the added node (that is, the identifier of the third node) and notifies the heartbeat monitoring module of the first node to add heartbeat monitoring of the third node. Thereafter, the heartbeat monitoring module of the third node can periodically send heartbeat messages, and the heartbeat monitoring module of the first node can monitor in real time whether the third node has failed by receiving those heartbeat messages.
As another example, when the third node is deleted from the distributed storage system, the main thread can trigger the closing of the heartbeat monitoring module of the third node and send a delete monitoring instruction to the heartbeat monitoring module of the first node. The instruction may carry the identifier of the deleted node (that is, the identifier of the third node) and notifies the heartbeat monitoring module of the first node to delete heartbeat monitoring of the third node. Thereafter, the heartbeat monitoring module of the third node no longer sends heartbeat messages, and the heartbeat monitoring module of the first node no longer needs to monitor in real time whether the third node has failed. It should be noted that, when the third node is deleted, the operation of "triggering the closing of the heartbeat monitoring module of the third node" may also be omitted. In that case, even if the heartbeat monitoring module of the third node still sends heartbeat messages periodically, the first node can shield the heartbeat messages of the third node according to the delete monitoring instruction and thus stop monitoring the state of the third node.
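The add/delete handling in the two examples above can be sketched as a small registry of monitored nodes. This is only an illustration under stated assumptions: the class `HeartbeatMonitor`, its method names, and the instruction kinds `"add"` and `"delete"` are hypothetical and are not defined in the application.

```python
class HeartbeatMonitor:
    """Hypothetical heartbeat monitoring module of the first node."""

    def __init__(self):
        # Identifiers of the nodes whose heartbeats are currently monitored.
        self.monitored = set()

    def handle_instruction(self, kind: str, node_id: str) -> None:
        """Apply an add/delete monitoring instruction that carries a node identifier."""
        if kind == "add":
            self.monitored.add(node_id)
        elif kind == "delete":
            # discard() does nothing if the node was never monitored.
            self.monitored.discard(node_id)
        else:
            raise ValueError(f"unknown instruction kind: {kind}")

    def accept_heartbeat(self, node_id: str) -> bool:
        """Return True if the heartbeat should be processed, False if it is shielded."""
        return node_id in self.monitored
```

With this shape, a deleted node that keeps sending heartbeats is simply ignored, which corresponds to the shielding case described above where the third node's own module is not closed.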
As can be seen, in the embodiment of the present application, the heartbeat monitoring module of the first node first receives a first heartbeat message and a second heartbeat message sent by the second node, where the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node. The heartbeat monitoring module then calculates a first time interval between the first heartbeat message and the second heartbeat message. If the first time interval is greater than a preset time threshold, the heartbeat monitoring module can determine that the second node has failed. In this way, an independent heartbeat monitoring module on each node processes the heartbeat messages sent by other nodes in order to monitor whether those nodes have failed. This overcomes the problem that arises when heartbeat messages and service messages share the same processing unit, where interference from service message processing can cause node failures to be misjudged. The accuracy of node fault monitoring is thereby improved, which in turn improves the reliability and security of the distributed storage system.
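The determination summarized above reduces to one subtraction and one comparison. The sketch below illustrates it in Python, assuming each heartbeat message is represented by the time (in seconds) at which the heartbeat monitoring module received it. The function name `is_node_failed` and its parameters are hypothetical and do not appear in the application.

```python
def is_node_failed(first_recv_time: float, second_recv_time: float,
                   threshold: float) -> bool:
    """Return True if the gap between two adjacent heartbeats exceeds the threshold.

    All three arguments are in seconds; the threshold is the preset time threshold.
    """
    # First time interval between the two adjacent heartbeat messages.
    interval = second_recv_time - first_recv_time
    # Strictly greater than the threshold means the node is judged to have failed.
    return interval > threshold
```

For example, with a threshold of 3 seconds, heartbeats received at 0.0 and 1.0 give an interval of 1.0 and the node is considered healthy, whereas heartbeats received at 0.0 and 5.0 give an interval of 5.0 and the node is judged to have failed.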
Correspondingly, referring to Fig. 6, an embodiment of the present application further provides a fault monitoring device for a node. The device may specifically include:
First receiving unit 601, configured for the heartbeat monitoring module of the first node to receive a first heartbeat message and a second heartbeat message sent by the second node, where the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
First computing unit 602, configured for the heartbeat monitoring module to calculate a first time interval between the first heartbeat message and the second heartbeat message;
First determination unit 603, configured for the heartbeat monitoring module to determine that the second node has failed if the first time interval is greater than a preset time threshold.
Optionally, the device further includes:
First transmission unit, configured for the heartbeat monitoring module to send a service interruption instruction when the second node is determined to have failed, where the service interruption instruction instructs that services with the second node be interrupted and is also used to trigger an IP drift operation and a database update operation.
Optionally, for use after the second node is determined to have failed, the device further includes:
Second receiving unit, configured for the heartbeat monitoring module of the first node to receive a third heartbeat message and a fourth heartbeat message sent by the second node, where the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
Second computing unit, configured for the heartbeat monitoring module to calculate a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
Second determination unit, configured for the heartbeat monitoring module to determine that the second node has restored its connection if the second time interval is not greater than the preset time threshold.
Optionally, the device further includes:
Second transmission unit, configured for the heartbeat monitoring module to send a service recovery instruction when the second node is determined to have restored its connection, where the service recovery instruction instructs that services with the second node be restored and is also used to trigger an IP drift operation and a database update operation.
Optionally, if the second node is the master node, the service interruption instruction or service recovery instruction is also used to trigger the election of a new master node.
Optionally, the device further includes:
Third receiving unit, configured for the heartbeat monitoring module of the first node to receive an add/delete monitoring instruction when a third node is added to or deleted from the distributed storage system in which the first node is located;
Monitoring unit, configured for the heartbeat monitoring module to add or delete fault monitoring of the heartbeat messages of the third node according to the add/delete monitoring instruction.
The foregoing describes the fault monitoring device for a node. For its specific implementations and the effects achieved, refer to the description of the embodiment of the fault monitoring method for a node shown in Fig. 2; details are not repeated here.
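To show how the receiving, computing, determination and transmission units could fit together, here is a minimal Python sketch. It is not the applicant's implementation: the class `FaultMonitoringDevice`, the method `on_heartbeat`, and the instruction strings are hypothetical names, and each heartbeat is assumed to be identified by its receive time in seconds.

```python
from typing import Dict, Optional


class FaultMonitoringDevice:
    """Hypothetical device combining the units described above."""

    def __init__(self, threshold: float):
        self.threshold = threshold              # preset time threshold (seconds)
        self.last_seen: Dict[str, float] = {}   # receive time of the previous heartbeat
        self.failed: Dict[str, bool] = {}       # current failure state per node

    def on_heartbeat(self, node_id: str, recv_time: float) -> Optional[str]:
        """Process one heartbeat and return an instruction name, or None."""
        previous = self.last_seen.get(node_id)
        self.last_seen[node_id] = recv_time     # receiving unit
        if previous is None:
            return None                         # need two adjacent heartbeats
        interval = recv_time - previous         # computing unit
        was_failed = self.failed.get(node_id, False)
        # First determination unit: interval above threshold means failure.
        if not was_failed and interval > self.threshold:
            self.failed[node_id] = True
            return "service_interruption"       # first transmission unit
        # Second determination unit: interval within threshold means recovery.
        if was_failed and interval <= self.threshold:
            self.failed[node_id] = False
            return "service_recovery"           # second transmission unit
        return None
```

Note that, as in the described method, this sketch only reaches a decision when a heartbeat arrives, because the interval is measured between two received messages.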
The word "first" in names such as "first heartbeat message" and "first node" mentioned in the embodiments of the present invention is used only as a naming label and does not denote first in sequence. The same applies to "second" and the like.
From the above description of the embodiments, those skilled in the art can clearly understand that all or some of the steps in the methods of the above embodiments can be implemented by software plus a general-purpose hardware platform. Based on this understanding, the technical solution of the present invention can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a read-only memory (ROM), a RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For the parts that are the same or similar among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment. The method and device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the present invention, and these improvements and refinements shall also fall within the scope of protection of the present invention.
Claims (10)
1. A fault monitoring method for a node, characterized by comprising:
receiving, by a heartbeat monitoring module of a first node, a first heartbeat message and a second heartbeat message sent by a second node, wherein the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
calculating, by the heartbeat monitoring module, a first time interval between the first heartbeat message and the second heartbeat message;
if the first time interval is greater than a preset time threshold, determining, by the heartbeat monitoring module, that the second node has failed.
2. The method according to claim 1, characterized by further comprising:
when the second node is determined to have failed, sending, by the heartbeat monitoring module, a service interruption instruction that instructs that services with the second node be interrupted, wherein the service interruption instruction is also used to trigger an IP drift operation and a database update operation.
3. The method according to claim 1, characterized in that, after the second node is determined to have failed, the method further comprises:
receiving, by the heartbeat monitoring module of the first node, a third heartbeat message and a fourth heartbeat message sent by the second node, wherein the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
calculating, by the heartbeat monitoring module, a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
if the second time interval is not greater than the preset time threshold, determining, by the heartbeat monitoring module, that the second node has restored its connection.
4. The method according to claim 3, characterized by further comprising:
when the second node is determined to have restored its connection, sending, by the heartbeat monitoring module, a service recovery instruction that instructs that services with the second node be restored, wherein the service recovery instruction is also used to trigger an IP drift operation and a database update operation.
5. The method according to claim 2 or 4, characterized in that, if the second node is the master node, the service interruption instruction or service recovery instruction is also used to trigger the election of a new master node.
6. The method according to any one of claims 1 to 5, characterized by further comprising:
when a third node is added to or deleted from the distributed storage system in which the first node is located, receiving, by the heartbeat monitoring module of the first node, an add/delete monitoring instruction;
adding or deleting, by the heartbeat monitoring module according to the add/delete monitoring instruction, fault monitoring of the heartbeat messages of the third node.
7. A fault monitoring device for a node, characterized by comprising:
a first receiving unit, configured for a heartbeat monitoring module of a first node to receive a first heartbeat message and a second heartbeat message sent by a second node, wherein the first heartbeat message and the second heartbeat message are generated by two adjacent heartbeats of the second node;
a first computing unit, configured for the heartbeat monitoring module to calculate a first time interval between the first heartbeat message and the second heartbeat message;
a first determination unit, configured for the heartbeat monitoring module to determine that the second node has failed if the first time interval is greater than a preset time threshold.
8. The device according to claim 7, characterized by further comprising:
a first transmission unit, configured for the heartbeat monitoring module to send a service interruption instruction when the second node is determined to have failed, wherein the service interruption instruction instructs that services with the second node be interrupted and is also used to trigger an IP drift operation and a database update operation.
9. The device according to claim 7, characterized in that, for use after the second node is determined to have failed, the device further comprises:
a second receiving unit, configured for the heartbeat monitoring module of the first node to receive a third heartbeat message and a fourth heartbeat message sent by the second node, wherein the third heartbeat message and the fourth heartbeat message are generated by two adjacent heartbeats of the second node;
a second computing unit, configured for the heartbeat monitoring module to calculate a second time interval between receiving the third heartbeat message and receiving the fourth heartbeat message;
a second determination unit, configured for the heartbeat monitoring module to determine that the second node has restored its connection if the second time interval is not greater than the preset time threshold.
10. The device according to claim 9, characterized by further comprising:
a second transmission unit, configured for the heartbeat monitoring module to send a service recovery instruction when the second node is determined to have restored its connection, wherein the service recovery instruction instructs that services with the second node be restored and is also used to trigger an IP drift operation and a database update operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810950141.8A CN109088794A (en) | 2018-08-20 | 2018-08-20 | A kind of fault monitoring method and device of node |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109088794A true CN109088794A (en) | 2018-12-25 |
Family
ID=64793893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810950141.8A Pending CN109088794A (en) | 2018-08-20 | 2018-08-20 | A kind of fault monitoring method and device of node |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109088794A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101989999A (en) * | 2010-11-12 | 2011-03-23 | 华中科技大学 | Hierarchical storage system in distributed environment |
CN102394914A (en) * | 2011-09-22 | 2012-03-28 | 浪潮(北京)电子信息产业有限公司 | Cluster brain-split processing method and device |
CN103117901A (en) * | 2013-02-01 | 2013-05-22 | 华为技术有限公司 | Distributed heartbeat detection method, device and system |
CN105245531A (en) * | 2015-10-21 | 2016-01-13 | 北京捷思锐科技股份有限公司 | Disconnection detection method, device and server |
CN105446852A (en) * | 2014-09-28 | 2016-03-30 | 中国航空工业集团公司西安飞机设计研究所 | High reliability cascaded heartbeat design method |
CN107360239A (en) * | 2017-07-25 | 2017-11-17 | 郑州云海信息技术有限公司 | A kind of client connection status detection method and system |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112118145A (en) * | 2019-06-19 | 2020-12-22 | 北京沃东天骏信息技术有限公司 | Node state monitoring method, control device and monitoring device |
CN110611603B (en) * | 2019-09-09 | 2021-08-31 | 苏州浪潮智能科技有限公司 | Cluster network card monitoring method and device |
CN110611603A (en) * | 2019-09-09 | 2019-12-24 | 苏州浪潮智能科技有限公司 | Cluster network card monitoring method and device |
CN112865993B (en) * | 2019-11-27 | 2022-10-14 | 上海哔哩哔哩科技有限公司 | Method and device for switching slave nodes in distributed master-slave system |
CN112865993A (en) * | 2019-11-27 | 2021-05-28 | 上海哔哩哔哩科技有限公司 | Method and device for switching slave nodes in distributed master-slave system |
CN111586110A (en) * | 2020-04-22 | 2020-08-25 | 广州锦行网络科技有限公司 | Optimization processing method for raft in point-to-point fault |
CN111651294A (en) * | 2020-05-13 | 2020-09-11 | 浙江华创视讯科技有限公司 | Node abnormity detection method and device |
CN113805788A (en) * | 2020-06-12 | 2021-12-17 | 华为技术有限公司 | Distributed storage system and exception handling method and related device thereof |
WO2021249173A1 (en) * | 2020-06-12 | 2021-12-16 | 华为技术有限公司 | Distributed storage system, abnormality processing method therefor, and related device |
CN113805788B (en) * | 2020-06-12 | 2024-04-09 | 华为技术有限公司 | Distributed storage system and exception handling method and related device thereof |
CN112671603A (en) * | 2020-12-15 | 2021-04-16 | 中国联合网络通信集团有限公司 | Fault detection method and server |
CN112988463A (en) * | 2021-02-23 | 2021-06-18 | 新华三大数据技术有限公司 | Fault node isolation method and device |
CN112988463B (en) * | 2021-02-23 | 2022-08-30 | 新华三大数据技术有限公司 | Fault node isolation method and device |
CN113114535A (en) * | 2021-04-12 | 2021-07-13 | 北京字跳网络技术有限公司 | Network fault detection method and device and electronic equipment |
CN114007246A (en) * | 2021-10-29 | 2022-02-01 | 北京天融信网络安全技术有限公司 | Method, apparatus, computer device and medium for reducing network congestion |
CN114007246B (en) * | 2021-10-29 | 2024-02-02 | 北京天融信网络安全技术有限公司 | Method, apparatus, computer device and medium for reducing network congestion |
CN114750774A (en) * | 2021-12-20 | 2022-07-15 | 广州汽车集团股份有限公司 | Safety monitoring method and automobile |
CN115333983A (en) * | 2022-08-16 | 2022-11-11 | 超聚变数字技术有限公司 | Heartbeat management method and node |
CN115333983B (en) * | 2022-08-16 | 2023-10-10 | 超聚变数字技术有限公司 | Heartbeat management method and node |
CN116684256A (en) * | 2023-08-01 | 2023-09-01 | 苏州浪潮智能科技有限公司 | Node fault monitoring method, device and system, electronic equipment and storage medium |
CN116684256B (en) * | 2023-08-01 | 2023-11-03 | 苏州浪潮智能科技有限公司 | Node fault monitoring method, device and system, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181225 |