CN100394730C - Method for rapidly detecting error in high availability system - Google Patents

Publication number
CN100394730C
Authority
CN
China
Prior art keywords
node
ham
high availability
network interface
management board
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100901589A
Other languages
Chinese (zh)
Other versions
CN1767455A (en)
Inventor
费鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
UTStarcom Telecom Co Ltd
Original Assignee
UTStarcom Telecom Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by UTStarcom Telecom Co Ltd filed Critical UTStarcom Telecom Co Ltd
Priority to CNB2004100901589A priority Critical patent/CN100394730C/en
Publication of CN1767455A publication Critical patent/CN1767455A/en
Application granted granted Critical
Publication of CN100394730C publication Critical patent/CN100394730C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention provides a method for rapidly detecting errors in a high-availability system. The system comprises an IP backplane bus, a switching unit (SEM), one or more nodes (PPB) and a high availability management board (HAM). The method comprises the following steps: (a) the switching unit (SEM) detects the physical state of its network interfaces, each of which is connected to a node (PPB) through the IP backplane bus; (b) when the switching unit (SEM) detects that the physical state of a network interface changes from a first predetermined state to a second predetermined state, the switching unit (SEM) sends a predetermined signal to the high availability management board (HAM) through an independent control channel; (c) the high availability management board (HAM) performs a predetermined operation on the node according to the received predetermined signal.

Description

Method for rapidly detecting errors in a high-availability system
Technical field
The present invention relates to a method for rapidly detecting errors in a high-availability system.
Background art
A system connected by a network allows its nodes to share resources and work together, so that services can be provided jointly through synchronization between nodes. The telecommunication system shown schematically in Fig. 1 is such a network-connected system. Within the chassis, function board 1, function board 2, the high availability management board (High Availability Manager, HAM), function board 3 and other nodes are connected through the IP backplane bus via a switching unit to form the system. The nodes can communicate with one another through the switching performed by the switching unit.
A telecommunication system requires a system availability of 99.999%, that is, over one year the time during which service cannot be provided because of system errors must be less than 315.36 seconds. To guarantee the high availability of the system, the HAM should be able to rapidly detect errors in the nodes it manages and carry out error recovery.
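The 315.36-second figure follows directly from the 99.999% requirement applied to a 365-day year. The short calculation below reproduces it; the function name and the choice of a non-leap year are assumptions made only for illustration.

```python
# Downtime budget implied by an availability target.
# A 365-day year is assumed, which is what the 315.36 s figure implies.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_budget_seconds(availability: float,
                            period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the maximum time the system may be unavailable in the period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_seconds


budget = downtime_budget_seconds(0.99999)
print(f"{budget:.2f} seconds of downtime allowed per year")
```

Every second spent waiting for an error to be noticed is drawn from this budget, which is why the detection delay matters.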
Heartbeat detection is an important method for detecting node errors. Under the heartbeat detection mechanism, after the HAM receives a registration message from a node in the system it manages, it periodically sends heartbeat messages to that node, and the node returns a response message to the HAM after receiving each heartbeat message. If the HAM fails to receive the node's response several times in succession, it considers that the node has suffered an error and that error recovery is needed. For example, after function board 1 starts, it registers with the HAM, and the HAM then periodically sends heartbeat messages to function board 1. If function board 1 fails to return a response message 3 times in succession, the HAM considers that an error has occurred on function board 1. The HAM can carry out error recovery by hardware-resetting function board 1, after which function board 1 restarts and registers again.
When certain types of abnormal errors occur (see the description below), the existing heartbeat detection mechanism cannot detect the error rapidly at the moment it occurs, but can detect it only after several consecutive heartbeat detection periods have elapsed, which reduces the availability of the system.
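To make the delay concrete, the sketch below bounds the time between a node failing and a heartbeat-only monitor noticing it. It rests on assumptions not stated in the text: one heartbeat is sent per fixed interval, the reply timeout equals that interval, and the node fails silently at some point between two heartbeats. The function name and the 1-second interval are hypothetical.

```python
def heartbeat_detection_delay(interval_s: float, misses_needed: int) -> tuple:
    """Bound the failure-to-detection delay of a heartbeat-only monitor.

    Assumes one heartbeat per interval and a reply timeout equal to the
    interval. A node that fails just before a heartbeat is sent is detected
    after `misses_needed` intervals; a node that fails just after replying
    is detected after almost one interval more.
    """
    if interval_s <= 0 or misses_needed < 1:
        raise ValueError("interval must be positive and misses_needed >= 1")
    shortest = misses_needed * interval_s
    longest = (misses_needed + 1) * interval_s  # upper bound, not reached
    return shortest, longest


# Three consecutive missed replies, as in the function board 1 example,
# with an assumed 1-second heartbeat interval.
low, high = heartbeat_detection_delay(interval_s=1.0, misses_needed=3)
print(f"detection takes between {low} s and just under {high} s")
```

Under these assumptions the delay grows linearly with both the interval and the number of misses required, so it cannot be shortened without sending more heartbeats or tolerating fewer misses.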
Summary of the invention
Compared with existing heartbeat detection methods, the present invention can rapidly detect the error of a managed node after the node restarts abnormally, and thereby improves the high availability of the system.
The invention provides a method for rapidly detecting node errors in a high-availability system. The system comprises an IP backplane bus, which is connected to a switching unit (SEM), one or more nodes (PPB) and a high availability management board (HAM). The nodes and the high availability management board (HAM) each communicate with the switching unit (SEM), and the high availability management board (HAM) also communicates with the switching unit (SEM) through an independent control channel. The method comprises the steps of:
(a) the switching unit (SEM) detecting the physical state of its network interfaces, wherein each network interface is connected to a respective node (PPB) through the IP backplane bus;
(b) when the switching unit detects that the physical state of a network interface changes from a first predetermined state to a second predetermined state, the switching unit (SEM) sending a predetermined signal to the high availability management unit (HAM) through said independent control channel;
(c) the high availability management board (HAM) performing a predetermined operation on the node according to the received predetermined signal.
The present invention also provides a high-availability system capable of rapidly detecting node errors, comprising an IP backplane bus, one or more nodes (PPB) connected to the IP backplane bus, a high availability management board (HAM) connected to the IP backplane bus, and a switching unit (SEM) connected to the IP backplane bus. The nodes and the high availability management board (HAM) can each communicate with the switching unit (SEM) through the IP backplane bus, and the switching unit (SEM) also has an independent control channel connected to the high availability management board (HAM). The switching unit comprises: detection means for detecting the physical state of its network interfaces, wherein each network interface is connected to a respective node (PPB); and sending means for sending a predetermined signal to the high availability management unit (HAM) through the independent control channel when the detection means detects that the physical state of a network interface changes from a first predetermined state to a second predetermined state. The high availability management board comprises: control means for performing a predetermined operation on the node according to the received predetermined signal.
In addition, the high availability management board (HAM) simultaneously performs heartbeat detection on each node, the heartbeat detection comprising the following steps:
(i) sending a heartbeat message to the node;
(ii) when no reply from the node is received within a predetermined time, increasing the cumulative number of unanswered heartbeats by 1;
(iii) comparing the number of unanswered heartbeats with a predetermined maximum number of retries; if the number of unanswered heartbeats is greater than the predetermined maximum number of retries, the high availability management board performs the predetermined operation; if the number of unanswered heartbeats is not greater than the predetermined maximum number of retries, returning to step (i);
(iv) if a reply from the node is received within the predetermined time, resetting the cumulative number of unanswered heartbeats to zero and then returning to step (i).
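Steps (i) to (iv) amount to a small counting loop, which can be sketched as follows. This is an illustration only: `send_and_wait`, `recover` and `rounds` are hypothetical names, the transport and timeout are left to the caller, and resetting the count after recovery is taken from the embodiment described later rather than from the steps themselves.

```python
from typing import Callable


def heartbeat_monitor(send_and_wait: Callable[[], bool],
                      recover: Callable[[], None],
                      max_retries: int,
                      rounds: int) -> int:
    """Run `rounds` heartbeat cycles and return the final unanswered count.

    send_and_wait() sends one heartbeat and returns True if the node
    replied within the predetermined time, False otherwise.
    """
    unanswered = 0
    for _ in range(rounds):
        replied = send_and_wait()        # step (i)
        if replied:
            unanswered = 0               # step (iv): reply resets the count
            continue
        unanswered += 1                  # step (ii): one more missed reply
        if unanswered > max_retries:     # step (iii): strictly greater than
            recover()                    # the predetermined operation
            unanswered = 0               # reset after recovery (embodiment)
    return unanswered


# Usage: a node that answers once, then stays silent for four cycles.
replies = iter([True, False, False, False, False])
recoveries = []
heartbeat_monitor(lambda: next(replies), lambda: recoveries.append("reset"),
                  max_retries=3, rounds=5)
print(recoveries)
```

Because step (iii) uses a strict comparison, a maximum of 3 retries triggers recovery on the fourth consecutive missed reply, not the third.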
The first predetermined state detected in the detecting step is the running state of the network interface, and the second predetermined state is the stopped state of the network interface.
The predetermined operation performed by the high availability management board (HAM) is an error recovery operation on the node.
The running state and the stopped state of the network interface are obtained by detecting the mechanical and electrical characteristics of the network interface.
Description of drawings
Fig. 1 shows the schematic diagram of a telecommunication system.
Fig. 2 shows wrong rapid detection system schematic diagram according to an embodiment of the invention.
Fig. 3 shows the structure of a switching unit (SEM) according to an embodiment of the present invention.
Fig. 4 shows quick fault detection procedures according to an embodiment of the invention.
Embodiments
The method provided by the invention is described in detail in the following preferred embodiment.
Fig. 2 is a schematic diagram of the system of a preferred embodiment. The system shown in Fig. 2 has one chassis, which contains: (1) an IP backplane bus; (2) 4 nodes connected by the IP backplane bus, PPB1, PPB2, PPB3 and PPBn; (3) a switching unit (SEM) connected to the IP backplane bus; (4) a HAM, which is connected to the SEM through the IP backplane bus and also through an independent control channel, this independent control channel being a physically existing channel.
The nodes in the chassis can communicate with one another through the IP backplane bus. A typical IP backplane bus is a star-connected 100 Mbit/s Ethernet link.
As the switching unit, the SEM implements the layer 2 forwarding function defined in TCP/IP.
Fig. 3 shows the structure of the SEM. A switch chip on the SEM implements the layer 2 function and is responsible for forwarding layer 2 packets between network interfaces. A CPU on the SEM can read the state of the switch chip and control its internal data through an internal bus. The CPU communicates with the HAM through a separate network interface that it controls over the bus.
Through the connection provided by the IP backplane bus and the forwarding performed by the SEM, the HAM can communicate with each PPB. After receiving the registration message sent by a PPB, the HAM can monitor the running state of PPB1, PPB2, PPB3 through PPBn using heartbeat detection messages carried in IP packets.
The existing heartbeat detection mechanism can detect errors that occur on a node. The errors a component may suffer include software errors, hardware errors, operating system errors and network errors. A node with a software error may exit abnormally, or may be unable to respond to heartbeat messages because it is stuck in an endless loop or because the module that responds to heartbeat messages cannot obtain CPU resources. Hardware errors include fluctuation of the node's input voltage that momentarily drops below the rated value. Operating system errors include CPU exceptions. After a hardware error or an operating system error occurs, the usual handling by the hardware or operating system is to reset the whole node, including the heartbeat response module running on it. The usual reset operation is a hardware reset, and its most direct manifestation is that the node restarts abnormally. Network errors include failure of a network interface; when this occurs, the node can neither receive the heartbeat detection messages sent by the HAM nor respond to them.
However, after a node suffers an error and restarts abnormally, the existing heartbeat mechanism cannot detect the error rapidly, which reduces the availability of the system. As described above, hardware and operating system errors can cause a node to restart, and a manual reset of the node, a software reset of the node and network errors can all cause a node to restart as well. When these errors occur on a node, the HAM still has to send heartbeat messages to the node, and can determine that an error has occurred only after several consecutive heartbeats have gone unanswered.
In the present invention, by using the physical characteristics of the port, the HAM can rapidly detect the abnormal restart of a node and thereby effectively improve the availability of the system. When the physical state (PHY) of a network interface directly connected to the switch chip on the SEM changes, the switch chip detects the change and notifies the CPU on the SEM. The PHY has two states: running (UP) and stopped (DOWN). When the PHY is in the DOWN state, the network interface cannot send or receive packets. When a node restarts abnormally or the node's network interface fails, the PHY state changes from UP to DOWN. The CPU on the SEM reports this to the HAM through the control channel connected to the HAM. The HAM determines that an error has occurred and carries out the subsequent error recovery actions.
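The SEM-side behaviour described above, reporting only a change from UP to DOWN, can be sketched as a small edge detector. This is a minimal illustration under assumptions: the class and method names are hypothetical, the switch chip's notification is modelled as a plain method call, and the control channel to the HAM is represented by a callback.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class LinkState(Enum):
    UP = "UP"
    DOWN = "DOWN"


class SwitchUnitCpu:
    """Models the CPU on the SEM that relays PHY changes to the HAM."""

    def __init__(self, notify_ham: Callable[[str], None]) -> None:
        # notify_ham stands in for a message on the independent control channel.
        self._notify_ham = notify_ham
        self._last_state: Dict[str, LinkState] = {}

    def on_phy_state(self, port: str, state: LinkState) -> None:
        """Called when the switch chip reports the PHY state of a port."""
        previous: Optional[LinkState] = self._last_state.get(port)
        self._last_state[port] = state
        # Report only the transition from running (UP) to stopped (DOWN).
        if previous is LinkState.UP and state is LinkState.DOWN:
            self._notify_ham(port)


# Usage: the port facing PPB1 comes up, then drops when the node restarts.
reported = []
sem = SwitchUnitCpu(notify_ham=reported.append)
sem.on_phy_state("PPB1", LinkState.UP)
sem.on_phy_state("PPB1", LinkState.DOWN)
sem.on_phy_state("PPB1", LinkState.DOWN)  # still DOWN: no second report
print(reported)
```

Tracking the previous state per port keeps a port that has never come up, or one that simply stays down, from generating reports.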
The physical characteristics of a network port include mechanical and electrical characteristics. Reference may be made to the IEEE 802.3 family of standards, in which clauses 23 to 26, 32, 36 and 40 describe in detail the mechanical and electrical characteristics of the network interface physical state (PHY). Section 23.4.3.5, on page 685 of the standard, describes a flow chart for the detection of remote errors. From the above disclosure, those skilled in the art are able to implement detection of the physical characteristics of the network port.
Fig. 4 shows the flow of the heartbeat detection method described by the present invention. As shown in the figure, the HAM starts the heartbeat detection timer after receiving the registration message of a node. At this point, a heartbeat timing counter within the heartbeat detection timer and an unanswered-heartbeat counter are also started.
The heartbeat detection timer then periodically sends heartbeat messages to the registered node. Each time, the heartbeat timing counter is cleared and starts timing. The subsequent operations are as follows:
If the HAM receives an event reported by the SEM indicating that the state of the corresponding network interface on the switch chip has changed to DOWN, the HAM assumes that the node's network has failed and that the node cannot possibly respond to heartbeat messages, so error recovery must be performed on the node. After recovery, the "unanswered count" is reset to zero.
Here, the "error recovery operation" is a series of operations defined at design time. For example, in a typical CompactPCI chassis conforming to the PICMG definition, error recovery includes: 1) hardware reset of the node by the HAM (a third party outside the node). This differs from the node resetting itself after the hardware and operating system errors described above: a third-party reset requires a hardware reset pin connection between the third party and the node, which is defined in the CompactPCI standard; 2) notifying the system administrator through various alarm modes, including audible and visual alarms. This mode is especially important in systems where the HAM cannot hardware-reset the node, and it requires the system administrator to operate on the node manually, for example by manually restarting or replacing the node.
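One way to organize the two recovery modes is sketched below. The policy shown, reset when a reset line exists and otherwise raise an alarm for manual intervention, is an assumption chosen for illustration; the text only lists both modes as possible parts of error recovery. All names are hypothetical.

```python
from typing import Callable, List, Optional


def recover_node(node: str,
                 hardware_reset: Optional[Callable[[str], None]],
                 raise_alarm: Callable[[str], None]) -> List[str]:
    """Perform error recovery on `node` and return the actions taken.

    hardware_reset is None when the HAM has no reset line to the node,
    in which case an administrator has to be alerted instead.
    """
    actions: List[str] = []
    if hardware_reset is not None:
        hardware_reset(node)   # mode 1: third-party hardware reset by the HAM
        actions.append("hardware_reset")
    else:
        raise_alarm(node)      # mode 2: audible/visual alarm for manual action
        actions.append("alarm")
    return actions


# Usage: PPB1 has a reset line to the HAM, PPB2 does not.
log = []
print(recover_node("PPB1", lambda n: log.append(("reset", n)),
                   lambda n: log.append(("alarm", n))))
print(recover_node("PPB2", None, lambda n: log.append(("alarm", n))))
```

A deployment could equally raise an alarm in both cases; the point of the sketch is only that the recovery operation is a design-time choice kept separate from detection.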
If the node returns a heartbeat reply within a predetermined time, the HAM considers that the node is running well and clears the counter that records the node's unanswered count.
If the node does not return a heartbeat reply within the predetermined time, that is, the timer expires, the HAM increases the unanswered counter by 1.
The unanswered count is then compared with the maximum number of retries. If the unanswered count is greater than the maximum number of retries, the HAM can assume that, although the node's network is running well, the node may never get the chance to respond to heartbeat messages. The HAM therefore performs error recovery on the node. After recovery, the unanswered count is reset to zero.
If the state of the corresponding network interface on the switch chip is UP and the unanswered count is less than the maximum number of retries, the HAM assumes that the node's network is running well and that the node has merely not had free time to respond to the heartbeat message. The HAM therefore sends a heartbeat message again and restarts the heartbeat detection timer.
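The flow of Fig. 4 combines two triggers for recovery: an immediate one driven by the link-down report from the SEM, and a slower one driven by the unanswered count. The event-driven sketch below illustrates how the HAM-side bookkeeping for one node might look. The class and method names are hypothetical, and timers and message transport are assumed to be handled by the caller.

```python
from typing import Callable


class NodeMonitor:
    """HAM-side state for one registered node."""

    def __init__(self, max_retries: int, recover: Callable[[], None]) -> None:
        self.max_retries = max_retries
        self.recover = recover
        self.unanswered = 0

    def on_link_down(self) -> None:
        """SEM reported the node's port as DOWN: recover without waiting."""
        self.recover()
        self.unanswered = 0

    def on_heartbeat_reply(self) -> None:
        """Reply arrived in time: the node is considered to be running well."""
        self.unanswered = 0

    def on_heartbeat_timeout(self) -> bool:
        """Timer expired with no reply.

        Returns True if the caller should send another heartbeat and
        restart the timer, False if recovery was performed instead.
        """
        self.unanswered += 1
        if self.unanswered > self.max_retries:
            self.recover()
            self.unanswered = 0
            return False
        return True


# Usage: one missed heartbeat, then the SEM reports the link as DOWN.
events = []
monitor = NodeMonitor(max_retries=3, recover=lambda: events.append("recover"))
monitor.on_heartbeat_timeout()   # count is 1, keep retrying
monitor.on_link_down()           # recovery happens immediately
print(events, monitor.unanswered)
```

In this arrangement a node that restarts abnormally is recovered on the first link-down report, while a node whose link stays UP but whose software is stuck is still caught by the heartbeat count.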
Compared with existing heartbeat detection methods, the present invention can rapidly detect the error of a managed node after the node restarts abnormally, and thereby improves the high availability of the system.

Claims (8)

1. A method for rapidly detecting node errors in a high-availability system, the system comprising an IP backplane bus, the IP backplane bus being connected to a switching unit (SEM), one or more nodes (PPB) and a high availability management board (HAM), the nodes and the high availability management board (HAM) each communicating with the switching unit (SEM), and the high availability management board (HAM) also communicating with the switching unit (SEM) through an independent control channel,
the method comprising the steps of:
(a) the switching unit (SEM) detecting the physical state of its network interfaces, wherein each network interface is connected to a respective node (PPB);
(b) when the switching unit detects that the physical state of a network interface changes from a first predetermined state to a second predetermined state, the switching unit (SEM) sending a predetermined signal to the high availability management unit (HAM) through the independent control channel;
(c) the high availability management board (HAM) performing a predetermined operation on the node according to the received predetermined signal;
wherein the first predetermined state detected in the detecting step is the running state of the network interface, and the second predetermined state is the stopped state of the network interface.
2. The method according to claim 1, wherein the high availability management board (HAM) simultaneously performs heartbeat detection on each node, the heartbeat detection comprising the following steps:
(i) sending a heartbeat message to the node under test;
(ii) when no reply from the node is received within a predetermined time, increasing the cumulative number of unanswered heartbeats by 1;
(iii) comparing the number of unanswered heartbeats with a predetermined maximum number of retries; if the number of unanswered heartbeats is greater than the predetermined maximum number of retries, the high availability management board performs the predetermined operation; if the number of unanswered heartbeats is not greater than the predetermined maximum number of retries, returning to step (i);
(iv) if a reply from the node is received within the predetermined time, resetting the cumulative number of unanswered heartbeats to zero and then returning to step (i).
3. The method according to claim 1, wherein the predetermined operation performed by the high availability management board (HAM) is an error recovery operation on the node.
4. The method according to claim 1, wherein the running state and the stopped state of the network interface are obtained by detecting the mechanical and electrical characteristics of the network interface.
5. A high-availability system capable of rapidly detecting node errors, comprising:
an IP backplane bus,
one or more nodes (PPB) connected to the IP backplane bus,
a high availability management board (HAM) connected to the IP backplane bus,
a switching unit (SEM) connected to the IP backplane bus, wherein the nodes and the high availability management board (HAM) can each communicate with the switching unit (SEM) through the IP backplane bus, and the switching unit (SEM) also has an independent control channel connected to the high availability management board (HAM),
the switching unit comprising:
detection means for detecting the physical state of its network interfaces, wherein each network interface is connected to a respective node (PPB);
sending means for sending a predetermined signal to the high availability management unit (HAM) through the independent control channel when the detection means detects that the physical state of a network interface changes from a first predetermined state to a second predetermined state;
the high availability management board comprising:
control means for performing a predetermined operation on the node according to the received predetermined signal;
wherein the first predetermined state detected is the running state of the network interface, and the second predetermined state is the stopped state of the network interface.
6. The system according to claim 5, wherein the high availability management board further comprises a heartbeat detection timer, the heartbeat detection timer performing the following operations:
(i) sending a heartbeat message to the node under test;
(ii) when no reply from the node is received within a predetermined time, increasing the cumulative number of unanswered heartbeats by 1;
(iii) comparing the number of unanswered heartbeats with a predetermined maximum number of retries; if the number of unanswered heartbeats is greater than the predetermined maximum number of retries, the high availability management board performs the predetermined operation; if the number of unanswered heartbeats is not greater than the predetermined maximum number of retries, returning to step (i);
(iv) if a reply from the node is received within the predetermined time, resetting the cumulative number of unanswered heartbeats to zero and then returning to step (i).
7. The system according to claim 5, wherein the predetermined operation performed by the high availability management board (HAM) is an error recovery operation on the node.
8. The system according to claim 5, wherein the running state and the stopped state of the network interface are obtained by detecting the mechanical and electrical characteristics of the network interface.
CNB2004100901589A 2004-10-29 2004-10-29 Method for rapidly detecting error in high availability system Expired - Fee Related CN100394730C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100901589A CN100394730C (en) 2004-10-29 2004-10-29 Method for rapidly detecting error in high availability system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100901589A CN100394730C (en) 2004-10-29 2004-10-29 Method for rapidly detecting error in high availability system

Publications (2)

Publication Number Publication Date
CN1767455A CN1767455A (en) 2006-05-03
CN100394730C true CN100394730C (en) 2008-06-11

Family

ID=36743074

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100901589A Expired - Fee Related CN100394730C (en) 2004-10-29 2004-10-29 Method for rapidly detecting error in high availability system

Country Status (1)

Country Link
CN (1) CN100394730C (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101227393B (en) * 2008-01-08 2011-01-05 中兴通讯股份有限公司 Method and apparatus for anti-error connecting of switch stack
US9172120B2 (en) * 2010-07-14 2015-10-27 Sinoelectric Powertrain Corporation Battery pack fault communication and handling
CN103068095A (en) * 2011-10-21 2013-04-24 扬升照明股份有限公司 Illuminating system and control method thereof
CN105430701A (en) * 2015-10-28 2016-03-23 努比亚技术有限公司 Wireless network intelligent control method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09275412A (en) * 1996-04-05 1997-10-21 Hitachi Ltd Inter-network equipment and network system
JPH10271153A (en) * 1997-03-24 1998-10-09 Nec Corp Package fault information transfer system
JP2000269967A (en) * 1999-03-16 2000-09-29 Sony Corp Network system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080611

Termination date: 20151029

EXPY Termination of patent right or utility model