CN103957130A - Fault detection and recovery method and system - Google Patents

Info

Publication number
CN103957130A
Authority
CN
China
Prior art keywords
register
subcard
sign
fault
fault detect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410138065.2A
Other languages
Chinese (zh)
Other versions
CN103957130B (en)
Inventor
杨庆辰
秦佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Maipu Communication Technology Co Ltd
Original Assignee
Maipu Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Maipu Communication Technology Co Ltd
Priority to CN201410138065.2A
Publication of CN103957130A
Application granted
Publication of CN103957130B
Legal status: Active
Anticipated expiration

Abstract

The invention relates to modular communication network equipment and discloses a fault detection and recovery method and system. The method comprises the following steps: (a) a first identifier is written into a first register and a second identifier is written into a second register; (1b) the first identifier in the first register is detected; (1c) it is determined whether the first identifier has changed; a change indicates that the daughter card has failed, in which case step 1d is carried out, and otherwise step 1b is executed again after a set time has elapsed; (1d) the daughter card is unloaded and reloaded; (2b) the second identifier in the second register is detected; (2c) it is determined whether the second identifier has changed; a change indicates that the daughter card has failed, in which case step 2d is carried out, and otherwise step 2b is executed again after a set time has elapsed; (2d) the physical medium dependent (PMD) layer interface of the daughter card is restarted. Steps 1b to 1d and steps 2b to 2d are executed in parallel or one after another. The invention further discloses a fault detection and recovery system. The method and system shorten the service interruption caused by a fault and reduce labor cost.

Description

Fault detection and recovery method and system
Technical field
The present invention relates to the field of network communication, and in particular to a method for automatically detecting and recovering from subcard faults in modular communication network equipment.
Background art
The network equipment addressed by the present invention includes switches, routing devices and the like. The advent of modular subcards has promoted the reuse of communication equipment: for example, a single switch can use modular subcards with different functions to provide different services, instead of being replaced by a different type of switch whenever the service changes. Because of the nature of subcards, however, their maintenance has become a major problem in their use.
To modularize the switching ports of a network device, the switch chip and the physical-layer (PHY) chips of the device are sometimes placed on different printed circuit boards (PCBs). The board carrying the switch chip and the CPU (central processing unit) is generally called the main card, and a board carrying only PHY chips is called a subcard.
The PHY chips on the subcard are usually powered by the main card. The switch chip on the main card manages the registers of the PHY chips over the MDIO (Management Data Input/Output) channel, and the CPU on the main card manages the physical state of the subcard, such as power supply and temperature, through the CPLD (complex programmable logic device) on the subcard.
Because of these characteristics, two classes of fault can interrupt service in actual use. In the first, a voltage dip on the power supply causes the subcard to power down and restart while the main card does not. The problem is that the system is unaware of the subcard's power loss: the PHY chips on the subcard have returned to their initial state after losing and regaining power, so the actual register contents on the subcard no longer match the management configuration stored in the system. In the second, after a fairly severe electrical surge, a port link on a PHY chip fails and cannot be brought back up by re-enabling the link. The cause is that the PMD (Physical Medium Dependent) layer interface of the PHY chip, having been affected by the surge, can no longer correctly recognize the physical signal sent by the peer device, so the port link fails to come up.
Because a subcard carries part of the service traffic, a fault affects the customer's services for a long time, and repairing it also requires manual effort. To keep user data ports stable while reducing the cost of manual repair, an effective means of quickly locating and resolving subcard faults is needed.
Summary of the invention
The object of the present invention is to provide a fault detection and recovery method that automatically detects and repairs subcard faults and ensures stable transmission of service data.
The technical solution of the present invention is a fault detection and recovery method comprising the following steps:
a. writing a first identifier into a first register, and writing a second identifier into a second register;
1b. detecting the first identifier in the first register;
1c. determining whether the first identifier has changed; if so, the subcard is faulty and the method proceeds to step 1d; otherwise, returning to step 1b after waiting for a set time;
1d. performing an unload and reload action on the subcard;
2b. detecting the second identifier in the second register;
2c. determining whether the second identifier has changed; if so, the subcard is faulty and the method proceeds to step 2d; otherwise, returning to step 2b after waiting for a set time;
2d. performing a restart of the physical medium dependent (PMD) layer interface on the subcard.
Steps 1b to 1d and steps 2b to 2d are executed in parallel or one after the other.
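The steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the patented implementation: the dictionary `regs` stands in for real CPLD and PHY register access, and the marker values `FIRST_ID` and `SECOND_ID` and all function names are assumptions introduced here.

```python
# Illustrative sketch only: a dict stands in for hardware register access.
FIRST_ID = 0xA5    # assumed marker value for the first (CPLD) register
SECOND_ID = 0x5A   # assumed marker value for the second (PHY chip) register

def write_identifiers(regs):
    """Step a: write both identifiers."""
    regs["first"] = FIRST_ID
    regs["second"] = SECOND_ID

def reload_subcard(regs):
    # Stand-in for step 1d; assumed here to rewrite the first identifier.
    regs["first"] = FIRST_ID

def restart_pmd(regs):
    # Stand-in for step 2d; assumed here to restore the second identifier.
    regs["second"] = SECOND_ID

def poll_once(regs):
    """One sequential pass: steps 1b-1d first, then steps 2b-2d."""
    actions = []
    if regs["first"] != FIRST_ID:        # steps 1b and 1c
        reload_subcard(regs)             # step 1d
        actions.append("reload")
    if regs["second"] != SECOND_ID:      # steps 2b and 2c
        restart_pmd(regs)                # step 2d
        actions.append("pmd_restart")
    return actions

regs = {}
write_identifiers(regs)
regs["first"] = 0x00       # simulate the register falling back to a default value
print(poll_once(regs))     # ['reload']
```

A real system would call something like `poll_once` repeatedly, waiting for the set time between passes.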
Further, in step 1d, the unload and reload action is performed on the subcard 3 times.
Further, after step 1d is completed, the following step is performed:
1e. detecting whether the fault has been recovered; if so, returning to step 1b after waiting for the set time; otherwise, recording a fault log.
Correspondingly, in step 2d, the restart of the PMD layer interface is performed on the subcard 3 times.
Further, after step 2d is completed, the following step is performed:
2e. detecting whether the fault has been recovered; if so, returning to step 2b after waiting for the set time; otherwise, recording a fault log.
Specifically, the first register is a CPLD register.
Specifically, the second register is a PHY chip register.
Another object of the present invention is to provide a fault detection and recovery system comprising an initialization module, a register detection module and an automatic repair module.
The initialization module is configured to initialize the system by writing the first identifier into the first register and the second identifier into the second register.
The register detection module is configured to detect the first identifier in the first register and the second identifier in the second register and to determine whether either identifier has changed; if so, the subcard is faulty; otherwise, detection continues after a certain interval.
The automatic repair module is configured to recover the subcard from a fault by performing an unload and reload action on the subcard or a restart of the PMD on the subcard.
Further, the automatic repair module records a fault log if recovery has been unsuccessful 3 times.
Specifically, the first register is a CPLD register and the second register is a PHY chip register.
The beneficial effect of the invention is that it solves the problems of the modular subcards of conventional network equipment, which are prone to failure in harsh environments, difficult to maintain, and whose faults affect customer services for a long time. Users no longer need to repair subcard faults manually: fault detection and recovery are completed automatically, which shortens the service interruption caused by a fault and saves labor cost.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2 is a schematic diagram of the system structure of the present invention.
Detailed description
The technical solution of the present invention is described in detail below with reference to the drawings and specific embodiments.
The subcard faults addressed by the present invention are of two types: power-down faults and port link connection faults. The corresponding fault recovery likewise targets these two types. Normally, when a subcard is inserted into the system, the system management module writes a first identifier value into a CPLD register on the subcard. While the subcard remains powered, the value in this register does not change. If the value does change (for example, to the chip's default value), the subcard has lost power. The subcard must then be unloaded and reloaded at the software level so that the management information held by the system management layer is loaded onto the subcard again; only then is the content of that management information written into the relevant management registers of the PHY chips on the subcard and brought into effect. For port link faults, when a link failure occurs on a switching port of the subcard that should be operating normally, the PHY chip status register is read to determine whether a signal is being received from the peer and whether the link was shut down manually. If the link was not shut down manually, a fault has occurred; the PMD on the subcard is then restarted once to return it to normal, which in turn restores the whole data port.
As shown in Fig. 1, the specific steps of the subcard fault detection and recovery method of the present invention are as follows:
Step a: write the first identifier into the first register, and write the second identifier into the second register.
The first and second registers differ between subcard types. The first identifier is used for power-down detection and the second for port link-failure detection. Here the first register is a CPLD register, and the identifier stored in it is the first identifier; the second register is a PHY chip register, and its identifier is the second identifier.
This step is the system's initialization procedure. When a subcard is inserted into the system, the corresponding subcard initialization comprises three parts. First, the system determines whether the inserted subcard is of a type that requires fault detection, and then adds the corresponding subcard information to the system management structure. Second, the detection identifiers are initialized: the first identifier, for example a particular value, is written into the subcard's CPLD register for power-down fault detection. Third, the second identifier is written into the PHY chip register for link-failure fault detection.
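The initialization procedure described above might look roughly like the following Python sketch. Everything concrete in it is hypothetical: the type name in `SUPPORTED_TYPES`, the marker values, the `Subcard` class and the `managed` dictionary standing in for the system management structure are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical values chosen only for this sketch.
SUPPORTED_TYPES = {"phy-subcard"}   # subcard types that need fault detection
POWER_MARKER = 0xA5                 # first identifier (CPLD register)
LINK_MARKER = 0x5A                  # second identifier (PHY chip register)

@dataclass
class Subcard:
    slot: int
    card_type: str
    cpld: dict = field(default_factory=dict)  # stand-in for CPLD registers
    phy: dict = field(default_factory=dict)   # stand-in for PHY chip registers

def init_subcard(card, managed):
    """Run the initialization when a subcard is inserted.

    Returns True if the subcard was registered for fault detection.
    """
    # Part 1: only some subcard types need fault detection.
    if card.card_type not in SUPPORTED_TYPES:
        return False
    managed[card.slot] = card          # add to the management structure
    # Part 2: marker for power-down detection.
    card.cpld["marker"] = POWER_MARKER
    # Part 3: marker for link-failure detection.
    card.phy["marker"] = LINK_MARKER
    return True

managed = {}
card = Subcard(slot=1, card_type="phy-subcard")
init_subcard(card, managed)
```

Keeping the type check first means unsupported subcards never receive marker writes, so their registers are left untouched.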
Subcard power-down fault detection and recovery steps:
Step 1b: detect the first identifier in the first register.
Step 1c: determine whether the first identifier has changed; if so, the subcard is faulty and the method proceeds to step 1d; otherwise, return to step 1b after waiting for a set time.
Step 1d: perform an unload and reload action on the subcard.
In this process, the system polls the value in the subcard's CPLD register. If the value is found to have changed, reverting to the initial value, a power-down has occurred, and the repair module is called to unload and reload the subcard. When this completes, the CPLD register is read again to determine whether the repair succeeded. If it succeeded, the next poll is carried out; if it failed, the unload and reload action can be repeated 3 times. If the subcard still cannot be recovered, a log is recorded for analysis in the interest of system reliability. The waiting interval can be set by the user.
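The retry behaviour in this paragraph can be sketched as below. This is a hedged illustration under one reading of the text, namely that at most three unload and reload attempts are made before a log entry is written; the callables `read_marker` and `reload_subcard` and the logger name are assumptions, not part of the patent.

```python
import logging

log = logging.getLogger("subcard")   # hypothetical logger name
MAX_RETRIES = 3                      # "3 times" in the description

def recover_power_down(read_marker, reload_subcard, expected,
                       max_retries=MAX_RETRIES):
    """Reload the subcard until the marker reads back correctly.

    Returns True on success, False after max_retries failed attempts.
    """
    for attempt in range(1, max_retries + 1):
        reload_subcard()                  # unload and reload (step 1d)
        if read_marker() == expected:     # re-read the CPLD register (step 1e)
            return True
        log.warning("reload attempt %d did not restore the marker", attempt)
    log.error("subcard not recovered after %d attempts", max_retries)
    return False

def check_power(read_marker, reload_subcard, expected):
    """One poll of the power-down detector (steps 1b and 1c)."""
    if read_marker() == expected:
        return "ok"
    if recover_power_down(read_marker, reload_subcard, expected):
        return "recovered"
    return "failed"
```

A caller would invoke `check_power` once per polling interval and continue polling after "ok" or "recovered".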
Subcard port link-failure fault detection and recovery steps:
Step 2b: detect the second identifier in the second register.
Step 2c: determine whether the second identifier has changed; if so, the subcard is faulty and the method proceeds to step 2d; otherwise, return to step 2b after waiting for a period of time t2.
Step 2d: perform a restart of the PMD layer on the subcard.
In this process, the system polls the PHY chip register status of each port of the PHY chips on the subcard. It first obtains the port state and checks whether the link is up or has failed. If the link has failed, it checks whether a signal is being received from the peer. If a signal from the peer is being received, it reads the PHY chip register status to determine whether the link was shut down manually. If the link was not shut down manually, the fault condition is met. To avoid performing a repair while the port is in an unstable intermediate state, the check is carried out a specified number of times within a fixed period; only if every check meets the fault condition is a fault considered to have really occurred. The port's repair flag is then set to valid, indicating that a repair is about to be performed, and the PMD is restarted. Afterwards the port state is read again: if it is normal, the repair flag is set to invalid, indicating that the repair succeeded, and no further repair is performed once the port state has recovered. If the port state has still not recovered after three repair attempts, the problem does not lie in the port's PMD, and a corresponding error log can be recorded for the sake of system maintainability. The polling interval of this process can likewise be set by the user.
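The link-failure check and its repeated confirmation can be illustrated with a small Python sketch. It is an assumption-laden outline, not the patent's code: the `PortStatus` fields, the `flags` dictionary holding the repair flag, and the default counts are all introduced here, and the waiting between checks is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class PortStatus:
    link_up: bool          # link connection normal?
    signal_detected: bool  # signal received from the peer?
    admin_down: bool       # link shut down manually?

def fault_condition(s):
    """Link failed, peer signal present, and not a manual shutdown."""
    return (not s.link_up) and s.signal_detected and (not s.admin_down)

def confirmed_fault(read_status, checks=3):
    """Require every one of `checks` consecutive reads to show the fault."""
    return all(fault_condition(read_status()) for _ in range(checks))

def repair_port(read_status, restart_pmd, flags, max_restarts=3, checks=3):
    """Confirm the fault, then restart the PMD up to max_restarts times."""
    if not confirmed_fault(read_status, checks):
        return "no_fault"
    flags["repairing"] = True          # repair flag set to valid
    for _ in range(max_restarts):
        restart_pmd()
        if read_status().link_up:
            flags["repairing"] = False # repair flag set to invalid: success
            return "repaired"
    return "failed"                    # caller would record an error log
```

Requiring every check to agree is what keeps a port that is merely renegotiating its link from triggering a PMD restart.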
In the present invention, the subcard power-down fault detection and recovery steps (steps 1b to 1d) and the subcard port link-failure fault detection and recovery steps (steps 2b to 2d) can be executed in parallel or one after the other. When they are executed one after the other, power-down fault detection and recovery is normally performed first, followed by port link-failure fault detection and recovery.
When a subcard is removed, the system performs the unload procedure: it first stops the corresponding polling, then sets the corresponding detection identifier in the management module to invalid, indicating that the subcard is no longer in place and will no longer be checked.
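One way to picture the parallel execution and the stop-on-removal behaviour is with two polling threads that share a stop event. The sketch below is only an assumption about how this could be arranged in Python; the class `SubcardMonitor`, its `enabled` attribute and the method `on_removed` are hypothetical names.

```python
import threading

def start_poller(check, interval_s, stop):
    """Call check() every interval_s seconds until stop is set."""
    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            check()
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread

class SubcardMonitor:
    """Runs the power-down and link-failure checks in parallel."""

    def __init__(self, power_check, link_check, interval_s=1.0):
        self.stop = threading.Event()
        self.enabled = True   # detection identifier in the management module
        self.threads = [
            start_poller(power_check, interval_s, self.stop),
            start_poller(link_check, interval_s, self.stop),
        ]

    def on_removed(self):
        """Subcard pulled out: stop polling first, then mark detection invalid."""
        self.stop.set()
        for thread in self.threads:
            thread.join()
        self.enabled = False
```

Using a shared event rather than a plain sleep lets both pollers exit promptly when the subcard is removed, instead of waiting out the rest of an interval.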
The structure of the fault detection and recovery system of the present invention is shown in Fig. 2. It comprises an initialization module, a register detection module and an automatic repair module.
The initialization module is configured to initialize the system by writing the first identifier into the first register and the second identifier into the second register. The first register is a CPLD register and the second register is a PHY chip register.
The register detection module is configured to detect the first identifier in the first register and the second identifier in the second register and to determine whether either identifier has changed; if so, the subcard is faulty; otherwise, detection continues after a certain interval.
The automatic repair module is configured to recover the subcard from a fault by performing an unload and reload action on the subcard or a restart of the PMD layer on the subcard. The automatic repair module records a fault log if recovery has been unsuccessful 3 times.

Claims (10)

1. A fault detection and recovery method, comprising the following steps:
a. writing a first identifier into a first register, and writing a second identifier into a second register;
1b. detecting the first identifier in the first register;
1c. determining whether the first identifier has changed; if so, the subcard is faulty and the method proceeds to step 1d; otherwise, returning to step 1b after waiting for a set time;
1d. performing an unload and reload action on the subcard;
2b. detecting the second identifier in the second register;
2c. determining whether the second identifier has changed; if so, the subcard is faulty and the method proceeds to step 2d; otherwise, returning to step 2b after waiting for a set time;
2d. performing a restart of the physical medium dependent (PMD) layer interface on the subcard;
wherein steps 1b to 1d and steps 2b to 2d are executed in parallel or one after the other.
2. The fault detection and recovery method according to claim 1, characterized in that, in step 1d, the unload and reload action is performed on the subcard 3 times.
3. The subcard fault detection and recovery method according to claim 2, characterized in that, after step 1d is completed, the following step is performed:
1e. detecting whether the fault has been recovered; if so, returning to step 1b after waiting for the set time; otherwise, recording a fault log.
4. The fault detection and recovery method according to claim 1, characterized in that, in step 2d, the restart of the PMD layer interface is performed on the subcard 3 times.
5. The fault detection and recovery method according to claim 4, characterized in that, after step 2d is completed, the following step is performed:
2e. detecting whether the fault has been recovered; if so, returning to step 2b after waiting for the set time; otherwise, recording a fault log.
6. The fault detection and recovery method according to claim 1, characterized in that the first register is a CPLD register.
7. The fault detection and recovery method according to claim 1, characterized in that the second register is a PHY chip register.
8. A fault detection and recovery system, comprising an initialization module, a register detection module and an automatic repair module;
the initialization module being configured to initialize the system by writing a first identifier into a first register and a second identifier into a second register;
the register detection module being configured to detect the first identifier in the first register and the second identifier in the second register and to determine whether either identifier has changed; if so, the subcard is faulty; otherwise, detection continues after waiting for a set time;
the automatic repair module being configured to recover the subcard from a fault by performing an unload and reload action on the subcard or a restart of the PMD on the subcard.
9. The fault detection and recovery system according to claim 8, characterized in that the automatic repair module records a fault log if recovery has been unsuccessful 3 times.
10. The fault detection and recovery system according to claim 8, characterized in that the first register is a CPLD register and the second register is a PHY chip register.
CN201410138065.2A 2014-04-08 2014-04-08 Fault detection and recovery method and system Active CN103957130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410138065.2A CN103957130B (en) 2014-04-08 2014-04-08 Fault detection and recovery method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410138065.2A CN103957130B (en) 2014-04-08 2014-04-08 Fault detection and recovery method and system

Publications (2)

Publication Number Publication Date
CN103957130A (en) 2014-07-30
CN103957130B (en) 2017-07-18

Family

ID=51334360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410138065.2A Active CN103957130B (en) 2014-04-08 2014-04-08 Fault detection and recovery method and system

Country Status (1)

Country Link
CN (1) CN103957130B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106375114A (en) * 2016-08-26 2017-02-01 迈普通信技术股份有限公司 Hot plug fault recovery method and distributed device
CN110083484A (en) * 2018-01-26 2019-08-02 阿里巴巴集团控股有限公司 FPGA reloads method, equipment, storage medium and system
CN110784376A (en) * 2019-10-25 2020-02-11 北京东土军悦科技有限公司 Equipment with Ethernet PHY register detection function, detection method and device
CN114142471A (en) * 2021-11-29 2022-03-04 江苏科技大学 Ship integrated power system reconstruction method considering communication faults

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1422041A (en) * 2001-11-26 2003-06-04 深圳市中兴通讯股份有限公司上海第二研究所 Back-up system based on fast Ethernet bus
KR100454971B1 (en) * 2002-10-31 2004-11-06 삼성전자주식회사 Network interface line card system
CN101645915A (en) * 2008-08-06 2010-02-10 中兴通讯股份有限公司 Disk array host channel daughter card, on-line switching system and switching method thereof

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106375114A (en) * 2016-08-26 2017-02-01 迈普通信技术股份有限公司 Hot plug fault recovery method and distributed device
CN110083484A (en) * 2018-01-26 2019-08-02 阿里巴巴集团控股有限公司 FPGA reloads method, equipment, storage medium and system
CN110784376A (en) * 2019-10-25 2020-02-11 北京东土军悦科技有限公司 Equipment with Ethernet PHY register detection function, detection method and device
CN114142471A (en) * 2021-11-29 2022-03-04 江苏科技大学 Ship integrated power system reconstruction method considering communication faults
CN114142471B (en) * 2021-11-29 2023-08-18 江苏科技大学 Ship comprehensive power system reconstruction method considering communication faults

Also Published As

Publication number Publication date
CN103957130B (en) 2017-07-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: No. 16 Jiuxing Avenue, High-tech Zone, Chengdu, Sichuan 610041

Patentee after: MAIPU COMMUNICATION TECHNOLOGY Co.,Ltd.

Address before: Maipu Building, No. 16 Jiuxing Avenue, High-tech Zone, Chengdu, Sichuan 610041

Patentee before: MAIPU COMMUNICATION TECHNOLOGY Co.,Ltd.

CP02 Change in the address of a patent holder