CN114510374A - Automatic recovery system and method for peripheral mounting failure

Info

Publication number
CN114510374A
Authority
CN
China
Prior art keywords
peripheral
equipment
self
bios
checking
Legal status
Pending
Application number
CN202111664656.XA
Other languages
Chinese (zh)
Inventor
陈小春
张超
朱立森
孙亮
王�琦
易祝兵
Current Assignee
CLP Technology (Beijing) Co., Ltd.
Original Assignee
CLP Technology (Beijing) Co., Ltd.
Priority date
2021-12-31
Filing date
2021-12-31
Publication date
2022-05-17
Application filed by CLP Technology (Beijing) Co., Ltd.
Priority to CN202111664656.XA
Publication of CN114510374A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1441 Resetting or repowering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an automatic recovery system and method for peripheral mounting failure, in the technical field of computer firmware. When the server is powered on, the system enumerates the peripheral devices into a list and self-checks them one by one. When the BIOS finds that a peripheral is mounted normally, it moves on to self-check the next device; when the BIOS finds that a peripheral cannot be mounted normally, it attempts automatic recovery by having the CPLD re-power that peripheral board card, and records the peripheral fault and the self-recovery outcome in a log. After all peripherals have been self-checked, all peripheral drivers are loaded successfully and the peripherals are enabled. The invention saves the time that the prior approach loses to restarting the computer after a peripheral mounting failure, in particular the time consumed when a server restarts automatically several times.

Description

Automatic recovery system and method for peripheral mounting failure
Technical Field
The invention belongs to the technical field of computer firmware, and particularly relates to an automatic recovery system and method for peripheral mounting failure.
Background
While the BIOS is running, some peripherals connected to motherboard slots of a computer (terminal) cannot be mounted normally because of problems in initializing software and hardware resources. Sometimes these peripherals can be restored to normal use by rebooting the computer. The conventional solution is as follows: the peripheral is inserted into a PCI Express (PCIe) slot; after the BIOS starts, it reads the PCI devices in the PCI scan phase or the BDS (Boot Device Selection) phase, and if the BIOS fails to detect the peripheral, the entire computer (system) is restarted. This method cannot guarantee that the peripheral will be detected on the second, third, or any later restart, which seriously reduces efficiency and equipment quality. This is especially true for servers: because a server has many memory modules, memory initialization at the start of each boot takes a long time, so repeated restarts seriously reduce availability.
Disclosure of Invention
In view of this, the present invention provides an automatic recovery system and method for peripheral mounting failure that saves the time lost to restarting the computer (system) when a peripheral fails to mount, especially on a server that would otherwise need several automatic restarts.
The automatic recovery system for peripheral mounting failure enumerates the peripherals into a list when the computer is powered on and self-checks them one by one. When the BIOS finds that a peripheral is mounted normally, it self-checks the next device; when the BIOS finds that a peripheral cannot be mounted normally, it attempts automatic recovery by having the CPLD re-power the peripheral board card, and records the peripheral fault and the self-recovery outcome in a log. After all peripherals have been self-checked, all peripheral drivers are loaded successfully and the peripherals are enabled.
Further, the automatic recovery system for peripheral mounting failure comprises a policy configuration module, a peripheral self-check module, a power-on restart module, and a bus protocol module;
the policy configuration module configures the automatic recovery policy for peripheral mounting failures; the configuration includes the types of peripherals on the computer that require automatic recovery, the maximum number of power cycles allowed for automatic recovery, and the power-on/power-off control method for an individual board card;
the power-on restart module re-powers the peripheral through the CPLD using a standard protocol;
the bus protocol module provides the BIOS's interface for external communication.
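The module split above can be pictured with a minimal Python sketch. It is not the patent's firmware: it models only the dependency the text describes, namely that the power-on restart module reaches the CPLD through the bus protocol module rather than directly. The class names, the `send(target, command, argument)` signature, and the `REPOWER` command are hypothetical.

```python
from typing import List, Protocol, Tuple


class BusProtocol(Protocol):
    """Hypothetical interface: the BIOS's channel to external parts such as the CPLD."""

    def send(self, target: str, command: str, argument: str) -> None: ...


class RecordingBus:
    """Test double that records every message instead of driving real hardware."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str, str]] = []

    def send(self, target: str, command: str, argument: str) -> None:
        self.messages.append((target, command, argument))


class PowerRestartModule:
    """Asks the CPLD, via the bus protocol module, to re-power one board card."""

    def __init__(self, bus: BusProtocol) -> None:
        self.bus = bus

    def repower(self, slot: str) -> None:
        # "REPOWER" is a hypothetical command name; the patent does not define one.
        self.bus.send("CPLD", "REPOWER", slot)
```

Keeping the CPLD behind an interface like this would let the recovery logic be exercised against a recording stub such as `RecordingBus` before it runs on real hardware.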
Further, the peripheral types in the policy configuration module must be entered in advance; they form the basis of the policy configuration.
Furthermore, the maximum number of power cycles for automatic recovery in the policy configuration module limits how many times the automatic recovery action may be executed, so that repeated ineffective attempts cannot leave the whole system unusable.
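The two policy rules above (peripheral types entered in advance, and a cap on power cycles) could be captured in a small configuration object. This is a minimal sketch, assuming that a device whose type was not entered in advance simply receives a recovery budget of zero; the type strings in the usage note are hypothetical, and the value 5 mirrors the five BIOS/CPLD exchanges mentioned in the detailed description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    # Peripheral types entered in advance; only these are eligible for auto-recovery.
    recoverable_types: frozenset[str]
    # Upper bound on CPLD power cycles per device, so a dead card cannot stall boot.
    max_power_cycles: int

    def __post_init__(self) -> None:
        if self.max_power_cycles < 0:
            raise ValueError("max_power_cycles must be non-negative")

    def initial_budget(self, peripheral_type: str) -> int:
        """Number of recovery attempts granted to a device of this type at boot."""
        if peripheral_type in self.recoverable_types:
            return self.max_power_cycles
        return 0  # assumption: unconfigured types are never power-cycled
```

For example, `RecoveryPolicy(frozenset({"RAID", "GPU"}), max_power_cycles=5).initial_budget("USB")` returns 0, so under this assumption a USB controller that fails to mount would be logged and skipped rather than power-cycled.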
Further, the power-on/power-off control method for a single board card in the policy configuration module determines, according to the peripheral type, at which stage of BIOS start-up the fault recovery action is executed, and gives the CPLD a corresponding execution time. The peripheral self-check module performs a power-on self-check on the peripherals: once the peripherals are powered on, it obtains the peripheral list and checks each device in turn by sending it a self-check instruction. A normal device returns a preset signal indicating that it is working normally; otherwise the device returns a failure error code or does not respond to the self-check instruction.
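The three possible answers to a self-check instruction can be made explicit as follows. This is a sketch under stated assumptions: the raw reply is modeled as an integer status, a missing reply (for example after a timeout) as `None`, and `STATUS_OK = 0x00` is a hypothetical value, since the text only says that the normal signal is preset.

```python
from enum import Enum
from typing import Optional


class SelfCheckOutcome(Enum):
    NORMAL = "normal"            # preset "working normally" signal returned
    ERROR_CODE = "error_code"    # a failure error code returned
    NO_RESPONSE = "no_response"  # no answer to the self-check instruction


# Hypothetical value of the preset "working normally" status; the text gives none.
STATUS_OK = 0x00


def classify_response(response: Optional[int]) -> SelfCheckOutcome:
    """Map a raw self-check reply (None means no reply arrived) to an outcome."""
    if response is None:
        return SelfCheckOutcome.NO_RESPONSE
    if response == STATUS_OK:
        return SelfCheckOutcome.NORMAL
    return SelfCheckOutcome.ERROR_CODE


def needs_recovery(outcome: SelfCheckOutcome) -> bool:
    """Both failure forms are treated alike: the device becomes a recovery candidate."""
    return outcome is not SelfCheckOutcome.NORMAL
```

Treating "error code" and "no response" the same way keeps the recovery decision simple, while the distinct enum values still let the boot log record which kind of failure occurred.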
The automatic recovery method for peripheral mounting failure comprises the following steps:
Step one: the computer is powered on and starts running the BIOS (Basic Input/Output System);
Step two: in the BIOS initialization stage, the automatic recovery system for peripheral mounting failure is loaded, so that the BIOS and the CPLD are interconnected and can communicate, and the state of the system devices is monitored;
Step three: the BIOS scans the peripheral bus, enumerates the peripherals, generates a device list, and self-checks the devices one by one;
Step four: determine whether any peripheral has not yet been mounted successfully. If so, go to the next step; if not, go to step eleven to end the process;
Step five: send a self-check instruction to the device through the preset interface of the current peripheral;
Step six: determine whether the device returns a correct self-check result;
Step seven: record the device's abnormal start-up in the boot log;
Step eight: determine whether the device's remaining number of automatic recovery attempts is at least 1. If so, go to the next step; if not, skip the self-check of the current peripheral, move on to the next device, and return to step four;
Step nine: the BIOS sends a command to the CPLD to re-power the peripheral board card, and the CPLD powers the peripheral up again according to the preset power-on timing sequence;
Step ten: decrease the device's remaining number of automatic recovery attempts by 1 and return to step four;
Step eleven: end the process.
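Steps one to eleven can be sketched as a loop over the enumerated device list. This is a minimal Python simulation, not firmware: `self_check` and `cpld_power_cycle` are hypothetical stand-ins for the device's self-check interface and the CPLD command, and `power_cycles_needed` exists only so the simulation can model a card that comes back after some number of re-powers. The text does not say what happens after a correct result in step six; the sketch assumes the device is marked as mounted and the loop returns to step four.

```python
from dataclasses import dataclass


@dataclass
class Peripheral:
    name: str
    remaining_recoveries: int     # counter checked in step eight, decremented in step ten
    mounted: bool = False
    power_cycles_needed: int = 0  # simulation only: re-powers before self-check passes


def self_check(dev: Peripheral) -> bool:
    """Steps five and six (simulated): True if the device answers correctly."""
    return dev.power_cycles_needed <= 0


def cpld_power_cycle(dev: Peripheral) -> None:
    """Step nine (simulated): the CPLD re-powers this one board card."""
    dev.power_cycles_needed -= 1


def recover_all(devices: list[Peripheral], boot_log: list[str]) -> None:
    pending = list(devices)                    # step three: enumerated device list
    while pending:                             # step four: any device not yet mounted?
        dev = pending[0]
        if self_check(dev):                    # steps five and six
            dev.mounted = True                 # assumption: success ends this device
            pending.pop(0)
            continue
        boot_log.append(f"{dev.name}: self-check failed")   # step seven
        if dev.remaining_recoveries >= 1:                    # step eight
            cpld_power_cycle(dev)                            # step nine
            dev.remaining_recoveries -= 1                    # step ten
        else:
            boot_log.append(f"{dev.name}: skipped, recovery budget exhausted")
            pending.pop(0)                     # skip it and check the next device
    # step eleven: the process ends and boot continues
```

With a RAID card that needs one re-power and a budget of 3, the card is recovered after a single CPLD power cycle and its budget drops to 2; a card that never responds is power-cycled until its budget reaches 0 and is then skipped, so the boot still continues.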
Beneficial effects:
1. When a peripheral mounting failure is found during the BIOS run, the automatic recovery system recovers the peripheral by re-powering that peripheral on its own. This saves the time that the prior art spends restarting the whole computer to repair the board card after a mounting failure, makes restoring the peripheral to normal operation more efficient, and avoids the time wasted on repeated restarts when a restart does not reliably recover the device. The advantage is especially pronounced on servers.
2. The power-on restart module can interact with the computer's power management chip through the bus protocol module to re-power the board card and thereby recover from the peripheral mounting failure, which avoids the long restart time of the existing approach of recovering by restarting the computer.
3. Because recovery is configured by peripheral type and by number of restarts, the method makes thorough but bounded automatic repair attempts on key peripherals, performing the necessary automatic repairs while reducing unnecessary peripheral detection and recovery steps.
Drawings
FIG. 1 is a diagram of the hardware connections of the automatic recovery system for peripheral mounting failure;
FIG. 2 is a diagram of the software structure of the automatic recovery system for peripheral mounting failure;
FIG. 3 is a flowchart of the automatic recovery method for peripheral mounting failure.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the hardware connections of the automatic recovery system for peripheral mounting failure, taking a server as an example. The BIOS, BMC, CPLD, and peripheral boards are connected to the PCH or south bridge. The CPLD manages the power-on timing sequence of the motherboard; the BIOS initializes the server hardware and boots the host operating system; the peripheral boards include graphics cards, RAID cards, optical fiber cards, and other devices.
FIG. 2 shows the software structure of the automatic recovery system for peripheral mounting failure. The automatic recovery system is built into the BIOS and comprises a policy configuration module, a peripheral self-check module, a power-on restart module, and a bus protocol module.
The policy configuration module configures the automatic recovery policy for peripheral mounting failures; the configuration includes the types of peripherals on the computer that require automatic recovery, the maximum number of power cycles for automatic recovery, fault records, saving of the BIOS start-up context, and the power-on/power-off control method for an individual board card in the system.
The peripheral self-check module performs power-on self-checks on the peripherals. Peripherals usually provide an interface for automatic detection: once the peripherals are powered on and running, the self-check module obtains the peripheral list and checks each device in turn with a self-check instruction. A normal device returns a preset signal indicating that it is working normally; otherwise it returns a failure error code or does not respond to the self-check instruction;
the power-on restart module re-powers the peripheral through the CPLD using a standard protocol;
the bus protocol module provides the BIOS's interface for external communication.
As shown in FIG. 3, the main steps of automatic recovery from a BIOS peripheral mounting failure are described below, taking a RAID card that fails to mount as an example:
Step one: the computer is powered on.
Step two: the hardware is initialized, and the automatic recovery system for peripheral mounting failure is loaded.
Step three: the BIOS scans the PCI and other device buses. PciaEnumeror() scans and loads the devices to enumerate the peripherals; NewDeviceTreeList() records the scanned devices and compares them with the list of normal devices, which provides the basis for later detecting missing devices, and generates the device list; SelfTestCheck() self-checks the scanned devices one by one to make sure they are loaded correctly, which provides the basis for deciding whether to self-check again.
Step four: determine whether any peripheral has not yet completed self-check and fault recovery. In this server, the RAID card is inserted in its slot but was not scanned successfully.
Step five: send a self-check instruction to the device through the preset interface of the RAID card;
Step six: BoolTestCheck() determines whether the device correctly returned its self-check result;
Step seven: EventRecord() records the self-check request in the log; if the number of self-check requests exceeds the set value, the device is skipped so that the whole system can start normally;
Step eight: determine whether the device's remaining number of automatic recovery attempts is at least 1. If so, go to the next step; if not, skip the self-check of the current peripheral, move on to the next device, and return to step four.
Step nine: the self-check interface reports a failure, a re-power request is sent to the CPLD (complex programmable logic device), and the CPLD re-powers the RAID card according to the preset power-on timing sequence.
Step ten: decrease the device's remaining number of automatic recovery attempts by 1 and return to step four.
Step eleven: end the process.
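The comparison in step three, where NewDeviceTreeList() checks the scanned devices against the list of normal devices, amounts to a set difference. This is a minimal sketch, assuming devices are keyed by PCI bus/device/function (BDF) address strings; the function name `find_missing` and the addresses in the usage note are hypothetical and are not the BIOS's actual data structures.

```python
def find_missing(normal_list: dict[str, str], scanned: set[str]) -> list[tuple[str, str]]:
    """Return (address, description) for each expected device the scan did not find.

    normal_list maps a hypothetical BDF address string to a device description;
    scanned holds the addresses found during enumeration. The result is sorted so
    the recovery loop visits missing devices in a stable order.
    """
    return sorted(
        (bdf, name) for bdf, name in normal_list.items() if bdf not in scanned
    )
```

With a normal list containing a GPU at `0000:17:00.0`, a RAID card at `0000:3b:00.0`, and an FC HBA at `0000:5e:00.0`, a scan that finds only the first and last returns `[("0000:3b:00.0", "RAID card")]`, which is the situation described in step four of the RAID example.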
With the CPLD and the BIOS working together, the fault is recovered on site in a shorter time and the efficiency of the whole machine is improved.
In the scheme of the invention, the BIOS and the CPLD cooperate to detect the fault and recover from it on site. During the PCI scan in the BIOS start-up process, the BIOS checks whether specific devices are present. When scanning and mounting or the self-check fails, the BIOS signals the CPLD that the hardware needs to be reset and re-powered; the CPLD receives the BIOS signal and re-powers the peripheral, and the BIOS scans again so that the peripheral returns to normal use. If the device has not recovered within this period, the BIOS and the CPLD repeat the exchange; after five such exchanges, the BIOS continues loading and executing normally.
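The BIOS/CPLD linkage described above (re-power, rescan, and at most five rounds before the boot continues) might look like the sketch below. Everything hardware-specific is an assumption: the steps and delays in `POWER_SEQUENCE` are illustrative placeholders for the CPLD's preset power-on timing sequence, and `rescan` stands in for the BIOS rescanning the slot. The `sleep` parameter is injectable so the logic can be tested without real delays.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical preset power-on timing sequence: (action, delay in seconds after it).
# The patent only says the sequence is "preset"; these steps and delays are illustrative.
POWER_SEQUENCE: List[Tuple[str, float]] = [
    ("assert PERST#", 0.0),
    ("remove slot power", 0.1),
    ("restore slot power", 0.1),
    ("deassert PERST#", 0.0),
]


def cpld_repower(slot: str, trace: List[str], sleep: Callable[[float], None]) -> None:
    """Stand-in for the CPLD executing the preset sequence on one slot."""
    for action, delay in POWER_SEQUENCE:
        trace.append(f"{slot}: {action}")
        sleep(delay)


def link_until_recovered(
    slot: str,
    rescan: Callable[[str], bool],
    trace: List[str],
    max_linkages: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-power then rescan, repeating up to max_linkages times before giving up."""
    for attempt in range(1, max_linkages + 1):
        cpld_repower(slot, trace, sleep)
        if rescan(slot):
            trace.append(f"{slot}: recovered after {attempt} linkage(s)")
            return True
    trace.append(f"{slot}: not recovered, boot continues")
    return False
```

Bounding the loop with `max_linkages` is what keeps a dead card from stalling the boot indefinitely, which matches the purpose of the maximum power-cycle setting in the policy configuration module.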
This design addresses the long recovery time of current fault-handling methods, and the advantage is particularly large for servers.
In summary, the above description is only a preferred embodiment of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (6)

1. An automatic recovery system for peripheral mounting failure, characterized in that, when the computer is powered on, the system enumerates the peripherals into a list and self-checks them one by one; when the BIOS detects that a peripheral is mounted normally, it self-checks the next device; when the BIOS detects that a peripheral cannot be mounted normally, it attempts automatic recovery by having the CPLD re-power the peripheral board card, and records the peripheral fault and the self-recovery outcome in a log; and after all peripherals have been self-checked, all peripheral drivers are loaded successfully and the peripherals are enabled.
2. The automatic recovery system for peripheral mounting failure according to claim 1, wherein the system comprises a policy configuration module, a peripheral self-check module, a power-on restart module, and a bus protocol module;
the policy configuration module configures the automatic recovery policy for peripheral mounting failures; the configuration includes the types of peripherals on the computer that require automatic recovery, the maximum number of power cycles allowed for automatic recovery, and the power-on/power-off control method for an individual board card;
the power-on restart module re-powers the peripheral through the CPLD using a standard protocol;
the bus protocol module provides the BIOS's interface for external communication.
3. The automatic recovery system for peripheral mounting failure according to claim 2, wherein the peripheral types in the policy configuration module are entered in advance as the basis for the policy configuration.
4. The automatic recovery system for peripheral mounting failure according to claim 3, wherein the maximum number of power cycles for automatic recovery in the policy configuration module limits the maximum number of times the automatic recovery action is executed, so that repeated ineffective attempts cannot render the whole system unusable.
5. The automatic recovery system for peripheral mounting failure according to claim 4, wherein the power-on/power-off control method for a single board card in the policy configuration module determines, according to the peripheral type, at which stage of BIOS start-up fault recovery is performed, and gives the CPLD a corresponding execution time; the peripheral self-check module performs power-on self-checks on the peripherals: once the peripherals are powered on, it obtains the peripheral list and checks each device in turn with a self-check instruction; a normal device returns a preset signal indicating that it is working normally, otherwise the device returns a failure error code or does not respond to the self-check instruction.
6. An automatic recovery method for peripheral mounting failure, characterized in that the method comprises the following steps:
Step one: the computer is powered on and starts running the BIOS (Basic Input/Output System);
Step two: in the BIOS initialization stage, the automatic recovery system for peripheral mounting failure is loaded, so that the BIOS and the CPLD are interconnected, and the state of the system devices is monitored;
Step three: the BIOS scans the peripheral bus, enumerates the peripherals, generates a device list, and self-checks the devices one by one;
Step four: determine whether any peripheral has not yet been mounted successfully. If so, go to the next step; if not, go to step eleven to end the process;
Step five: send a self-check instruction to the device through the preset interface of the current peripheral;
Step six: determine whether the device returns a correct self-check result;
Step seven: record the device's abnormal start-up in the boot log;
Step eight: determine whether the device's remaining number of automatic recovery attempts is at least 1. If so, go to the next step; if not, skip the self-check of the current peripheral, move on to the next device, and return to step four;
Step nine: the BIOS sends a command to the CPLD to re-power the peripheral board card, and the CPLD powers the peripheral up again according to the preset power-on timing sequence;
Step ten: decrease the device's remaining number of automatic recovery attempts by 1 and return to step four;
Step eleven: end the process.
CN202111664656.XA 2021-12-31 2021-12-31 Automatic recovery system and method for peripheral mounting failure Pending CN114510374A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111664656.XA CN114510374A (en) 2021-12-31 2021-12-31 Automatic recovery system and method for peripheral mounting failure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111664656.XA CN114510374A (en) 2021-12-31 2021-12-31 Automatic recovery system and method for peripheral mounting failure

Publications (1)

Publication Number Publication Date
CN114510374A 2022-05-17

Family

ID=81548709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111664656.XA Pending CN114510374A (en) 2021-12-31 2021-12-31 Automatic recovery system and method for peripheral mounting failure

Country Status (1)

Country Link
CN (1) CN114510374A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115586981A (en) * 2022-11-25 2023-01-10 深圳华北工控股份有限公司 Method, system, computer and storage medium for preventing SIO signal loss

Similar Documents

Publication Publication Date Title
US20240086269A1 (en) Method, Apparatus and System for Locating Fault of Server, and Computer-readable Storage Medium
CN107122321B (en) Hardware repair method, hardware repair system, and computer-readable storage device
US20070234123A1 (en) Method for detecting switching failure
CN108304282B (en) Control method of double BIOS and related device
CN111488233A (en) Method and system for processing bandwidth loss problem of PCIe device
CN113064757B (en) Server firmware self-recovery system and server
CN111143132B (en) BIOS recovery method, device, equipment and readable storage medium
CN215769715U (en) Recovery device for abnormal starting
CN114116280B (en) Interactive BMC self-recovery method, system, terminal and storage medium
CN115237644B (en) System fault processing method, central operation unit and vehicle
US11263083B1 (en) Method and apparatus for selective boot-up in computing devices
CN111949333A (en) System and method for realizing main-standby switching of BIOS (basic input output System) of ARM (advanced RISC machine) server
CN111338698A (en) Method and system for accurately booting server by BIOS (basic input output System)
CN115809164A (en) Embedded equipment, embedded system and hierarchical reset control method
CN114510374A (en) Automatic recovery system and method for peripheral mounting failure
CN117389781B (en) Abnormality detection and recovery method and system for server equipment, server and medium
CN104657232A (en) BIOS automatic recovery system and BIOS automatic recovery method
KR100605031B1 (en) A method for upgrading and restoring embeded systems by using usb memory device
CN110928726A (en) Embedded system self-recovery method and system based on watchdog and PXE
CN116048400A (en) Hardware recovery method, device, equipment and readable storage medium
CN113220324B (en) CPLD remote updating method, system and medium
CN114385405A (en) Method, device and system for realizing server restart reason recording
CN114528555A (en) ARM server firmware safety check starting management method, device and medium
CN115081035B (en) Program encryption upgrading circuit and method based on processor and FPGA chip
CN115292090A (en) SoC system mirror image automatic repair circuit and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 north side, 13th floor, Taiji building, No.6 working area (South), wohuqiao, Haidian District, Beijing

Applicant after: Kunlun Taike (Beijing) Technology Co.,Ltd.

Address before: 100083 north side, 13th floor, Taiji building, No.6 working area (South), wohuqiao, Haidian District, Beijing

Applicant before: CLP Technology (Beijing) Co.,Ltd.