WO2022198973A1 - Server firmware self-recovery system and server - Google Patents

Server firmware self-recovery system and server

Info

Publication number
WO2022198973A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage module
controller
module
main storage
startup
Prior art date
Application number
PCT/CN2021/121423
Other languages
English (en)
French (fr)
Inventor
韩红瑞
Original Assignee
山东英信计算机技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东英信计算机技术有限公司
Priority to US18/024,809 (US20230333621A1)
Publication of WO2022198973A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/30 Means for acting in the event of power-supply failure or interruption, e.g. power-supply fluctuations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1469 Backup restoration techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1417 Boot up procedures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G06F11/3062 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations where the monitored property is the power consumption
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1441 Resetting or repowering

Definitions

  • the present application relates to the field of server operation and maintenance, and in particular, to a server firmware self-recovery system and a server.
  • the firmware program of the server system is usually stored in a Flash (flash memory) chip, and the system startup is completed by reading the firmware program in the Flash chip.
  • existing operation and maintenance practice can only report the system's failure to start to the operation and maintenance server; maintenance personnel must then perform abnormal repair processing on the Flash chip afterwards, which results in longer operation and maintenance time, lower operation and maintenance efficiency, and greater operation and maintenance work pressure.
  • the purpose of this application is to provide a server firmware self-recovery system and server, which can automatically perform abnormal repair processing on abnormally activated storage modules, thereby reducing operation and maintenance time, improving operation and maintenance efficiency, and reducing operation and maintenance work pressure.
  • the present application provides a server firmware self-recovery system, including:
  • a storage module for storing system firmware programs;
  • a startup controller configured to read the system firmware program from the storage module to start the system when a communication connection is established between itself and the storage module;
  • a repair controller configured to automatically perform abnormal repair processing on the storage module when the storage module starts abnormally and a communication connection is established between itself and the storage module;
  • a logic controller, connected to the storage module, the startup controller and the repair controller respectively, configured to establish a communication connection between the storage module and the startup controller by default in the initial situation and, if it detects that the startup controller fails to start the system, to determine that the storage module has started abnormally and switch the storage module to establish a communication connection with the repair controller.
  • the startup controller includes an ME; the system firmware program includes an ME firmware program;
  • the startup controller is specifically configured to read the ME firmware program from the storage module to make the ME run when it has established a communication connection with the storage module and, after receiving the power button signal, to use the ME to generate a power-on startup status signal;
  • the logic controller is specifically configured to receive the power button signal of the system and send it to the startup controller, and to determine whether the power-on startup status signal returned by the startup controller is received within a preset time; if yes, it controls the system hardware to power on so as to start the system; if not, it determines that the storage module has started abnormally and switches the storage module to establish a communication connection with the repair controller.
  • the logic controller includes:
  • a detection module, connected to the startup controller, configured to send a status signal timeout result to the control module if, after the power button signal is sent to the startup controller, the power-on startup status signal returned by the startup controller is not received within the preset time;
  • a gating module whose first gating end is connected to the storage module, whose first transmission end is connected to the startup controller, and whose second transmission end is connected to the repair controller;
  • a control module, connected to the detection module, the second communication module, the status register and the gating module respectively, configured to control the gating module to connect the first gating end with the first transmission end by default in the initial situation and, after receiving the status signal timeout result, to determine that the storage module has started abnormally and record the abnormal startup and its cause in the status register;
  • the repair controller is also configured to poll the status register through its communication module and, when it finds that the storage module has started abnormally, to have the control module switch the first gating end of the gating module to connect with the second transmission end, so as to automatically perform abnormal repair processing on the storage module based on the queried cause of the abnormal startup.
  • the logic controller further includes:
  • a state memory module, connected to the control module, configured to store the most recent gating state of the gating module; the initial default gating state stored in the state memory module is: the first gating end is connected to the first transmission end;
  • the control module is also configured to read the most recent gating state of the gating module from the state memory module after the system is AC powered on and starts to work, and to control the gating module to maintain that gating state.
  • the storage module includes a main storage module and a backup storage module that both store system firmware programs;
  • the logic controller is specifically configured to establish a communication connection between the main storage module and the startup controller by default in the initial situation; if it detects that the startup controller fails to start the system, it determines that the main storage module has started abnormally, switches the main storage module to establish a communication connection with the repair controller, and establishes a communication connection between the backup storage module and the startup controller, so that the startup controller restarts the system.
  • the logic controller is further configured to determine that the backup storage module has started abnormally if, after establishing a communication connection between the backup storage module and the startup controller, it detects that the startup controller fails to restart the system;
  • the repair controller is specifically configured to automatically perform abnormal repair processing on the main storage module when the main storage module starts abnormally and the backup storage module works normally, and to perform abnormal repair processing on the main storage module and the backup storage module in sequence when both start abnormally.
  • the process of performing abnormal repair processing on the main storage module includes:
  • re-flashing the backup ME firmware program into the main storage module for automatic repair, and switching the main storage module back to re-establish a communication connection with the startup controller;
  • the process of sequentially performing abnormal repair processing on the primary storage module and the backup storage module includes:
  • re-flashing the backup ME firmware program into the main storage module for automatic repair; after the flashing is completed, controlling the logic controller to switch the main storage module back to establish a communication connection with the startup controller, and controlling the startup controller to restart the system to determine whether the system can restart normally;
  • re-flashing the ME firmware program into the backup storage module; after the flashing is completed, controlling the logic controller to switch the backup storage module back to establish a communication connection with the startup controller, and controlling the startup controller to restart the system to determine whether the system can restart normally; if the system can restart normally, determining that the main storage module and the backup storage module have been repaired;
  • the system firmware program further includes a BIOS firmware program; the startup controller is connected to the repair controller;
  • the startup controller is further configured to read the BIOS firmware program from the storage module to start the BIOS after the system is powered on;
  • the repair controller is also configured to monitor for abnormal BIOS startup while the startup controller starts the BIOS and, when the BIOS starts abnormally because of a problem with the BIOS firmware program, to automatically perform BIOS firmware abnormal repair processing on the storage module.
  • the startup controller is a CPU; the repair controller is a BMC; and the logic controller is a CPLD.
  • the present application also provides a server, including any of the above-mentioned server firmware self-recovery systems.
  • the application provides a server firmware self-recovery system, including a storage module, a startup controller, a repair controller and a logic controller.
  • the storage module is used to store system firmware programs; the startup controller is used to read the system firmware program from the storage module to start the system when a communication connection is established between itself and the storage module; the repair controller is used to automatically perform abnormal repair processing on the storage module when the storage module starts abnormally and a communication connection is established between itself and the storage module;
  • the logic controller is used to establish a communication connection between the storage module and the startup controller by default in the initial situation; if it detects that the startup controller fails to start the system, it determines that the storage module has started abnormally and switches the storage module to establish a communication connection with the repair controller. It can be seen that the present application can automatically perform abnormal repair processing on a storage module that starts abnormally, thereby reducing operation and maintenance time, improving operation and maintenance efficiency, and reducing operation and maintenance work pressure.
  • the present application also provides a server, which has the same beneficial effects as the above-mentioned firmware self-recovery system.
  • FIG. 1 is a schematic structural diagram of a server firmware self-recovery system according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a specific structure of a server firmware self-recovery system according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of a first connection structure of a gating module provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a second connection structure of a gating module provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a third connection structure of a gating module provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a fourth connection structure of a gating module provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a fifth connection structure of a gating module provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a sixth connection structure of a gating module provided in an embodiment of the present application.
  • the core of the present application is to provide a server firmware self-recovery system and server, which can automatically perform abnormal repair processing on abnormally activated storage modules, thereby reducing operation and maintenance time, improving operation and maintenance efficiency, and reducing operation and maintenance work pressure.
  • FIG. 1 is a schematic structural diagram of a server firmware self-recovery system according to an embodiment of the present application.
  • the server firmware self-recovery system includes:
  • the storage module 1 is used to store the system firmware program
  • the startup controller 2 is used to read the system firmware program from the storage module 1 to start the system when it establishes a communication connection with the storage module 1;
  • the repair controller 3 is used to automatically perform abnormal repair processing on the storage module 1 when the storage module 1 starts abnormally and a communication connection is established between the repair controller 3 and the storage module 1;
  • the logic controller 4, connected to the storage module 1, the startup controller 2 and the repair controller 3 respectively, is used to establish a communication connection between the storage module 1 and the startup controller 2 by default in the initial situation; if it detects that the startup controller 2 fails to start the system, it determines that the storage module 1 has started abnormally and switches the storage module 1 to establish a communication connection with the repair controller 3.
  • the server firmware self-recovery system of the present application includes a storage module 1, a startup controller 2, a repair controller 3 and a logic controller 4, and its working principle is:
  • the storage module 1 is used to store the system firmware program.
  • the logic controller 4 establishes a communication connection between the storage module 1 and the startup controller 2 by default.
  • the startup controller 2 reads the system firmware program from the storage module 1 to start the system.
  • the logic controller 4 monitors the startup controller 2 as it starts the system; if it detects that the startup controller 2 fails to start the system, it determines that the storage module 1 has started abnormally and switches the storage module 1 to establish a communication connection with the repair controller 3.
  • the repair controller 3 then automatically performs abnormal repair processing on the storage module 1.
  • the present application can automatically perform abnormal repair processing on the abnormally activated storage module 1, thereby reducing the operation and maintenance time, improving the operation and maintenance efficiency, and reducing the operation and maintenance work pressure.
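  • The working principle above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the patented implementation: the names `Route`, `LogicController` and `on_boot_result` are hypothetical, and the boot result is passed in as a plain boolean rather than detected from hardware signals.

```python
from enum import Enum


class Route(Enum):
    """Which controller the storage module is currently connected to."""
    STARTUP = "startup controller"
    REPAIR = "repair controller"


class LogicController:
    """Hypothetical model of logic controller 4: one storage module, two controllers."""

    def __init__(self) -> None:
        # Initial situation: storage module <-> startup controller by default.
        self.route = Route.STARTUP

    def on_boot_result(self, boot_ok: bool) -> Route:
        """Switch the storage module to the repair controller if startup failed."""
        if not boot_ok:
            # A failed startup is treated as an abnormal start of the storage module.
            self.route = Route.REPAIR
        return self.route
```

The model keeps a single `route` attribute so that the storage module is never connected to both controllers at once.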
  • FIG. 2 is a schematic diagram of a specific structure of a server firmware self-recovery system according to an embodiment of the present application.
  • the startup controller 2 includes an ME; the system firmware program includes an ME firmware program;
  • the startup controller 2 is specifically used to read the ME firmware program from the storage module 1 to make the ME run when it has established a communication connection with the storage module 1 and, after receiving the power button signal, to use the ME to generate the power-on startup status signal;
  • the logic controller 4 is specifically used to receive the power button signal of the system and send it to the startup controller 2, and to determine whether the power-on startup status signal returned by the startup controller 2 is received within a preset time; if so, it controls the system hardware to power on so as to start the system; if not, it determines that the storage module 1 has started abnormally and switches the storage module 1 to establish a communication connection with the repair controller 3.
  • the startup controller 2 of the present application includes an ME (Management Engine), and the system firmware program includes the ME firmware program.
  • when the power button of the system is pressed, the level of the power button signal changes, and the power button signal is sent to the logic controller 4.
  • after receiving the power button signal of the system, the logic controller 4 sends the power button signal to the startup controller 2.
  • when the startup controller 2 has established a communication connection with the storage module 1, it reads the ME firmware program from the storage module 1 to make the ME run and, after receiving the power button signal, uses the ME to generate a power-on startup status signal (S3 status signal).
  • the logic controller 4 starts timing when it sends the power button signal to the startup controller 2 and, when the elapsed time reaches the preset time T0 (e.g., 5 s), judges whether the startup controller 2 has sent the power-on startup status signal. If the startup controller 2 sends the power-on startup status signal within the preset time T0, the ME of the startup controller 2 is determined to be running normally; if it does not, the ME of the startup controller 2 is determined to be running abnormally, so the storage module 1 is determined to have started abnormally and is switched to establish a communication connection with the repair controller 3, so that the repair controller 3 automatically performs abnormal repair processing on the storage module 1.
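  • The timeout judgment described above might look like the following Python sketch. The function names and the use of plain timestamps in seconds are assumptions made for illustration; only the preset time T0 and its 5 s example value come from the text, and treating a signal that arrives exactly at T0 as "in time" is likewise an assumption.

```python
from typing import Optional

T0_SECONDS = 5.0  # preset time T0; the text gives 5 s as an example


def me_started_in_time(sent_at: float, status_at: Optional[float],
                       t0: float = T0_SECONDS) -> bool:
    """Return True if the power-on startup status signal arrived within t0.

    sent_at:   time the power button signal was forwarded to the startup controller
    status_at: time the status signal came back, or None if it never arrived
    """
    return status_at is not None and (status_at - sent_at) <= t0


def next_action(sent_at: float, status_at: Optional[float],
                t0: float = T0_SECONDS) -> str:
    """Map the timing result to the action taken by the logic controller."""
    if me_started_in_time(sent_at, status_at, t0):
        return "power_on_system_hardware"
    # ME did not respond in time: the storage module is treated as abnormal.
    return "switch_storage_to_repair_controller"
```

For example, `next_action(0.0, 1.2)` selects the normal power-on path, while `next_action(0.0, None)` selects the switch to the repair controller.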
  • the logic controller 4 includes:
  • the detection module 41, connected to the startup controller 2, is used to send a status signal timeout result to the control module 45 if, after the power button signal is sent to the startup controller 2, the power-on startup status signal returned by the startup controller 2 is not received within the preset time;
  • a second communication module 43 connected to the first communication module in the repair controller 3;
  • the control module 45, connected to the detection module 41, the second communication module 43, the status register 44 and the gating module 42 respectively, is used to control the gating module 42 to connect the first gating end with the first transmission end by default in the initial situation and, after receiving the status signal timeout result, to determine that the storage module 1 has started abnormally and record the abnormal startup and its cause in the status register 44;
  • the repair controller 3 is also used to poll the status register 44 through its communication module and, when it finds that the storage module 1 has started abnormally, to have the control module 45 switch the first gating end of the gating module 42 to connect with the second transmission end, so as to automatically perform abnormal repair processing on the storage module 1 based on the queried cause of the abnormal startup.
  • the logic controller 4 of the present application includes a detection module 41, a gating module 42, a second communication module 43, a status register 44 and a control module 45, and its working principle is:
  • the control module 45 controls the gating module 42 to connect the first gating terminal with the first transmission terminal by default, that is, to establish a communication connection between the storage module 1 and the startup controller 2 by default.
  • when the power button of the system is pressed, the level of the power button signal changes, and the power button signal is sent to the control module 45 of the logic controller 4.
  • the control module 45 sends the power button signal to the startup controller 2 via the detection module 41.
  • when the startup controller 2 has established a communication connection with the storage module 1, it reads the ME firmware program from the storage module 1 to make the ME run and, after receiving the power button signal, uses the ME to generate a power-on startup status signal.
  • the detection module 41 starts timing when the power button signal is sent to the startup controller 2 and, when the elapsed time reaches the preset time T0, determines whether the startup controller 2 has sent the power-on startup status signal. If the startup controller 2 sends the signal within the preset time T0, the power-on startup status signal is determined not to have timed out; if it does not, the power-on startup status signal is determined to have timed out, and the status signal timeout result is sent to the control module 45.
  • after receiving the status signal timeout result, the control module 45 determines that the storage module 1 has started abnormally and records the abnormal startup and its cause (e.g., status signal timeout) in the status register 44.
  • the repair controller 3 can poll the status register 44 periodically (for example, every 2 s) through its communication module. When it reads from the status register 44 that the storage module 1 has started abnormally, it communicates with the control module 45 to have the gating module 42 switch the first gating end to connect with the second transmission end, that is, to switch the storage module 1 to establish a communication connection with the repair controller 3. The repair controller 3 then automatically performs abnormal repair processing on the storage module 1 based on the abnormal startup cause read from the status register 44.
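  • One polling step of the repair controller can be sketched as follows. The `StatusRegister` class, the callback parameters and the clearing of the register after repair are all hypothetical choices made for this illustration; the text only specifies that the register is polled, that the gating module is switched, and that repair is based on the recorded cause.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StatusRegister:
    """Hypothetical model of status register 44."""
    abnormal: bool = False
    cause: Optional[str] = None


def poll_once(reg: StatusRegister,
              switch_to_repair: Callable[[], None],
              repair: Callable[[Optional[str]], None]) -> bool:
    """Run one polling step; return True if a repair was triggered."""
    if not reg.abnormal:
        return False
    # Ask the control module to connect the storage module to the repair controller.
    switch_to_repair()
    # Repair based on the cause recorded by the control module.
    repair(reg.cause)
    # Assumption: the record is cleared once it has been handled.
    reg.abnormal, reg.cause = False, None
    return True
```

In a running system this step would be called on a timer (the text mentions a 2 s interval as an example); the loop is left out here so the sketch stays free of sleeps.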
  • the logic controller 4 further includes:
  • the state memory module 46, connected to the control module 45, is used to store the most recent gating state of the gating module 42; the initial default gating state stored in the state memory module 46 is: the first gating end is connected to the first transmission end;
  • the control module 45 is further configured to read the most recent gating state of the gating module 42 from the state memory module 46 after the system is AC powered on and starts to work, and to control the gating module 42 to maintain that gating state.
  • the logic controller 4 of the present application also includes a state memory module 46, and its working principle is:
  • the state memory module 46 is used to store the latest gating state of the gating module 42 .
  • the control module 45 starts to work after the system AC (alternating current) is powered on, first reads the last gating state of the gating module 42 from the state memory module 46, and then controls the gating module 42 to maintain the last gating state, thereby The last gating state of the gating module 42 before the system is powered off is restored.
  • the initial default gating state stored in the state memory module 46 is: the first gating end is connected to the first transmission end. In the initial situation, the control module 45 therefore reads this default gating state from the state memory module 46 and controls the gating module 42 to connect the first gating end with the first transmission end; that is, the storage module 1 establishes a communication connection with the startup controller 2 by default.
  • the state memory module 46 can also be connected to the second communication module 43, so that the repair controller 3 can query the state memory module 46 for the most recent gating state of the gating module 42.
  • the state memory module 46 of the present application is a non-volatile memory, and the status register 44 is a volatile memory.
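  • The split between the non-volatile state memory and the volatile status register can be illustrated with a small Python model. Everything here is a sketch under assumptions: the class and method names are hypothetical, the gating states are plain strings, and an ordinary object stands in for the non-volatile memory.

```python
DEFAULT_GATING = "flash_to_startup_controller"  # initial default gating state


class StateMemory:
    """Stand-in for the non-volatile state memory module 46."""

    def __init__(self) -> None:
        self._cell = DEFAULT_GATING

    def save(self, state: str) -> None:
        self._cell = state

    def load(self) -> str:
        return self._cell


class ControlModule:
    """Hypothetical model of control module 45 across power cycles."""

    def __init__(self, memory: StateMemory) -> None:
        self.memory = memory
        self.status_register: dict = {}  # volatile: does not survive power loss
        self.gating: str = ""

    def on_ac_power_on(self) -> str:
        """Restore the gating state that was in effect before power-off."""
        self.status_register = {}          # volatile contents start empty
        self.gating = self.memory.load()   # non-volatile contents are restored
        return self.gating

    def set_gating(self, state: str) -> None:
        """Apply a new gating state and remember it for the next power-on."""
        self.gating = state
        self.memory.save(state)
```

Creating a second `ControlModule` around the same `StateMemory` plays the role of an AC power cycle: the gating state survives, the status register does not.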
  • the storage module 1 includes a main storage module 11 and a backup storage module 12 that both store system firmware programs;
  • the logic controller 4 is specifically used to establish a communication connection between the main storage module 11 and the startup controller 2 by default in the initial situation; if it detects that the startup controller 2 fails to start the system, it determines that the main storage module 11 has started abnormally, switches the main storage module 11 to establish a communication connection with the repair controller 3, and establishes a communication connection between the backup storage module 12 and the startup controller 2, so that the startup controller 2 restarts the system.
  • there are two storage modules in the present application, namely the main storage module 11 and the backup storage module 12, and the working principle is as follows:
  • the logic controller 4 establishes a communication connection between the main storage module 11 and the startup controller 2 by default in the initial situation.
  • when the startup controller 2 has established a communication connection with the main storage module 11, the startup controller 2 reads the system firmware program from the main storage module 11 to start the system.
  • the logic controller 4 monitors the startup controller 2 as it starts the system; if it detects that the startup controller 2 fails to start the system, it determines that the main storage module 11 has started abnormally, switches the main storage module 11 to establish a communication connection with the repair controller 3 (for the detailed working principle, refer to the above embodiments), and establishes a communication connection between the backup storage module 12 and the startup controller 2, so that the startup controller 2 restarts the system.
  • the gating module 42 of the present application never gates two storage modules to the same controller at the same time, and never gates the same storage module to two controllers at the same time.
  • the gating module 42 includes two gating ends and two transmission ends: the first gating end is connected to the main storage module 11 (Flash 1), the second gating end is connected to the backup storage module 12 (Flash 2), the first transmission end is connected to the startup controller 2, and the second transmission end is connected to the repair controller 3. The gating module 42 then has 6 gating states (excluding the fully disconnected state): as shown in FIG. 3, the first gating end is connected to the first transmission end, so Flash 1 is connected to the startup controller; as shown in FIG. 4, the second gating end is connected to the second transmission end, so Flash 2 is connected to the repair controller; as shown in FIG. 5, the first gating end is connected to the first transmission end, so Flash 1 is connected to the startup controller, and the second gating end is communicated
  • the present application can express the gating state of the gating module 42 by means of codes, so as to perform gating control on the gating module 42. For example, Flash 1 connected to the startup controller corresponds to code 001; Flash 2 connected to the repair controller corresponds to code 010; forward full connection corresponds to code 011; Flash 2 connected to the startup controller corresponds to code 101; Flash 1 connected to the repair controller corresponds to code 110;
  • The logic controller 4 is further configured to determine that the backup storage module 12 has a startup abnormality if, after connecting the backup storage module 12 to the boot controller 2, it detects that the boot controller 2 has failed to reboot the system;
  • The repair controller 3 is specifically configured to automatically repair the main storage module 11 when the main storage module 11 has a startup abnormality and the backup storage module 12 works normally, and to repair the main storage module 11 and the backup storage module 12 in sequence when both have startup abnormalities.
  • After the logic controller 4 connects the backup storage module 12 to the boot controller 2 so that the boot controller 2 reboots the system, if it detects that the reboot has failed, it determines that the backup storage module 12 also has a startup abnormality.
  • The repair controller 3 repairs whichever storage module has a startup abnormality.
  • In the initial state, the control module 45 connects the main storage module 11 to the boot controller 2 by default.
  • The control module 45 sends the power button signal to the boot controller 2 via the detection module 41.
  • When the boot controller 2 is connected to the main storage module 11, it reads the ME firmware program from the main storage module 11 so that the ME runs and, after receiving the power button signal, uses the ME to generate the power-on status signal.
  • The detection module 41 determines whether the boot controller 2 issues the power-on status signal within the preset time; if not, it determines that the power-on status signal has timed out and sends a status-signal timeout result to the control module 45. After receiving the timeout result, the control module 45 determines that the main storage module 11 has a startup abnormality and records the abnormality and its cause (such as status-signal timeout) in the status register 44. The control module 45 then checks the status register 44 for an abnormality record on the other storage module. If there is none, it controls the gating module 42 to connect the other storage module to the boot controller 2 and reboot the system; if there is one, both storage modules have startup abnormalities.
  • If the repair controller 3 is working normally, it polls the status register 44 of the logic controller 4. When it finds a storage module startup abnormality in the status register 44, it reads the recorded cause. If the cause is a status-signal timeout, it performs fault location and attempts a repair. If only the main storage module 11 is recorded as abnormal, the backup storage module 12 is currently working normally, and the problem with the main storage module 11 must be located and repaired without affecting system operation.
  • To that end, the repair controller 3 switches the gating module 42 from the mode in which the backup storage module 12 is connected to the boot controller 2 to the reverse full connection mode (backup storage module 12 connected to the boot controller 2, main storage module 11 connected to the repair controller 3; the connection between the backup storage module 12 and the boot controller 2 is not interrupted during the switch, so communication is unaffected), and then repairs the main storage module 11. After the repair, the system is not rebooted, to avoid disrupting services. If the status register 44 shows that both storage modules have startup abnormalities, the repair controller 3 repairs them one by one; if the problem is in the firmware, it attempts a reboot after the repair to verify that the repair succeeded.
  • If the repair controller 3 is not working normally, the logic controller 4 takes over: it retries each storage module in turn several times (for example, 3 times) in an attempt to power on.
  • The process of repairing the main storage module includes:
  • If the supply voltage is normal, the peripheral circuit of the main storage module is determined to be normal, and it is then determined whether the main storage module can be accessed normally;
  • The backed-up ME firmware program is re-flashed to the main storage module for automatic repair, and the main storage module is switched back to re-establish its connection with the boot controller.
  • Abnormal operation of the boot controller's ME may be caused by an abnormal storage module, an abnormal peripheral circuit of the storage module, an ME firmware program missing from the storage module, a corrupted ME firmware program in the storage module, a fault in the ME itself, and so on. In each case the boot controller cannot issue the power-on status signal, and the system cannot enter power-on mode.
  • The process of repairing a main storage module that has a startup abnormality includes: 1) Read whether the supply voltage of the main storage module is normal. If it is, the peripheral circuit of the main storage module is determined to be normal; if it is not, the peripheral circuit is determined to be abnormal, the fault is classified as a server mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard (firmware cannot be recovered automatically from a hardware fault). 2) If the supply voltage is normal, access the main storage module. If access succeeds, the main storage module is determined to be normal; if it cannot be accessed, the main storage module is determined to be damaged, the fault is classified as a mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard.
  • 3) If the main storage module is normal, determine whether the ME firmware program in it can be read. If it cannot, record the cause of this power-on abnormality as a missing ME firmware program, re-flash the backed-up ME firmware program to the main storage module for automatic repair, and write to the status register of the logic controller that the main storage module has been repaired. The logic controller clears the abnormality records of the main storage module and switches the main storage module back to the boot controller. No reboot is performed, to avoid disrupting services; the event is reported to the operation and maintenance server, and operation and maintenance personnel decide whether to return to booting from the main storage module at the next reboot.
  • 4) If the ME firmware program in the main storage module can be read, verify it: compute its check value and compare it with the check value of the ME firmware program backed up by the system. If the values match, the comparison passes and the ME firmware program in the main storage module is judged intact. If they differ, the comparison fails and the ME firmware program is judged corrupted; the backed-up ME firmware program is re-flashed to the main storage module for automatic repair, and the status register of the logic controller is updated to show that the main storage module has been repaired. The logic controller clears the abnormality records of the main storage module and switches the main storage module back to the boot controller. No reboot is performed, to avoid disrupting services; the event is reported to the operation and maintenance server, and operation and maintenance personnel decide whether to return to booting from the main storage module at the next reboot.
  • 5) If the main storage module still has a startup abnormality, the fault is classified as a fault in the ME itself, and an alarm is reported to replace the relevant ME components and/or the mainboard.
  • The process of repairing the main storage module and the backup storage module in sequence includes:
  • If the supply voltage is normal, the peripheral circuit of the main storage module is determined to be normal, and it is then determined whether the main storage module can be accessed normally;
  • The backed-up ME firmware program is re-flashed to the main storage module for automatic repair, and after flashing completes the logic controller is directed to switch the main storage module back to the boot controller.
  • The logic controller is directed to switch the backup storage module back to the boot controller, and the boot controller is directed to reboot the system to determine whether the system reboots normally; if it does, the main storage module and the backup storage module are determined to have been repaired;
  • The logic controller is directed to switch the main storage module back to the boot controller, and the boot controller is directed to reboot the system to determine whether the system reboots normally; if it does, the backed-up ME firmware program is re-flashed directly to the backup storage module, after flashing completes the logic controller is directed to switch the backup storage module back to the boot controller, and the boot controller is directed to reboot the system to determine whether the system reboots normally; if it does, the main storage module and the backup storage module are determined to have been repaired;
  • The process of repairing the main storage module and the backup storage module in sequence includes: 1) Read whether the supply voltage of the main storage module is normal. If it is, the peripheral circuit of the main storage module is determined to be normal; if it is not, the peripheral circuit is determined to be abnormal, the fault is classified as a server mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard. 2) If the supply voltage is normal, access the main storage module. If access succeeds, the main storage module is determined to be normal; if it cannot be accessed, the main storage module is determined to be damaged, the fault is classified as a mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard.
  • 3) If the main storage module is normal, determine whether the ME firmware program in it can be read. If it cannot, record the cause of this power-on abnormality as a missing ME firmware program and re-flash the backed-up ME firmware program to the main storage module for automatic repair. 4) If the ME firmware program can be read, verify it: compute its check value and compare it with the check value of the ME firmware program backed up by the system. If the values match, the comparison passes and the ME firmware program in the main storage module is judged intact; if they differ, the comparison fails, the ME firmware program is judged corrupted, and the backed-up ME firmware program is re-flashed to the main storage module.
  • In either case, after flashing completes the logic controller is directed to switch the main storage module back to the boot controller, and the boot controller is directed to reboot the system to determine whether the system reboots normally. If it does, the backed-up ME firmware program is re-flashed directly to the backup storage module; after flashing completes the logic controller is directed to switch the backup storage module back to the boot controller, and the boot controller is directed to reboot the system to determine whether the system reboots normally. If it does, the main storage module and the backup storage module are determined to have been repaired, and this is written to the status register of the logic controller.
  • The logic controller clears the abnormality records of the main storage module and the backup storage module, switches the main storage module back to the boot controller, reboots the system, and reports to the operation and maintenance server, completing automatic firmware recovery. 5) If the system reboots abnormally, the fault is classified as a fault in the ME itself, and an alarm is reported to replace the relevant ME components and/or the mainboard.
  • The system firmware program further includes a BIOS firmware program; the boot controller 2 is connected to the repair controller 3;
  • The boot controller 2 is further configured to read the BIOS firmware program from the storage module to start the BIOS after system power-on completes;
  • The repair controller 3 is further configured to check for BIOS startup abnormalities while the boot controller 2 starts the BIOS, and to automatically perform BIOS firmware repair on the storage module when the BIOS fails to start because of a problem with the BIOS firmware program.
  • The system firmware program of this application also includes a BIOS (Basic Input Output System) firmware program, and the boot controller 2 is also connected to the repair controller 3. The working principle is:
  • The repair controller 3 checks for BIOS startup abnormalities while the boot controller 2 starts the BIOS and, when the BIOS fails to start because of a problem with the BIOS firmware program, automatically performs BIOS firmware repair on the storage module (the principle is similar to that of ME firmware repair and is not repeated here).
  • the startup controller 2 is a CPU; the repair controller 3 is a BMC; and the logic controller 4 is a CPLD.
  • Older server versions include a PCH (Platform Controller Hub, integrated south bridge), with the ME integrated in the PCH;
  • Newer server versions remove the PCH and integrate its functions into the CPU (central processing unit), so the ME is also in the CPU.
  • In this application, the boot controller 2 is a CPU, the repair controller 3 is a BMC (Baseboard Management Controller), and the logic controller 4 is a CPLD (Complex Programmable Logic Device).
  • This application can also display on a web (World Wide Web) page the current gating state of the gating module, which storage module the system is currently booting from, the status of the two storage modules, and so on, and lets the user manually change which storage module is configured for booting and manually switch the gating state of the gating module.
  • Through the BMC, the user can also remotely issue API (Application Programming Interface) commands that query the current gating state of the gating module, which storage module the system is currently booting from, and the status of the two storage modules, as well as API commands that manually change which storage module is configured for booting and manually switch the gating state of the gating module.
  • the present application also provides a server, including any of the above-mentioned server firmware self-recovery systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)
  • Stored Programmes (AREA)

Abstract

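A server firmware self-recovery system and a server. The system comprises a storage module (1), a boot controller (2), a repair controller (3), and a logic controller (4). The storage module (1) stores a system firmware program. The boot controller (2), when connected to the storage module (1), reads the system firmware program from the storage module (1) to boot the system. The repair controller (3), when the storage module (1) has a startup abnormality and is connected to the repair controller (3), automatically performs abnormality repair on the storage module (1). The logic controller (4) connects the storage module (1) to the boot controller (2) by default in the initial state; if it detects that the boot controller (2) has failed to boot the system, it determines that the storage module (1) has a startup abnormality and switches the storage module (1) over to the repair controller (3). The firmware self-recovery system reduces operation and maintenance (O&M) time, improves O&M efficiency, and eases the O&M workload.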

Description

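Server firmware self-recovery system and server
This application claims priority to Chinese patent application No. 202110326283.9, entitled "Server firmware self-recovery system and server", filed with the China Patent Office on March 26, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of server operation and maintenance (O&M), and in particular to a server firmware self-recovery system and a server.
Background
As data centers grow, fleets on the order of tens of millions of servers pose a major challenge for O&M. At present, a server system's firmware program is usually stored in a Flash (flash memory) chip, and the system boots by reading the firmware program from that chip. The system sometimes fails to boot because of a problem with the Flash chip or the firmware program in it. Existing O&M practice can only report the boot failure to the O&M server; maintenance personnel must then repair the Flash chip, which makes O&M slow, inefficient, and burdensome.
How to solve the above technical problem is therefore a problem that those skilled in the art currently need to address.
Summary
The purpose of this application is to provide a server firmware self-recovery system and a server that automatically perform abnormality repair on a storage module with a startup abnormality, thereby reducing O&M time, improving O&M efficiency, and easing the O&M workload.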
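To solve the above technical problem, this application provides a server firmware self-recovery system, comprising:
a storage module, configured to store a system firmware program;
a boot controller, configured to read the system firmware program from the storage module to boot the system when the boot controller is connected to the storage module;
a repair controller, configured to automatically perform abnormality repair on the storage module when the storage module has a startup abnormality and is connected to the repair controller;
a logic controller connected to the storage module, the boot controller, and the repair controller respectively, configured to connect the storage module to the boot controller by default in the initial state and, if it detects that the boot controller has failed to boot the system, to determine that the storage module has a startup abnormality and switch the storage module over to the repair controller.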
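Preferably, the boot controller contains an ME, and the system firmware program includes an ME firmware program;
correspondingly, the boot controller is specifically configured to, when connected to the storage module, read the ME firmware program from the storage module so that the ME runs and, after receiving a power button signal, use the ME to generate a power-on status signal;
the logic controller is specifically configured to receive the system's power button signal and send it to the boot controller, and to determine whether the power-on status signal returned by the boot controller is received within a preset time; if so, it powers on the system hardware to boot the system; if not, it determines that the storage module has a startup abnormality and switches the storage module over to the repair controller.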
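Preferably, the logic controller comprises:
a detection module connected to the boot controller, configured to send a status-signal timeout result to a control module if, after the power button signal has been sent to the boot controller, the power-on status signal returned by the boot controller is not received within the preset time;
a gating module whose first gating end is connected to the storage module, whose first transmission end is connected to the boot controller, and whose second transmission end is connected to the repair controller;
a second communication module connected to a first communication module in the repair controller;
a status register connected to the second communication module;
a control module connected to the detection module, the second communication module, the status register, and the gating module respectively, configured to control the gating module to connect the first gating end to the first transmission end by default in the initial state and, after receiving the status-signal timeout result, to determine that the storage module has a startup abnormality and record the abnormality and its cause in the status register;
correspondingly, the repair controller is further configured to poll the status register through the communication modules and, on finding a startup abnormality of the storage module, to communicate with the control module so that the gating module switches the first gating end over to the second transmission end, and then to automatically repair the storage module based on the recorded cause of the abnormality.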
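Preferably, the logic controller further comprises:
a state memory module connected to the control module, configured to store the most recent gating state of the gating module, where the initial default gating state held by the state memory module is: the first gating end connected to the first transmission end;
the control module is further configured to, after the system is AC powered and the control module starts working, read the gating module's previous gating state from the state memory module and control the gating module to hold that state.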
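Preferably, the storage module comprises a main storage module and a backup storage module, each storing the system firmware program;
correspondingly, the logic controller is specifically configured to connect the main storage module to the boot controller by default in the initial state; if it detects that the boot controller has failed to boot the system, it determines that the main storage module has a startup abnormality, switches the main storage module over to the repair controller, and connects the backup storage module to the boot controller so that the boot controller reboots the system.
Preferably, the logic controller is further configured to determine that the backup storage module has a startup abnormality if, after the backup storage module has been connected to the boot controller, it detects that the boot controller has failed to reboot the system;
correspondingly, the repair controller is specifically configured to automatically repair the main storage module when the main storage module has a startup abnormality and the backup storage module works normally, and to repair the main storage module and the backup storage module in sequence when both have startup abnormalities.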
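Preferably, the process of repairing the main storage module comprises:
determining whether the supply voltage of the main storage module is normal;
if the supply voltage is abnormal, determining that the peripheral circuit of the main storage module is abnormal;
if the supply voltage is normal, determining that the peripheral circuit of the main storage module is normal, and determining whether the main storage module can be accessed normally;
if it cannot be accessed normally, determining that the main storage module is damaged;
if it can be accessed normally, determining that the main storage module is normal, and determining whether the ME firmware program in the main storage module can be read;
if it cannot be read, determining that the ME firmware program is missing from the main storage module, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and switching the main storage module back to the boot controller;
if it can be read, verifying the ME firmware program in the main storage module; if verification fails, determining that the ME firmware program in the main storage module is corrupted, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and switching the main storage module back to the boot controller;
if the main storage module still has a startup abnormality, determining that the ME itself is faulty.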
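Preferably, the process of repairing the main storage module and the backup storage module in sequence comprises:
determining whether the supply voltage of the main storage module is normal;
if the supply voltage is abnormal, determining that the peripheral circuit of the main storage module is abnormal;
if the supply voltage is normal, determining that the peripheral circuit of the main storage module is normal, and determining whether the main storage module can be accessed normally;
if it cannot be accessed normally, determining that the main storage module is damaged;
if it can be accessed normally, determining that the main storage module is normal, and determining whether the ME firmware program in the main storage module can be read;
if it cannot be read, determining that the ME firmware program is missing from the main storage module, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, after flashing completes directing the logic controller to switch the main storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, re-flashing the backed-up ME firmware program directly to the backup storage module, after flashing completes directing the logic controller to switch the backup storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, determining that the main storage module and the backup storage module have been repaired;
if it can be read, verifying the ME firmware program in the main storage module; if verification fails, determining that the ME firmware program in the main storage module is corrupted, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, after flashing completes directing the logic controller to switch the main storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, re-flashing the backed-up ME firmware program directly to the backup storage module, after flashing completes directing the logic controller to switch the backup storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, determining that the main storage module and the backup storage module have been repaired;
if the system reboots abnormally, determining that the ME itself is faulty.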
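Preferably, the system firmware program further includes a BIOS firmware program, and the boot controller is connected to the repair controller;
the boot controller is further configured to read the BIOS firmware program from the storage module to start the BIOS after system power-on completes;
the repair controller is further configured to check for BIOS startup abnormalities while the boot controller starts the BIOS, and to automatically perform BIOS firmware repair on the storage module when the BIOS fails to start because of a problem with the BIOS firmware program.
Preferably, the boot controller is a CPU, the repair controller is a BMC, and the logic controller is a CPLD.
To solve the above technical problem, this application further provides a server comprising any of the server firmware self-recovery systems described above.
This application provides a server firmware self-recovery system comprising a storage module, a boot controller, a repair controller, and a logic controller. The storage module stores a system firmware program. The boot controller, when connected to the storage module, reads the system firmware program from the storage module to boot the system. The repair controller, when the storage module has a startup abnormality and is connected to the repair controller, automatically repairs the storage module. The logic controller connects the storage module to the boot controller by default in the initial state; if it detects that the boot controller has failed to boot the system, it determines that the storage module has a startup abnormality and switches the storage module over to the repair controller. This application can thus automatically repair a storage module that has a startup abnormality, reducing O&M time, improving O&M efficiency, and easing the O&M workload.
This application further provides a server, which has the same beneficial effects as the firmware self-recovery system described above.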
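Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for the prior art and the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic structural diagram of a server firmware self-recovery system provided by an embodiment of this application;
Figure 2 is a detailed schematic structural diagram of a server firmware self-recovery system provided by an embodiment of this application;
Figure 3 is a schematic diagram of the first connection configuration of the gating module provided by an embodiment of this application;
Figure 4 is a schematic diagram of the second connection configuration of the gating module provided by an embodiment of this application;
Figure 5 is a schematic diagram of the third connection configuration of the gating module provided by an embodiment of this application;
Figure 6 is a schematic diagram of the fourth connection configuration of the gating module provided by an embodiment of this application;
Figure 7 is a schematic diagram of the fifth connection configuration of the gating module provided by an embodiment of this application;
Figure 8 is a schematic diagram of the sixth connection configuration of the gating module provided by an embodiment of this application.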
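Detailed Description
The core of this application is to provide a server firmware self-recovery system and a server that automatically perform abnormality repair on a storage module with a startup abnormality, thereby reducing O&M time, improving O&M efficiency, and easing the O&M workload.
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of this application.
Refer to Figure 1, a schematic structural diagram of a server firmware self-recovery system provided by an embodiment of this application.
The server firmware self-recovery system comprises:
a storage module 1, configured to store a system firmware program;
a boot controller 2, configured to read the system firmware program from the storage module 1 to boot the system when the boot controller 2 is connected to the storage module 1;
a repair controller 3, configured to automatically perform abnormality repair on the storage module 1 when the storage module 1 has a startup abnormality and is connected to the repair controller 3;
a logic controller 4 connected to the storage module 1, the boot controller 2, and the repair controller 3 respectively, configured to connect the storage module 1 to the boot controller 2 by default in the initial state and, if it detects that the boot controller 2 has failed to boot the system, to determine that the storage module 1 has a startup abnormality and switch the storage module 1 over to the repair controller 3.
Specifically, the server firmware self-recovery system of this application comprises the storage module 1, the boot controller 2, the repair controller 3, and the logic controller 4, and works as follows:
The storage module 1 stores the system firmware program. In the initial state, the logic controller 4 connects the storage module 1 to the boot controller 2 by default. The boot controller 2, when connected to the storage module 1, reads the system firmware program from it to boot the system. The logic controller 4 monitors the boot attempt; if it detects that the boot controller 2 has failed to boot the system, it determines that the storage module 1 has a startup abnormality and switches the storage module 1 over to the repair controller 3. The repair controller 3, once the storage module 1 has a startup abnormality and is connected to it, automatically repairs the storage module 1.
This application can thus automatically repair a storage module 1 that has a startup abnormality, reducing O&M time, improving O&M efficiency, and easing the O&M workload.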
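On the basis of the above embodiment:
Refer to Figure 2, a detailed schematic structural diagram of a server firmware self-recovery system provided by an embodiment of this application.
As an optional embodiment, the boot controller 2 contains an ME, and the system firmware program includes an ME firmware program;
correspondingly, the boot controller 2 is specifically configured to, when connected to the storage module 1, read the ME firmware program from the storage module 1 so that the ME runs and, after receiving a power button signal, use the ME to generate a power-on status signal;
the logic controller 4 is specifically configured to receive the system's power button signal and send it to the boot controller 2, and to determine whether the power-on status signal returned by the boot controller 2 is received within a preset time; if so, it powers on the system hardware to boot the system; if not, it determines that the storage module 1 has a startup abnormality and switches the storage module 1 over to the repair controller 3.
Specifically, the boot controller 2 of this application contains an ME (Management Engine), and the system firmware program includes an ME firmware program. On this basis, the boot controller 2 and the logic controller 4 work as follows:
When the server's power button is pressed, or the server receives a remotely issued power-on command, the level of the power button signal changes and the signal is sent to the logic controller 4. After receiving it, the logic controller 4 forwards the power button signal to the boot controller 2. The boot controller 2, when connected to the storage module 1, reads the ME firmware program from the storage module 1 so that the ME runs and, after receiving the power button signal, uses the ME to generate the power-on status signal (S3 status signal).
If the ME of the boot controller 2 runs abnormally (equivalent to a startup abnormality of the storage module 1), the boot controller 2 cannot issue the power-on status signal; the visible symptom is that the system does not power on, that is, the boot controller 2 fails to boot the system. The logic controller 4 therefore starts a timer when it sends the power button signal to the boot controller 2 and checks, when the timer reaches a preset time T0 (for example 5 s), whether the boot controller 2 has issued the power-on status signal. If the signal was issued within T0, the ME of the boot controller 2 is determined to be running normally. If it was not, the ME is determined to be running abnormally, the storage module 1 is determined to have a startup abnormality, and the storage module 1 is switched over to the repair controller 3 so that the repair controller 3 repairs it automatically.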
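The timeout check described above can be sketched as follows. This is only an illustration of the decision, not the CPLD logic itself: the function names, the action strings, and the use of plain timestamps in seconds are assumptions, and T0 = 5 s is simply the example value given in the text.

```python
from typing import Optional

T0_SECONDS = 5.0  # preset time T0; 5 s is the example value from the text


def me_started_in_time(sent_at: float, status_at: Optional[float],
                       t0: float = T0_SECONDS) -> bool:
    """Return True if the power-on status signal arrived within t0 seconds
    of the power button signal being forwarded to the boot controller.

    status_at is None when no status signal was observed at all.
    """
    return status_at is not None and (status_at - sent_at) <= t0


def next_action(sent_at: float, status_at: Optional[float]) -> str:
    """Map the timing result to the action the logic controller takes
    (hypothetical action names)."""
    if me_started_in_time(sent_at, status_at):
        return "power_on_system_hardware"
    # Timeout: the storage module is treated as having a startup abnormality.
    return "switch_storage_to_repair_controller"
```

A missing signal and a late signal are deliberately treated the same way, since the text only distinguishes between receiving the signal within T0 and not receiving it.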
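As an optional embodiment, the logic controller 4 comprises:
a detection module 41 connected to the boot controller 2, configured to send a status-signal timeout result to a control module 45 if, after the power button signal has been sent to the boot controller 2, the power-on status signal returned by the boot controller 2 is not received within the preset time;
a gating module 42 whose first gating end is connected to the storage module 1, whose first transmission end is connected to the boot controller 2, and whose second transmission end is connected to the repair controller 3;
a second communication module 43 connected to a first communication module in the repair controller 3;
a status register 44 connected to the second communication module 43;
a control module 45 connected to the detection module 41, the second communication module 43, the status register 44, and the gating module 42 respectively, configured to control the gating module 42 to connect the first gating end to the first transmission end by default in the initial state and, after receiving the status-signal timeout result, to determine that the storage module 1 has a startup abnormality and record the abnormality and its cause in the status register 44;
correspondingly, the repair controller 3 is further configured to poll the status register 44 through the communication modules and, on finding a startup abnormality of the storage module 1, to communicate with the control module 45 so that the gating module 42 switches the first gating end over to the second transmission end, and then to automatically repair the storage module 1 based on the recorded cause of the abnormality.
Specifically, the logic controller 4 of this application comprises the detection module 41, the gating module 42, the second communication module 43, the status register 44, and the control module 45, and works as follows:
In the initial state, the control module 45 controls the gating module 42 to connect the first gating end to the first transmission end by default, that is, the storage module 1 is connected to the boot controller 2 by default. When the server's power button is pressed, or the server receives a remotely issued power-on command, the level of the power button signal changes and the signal is sent to the control module 45 of the logic controller 4. The control module 45 sends the power button signal to the boot controller 2 via the detection module 41. The boot controller 2, when connected to the storage module 1, reads the ME firmware program from the storage module 1 so that the ME runs and, after receiving the power button signal, uses the ME to generate the power-on status signal.
The detection module 41 starts a timer when the power button signal is sent to the boot controller 2 and checks, when the timer reaches the preset time T0, whether the boot controller 2 has issued the power-on status signal. If the signal was issued within T0, the power-on status signal is determined not to have timed out. If it was not, the power-on status signal is determined to have timed out, and a status-signal timeout result is sent to the control module 45.
After receiving the timeout result, the control module 45 determines that the storage module 1 has a startup abnormality and records the abnormality and its cause (such as status-signal timeout) in the status register 44. The repair controller 3 can poll the status register 44 through the communication modules (for example every 2 s). When it finds a startup abnormality of the storage module 1 in the status register 44, it communicates with the control module 45 so that the gating module 42 switches the first gating end over to the second transmission end, that is, the storage module 1 is switched over to the repair controller 3. The repair controller 3 then automatically repairs the storage module 1 based on the cause of the abnormality read from the status register 44.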
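As an optional embodiment, the logic controller 4 further comprises:
a state memory module 46 connected to the control module 45, configured to store the most recent gating state of the gating module 42, where the initial default gating state held by the state memory module 46 is: the first gating end connected to the first transmission end;
the control module 45 is further configured to, after the system is AC powered and the control module 45 starts working, read the gating module 42's previous gating state from the state memory module 46 and control the gating module 42 to hold that state.
Further, the logic controller 4 of this application also comprises the state memory module 46, which works as follows:
The state memory module 46 stores the most recent gating state of the gating module 42. The control module 45 starts working after the system is AC (alternating current) powered. It first reads the gating module 42's previous gating state from the state memory module 46 and then controls the gating module 42 to hold that state, restoring the last gating state the gating module 42 had before the system was powered down. Note that the initial default gating state held by the state memory module 46 is the first gating end connected to the first transmission end. In the initial state, the control module 45 therefore reads this default from the state memory module 46 and controls the gating module 42 to connect the first gating end to the first transmission end, that is, the storage module 1 is connected to the boot controller 2 by default.
In addition, the state memory module 46 may also be connected to the second communication module 43, so that the repair controller 3 can query the most recent gating state of the gating module 42 from the state memory module 46. In this application the state memory module 46 is a non-volatile memory and the status register 44 is a volatile memory.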
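The restore-on-power-up behaviour of the state memory module can be sketched in a few lines. The class and function names and the state labels are hypothetical; a real state memory module would be a non-volatile memory inside or beside the CPLD, which this in-memory stand-in only imitates.

```python
DEFAULT_GATING_STATE = "flash1_to_boot"  # first gating end -> first transmission end


class StateMemory:
    """Stand-in for the non-volatile state memory module 46."""

    def __init__(self) -> None:
        # Initial default: storage module connected to the boot controller.
        self._last = DEFAULT_GATING_STATE

    def save(self, state: str) -> None:
        """Remember the most recent gating state."""
        self._last = state

    def load(self) -> str:
        """Return the most recent gating state."""
        return self._last


def on_ac_power_up(memory: StateMemory) -> str:
    """Return the gating state the control module should hold after AC power-up."""
    return memory.load()
```

Because the default is stored in the memory itself, the first power-up and every later power-up follow the same code path.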
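As an optional embodiment, the storage module 1 comprises a main storage module 11 and a backup storage module 12, each storing the system firmware program;
correspondingly, the logic controller 4 is specifically configured to connect the main storage module 11 to the boot controller 2 by default in the initial state; if it detects that the boot controller 2 has failed to boot the system, it determines that the main storage module 11 has a startup abnormality, switches the main storage module 11 over to the repair controller 3, and connects the backup storage module 12 to the boot controller 2 so that the boot controller 2 reboots the system.
Specifically, this application has two storage modules, the main storage module 11 and the backup storage module 12, which work as follows:
In the initial state, the logic controller 4 connects the main storage module 11 to the boot controller 2 by default. The boot controller 2, when connected to the main storage module 11, reads the system firmware program from it to boot the system. The logic controller 4 monitors the boot attempt; if it detects that the boot controller 2 has failed to boot the system, it determines that the main storage module 11 has a startup abnormality, switches the main storage module 11 over to the repair controller 3 (for the detailed working principle, see the embodiments above), and connects the backup storage module 12 to the boot controller 2 so that the boot controller 2 reboots the system.
Note that the gating module 42 never gates two storage modules to the same controller at the same time, and never gates the same storage module to two controllers at the same time. Specifically, the gating module 42 has two gating ends and two transmission ends: the first gating end is connected to the main storage module 11 (Flash 1), the second gating end to the backup storage module 12 (Flash 2), the first transmission end to the boot controller 2, and the second transmission end to the repair controller 3. The gating module 42 therefore has six gating states (not counting the fully disconnected state). In Figure 3, the first gating end is connected to the first transmission end, so Flash 1 is connected to the boot controller. In Figure 4, the second gating end is connected to the second transmission end, so Flash 2 is connected to the repair controller. In Figure 5, the first gating end is connected to the first transmission end and the second gating end to the second transmission end, so Flash 1 is connected to the boot controller and Flash 2 to the repair controller. In Figure 6, the second gating end is connected to the first transmission end, so Flash 2 is connected to the boot controller. In Figure 7, the first gating end is connected to the second transmission end, so Flash 1 is connected to the repair controller. In Figure 8, the first gating end is connected to the second transmission end and the second gating end to the first transmission end, so Flash 1 is connected to the repair controller and Flash 2 to the boot controller. Both storage modules communicate with the controller they are connected to using the SPI (Serial Peripheral Interface) protocol.
On this basis, the gating state of the gating module 42 can be represented by a code, which is then used to control the gating module 42. For example: Flash 1 connected to the boot controller is code 001; Flash 2 connected to the repair controller is code 010; forward full connection is code 011; Flash 2 connected to the boot controller is code 101; Flash 1 connected to the repair controller is code 110; cross full connection is code 111.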
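The six example codes can be written down as a lookup table, together with a check of the exclusivity rule stated above. The code values come from the text; the dictionary layout and the names `GATING_STATES`, `routes_for`, and `is_exclusive` are assumptions made for this sketch.

```python
# Example codes from the text -> which flash each controller sees.
GATING_STATES = {
    0b001: {"boot": "flash1"},
    0b010: {"repair": "flash2"},
    0b011: {"boot": "flash1", "repair": "flash2"},   # forward full connection
    0b101: {"boot": "flash2"},
    0b110: {"repair": "flash1"},
    0b111: {"boot": "flash2", "repair": "flash1"},   # cross full connection
}


def routes_for(code: int) -> dict:
    """Look up the controller -> flash routing for a gating code."""
    if code not in GATING_STATES:
        raise ValueError(f"unsupported gating code: {code:03b}")
    return GATING_STATES[code]


def is_exclusive(routes: dict) -> bool:
    """True if no flash is routed to more than one controller.

    The dict keys already guarantee each controller sees at most one flash.
    """
    flashes = list(routes.values())
    return len(flashes) == len(set(flashes))
```

In these example codes the lowest bit appears to mark a connection to the boot controller, the middle bit a connection to the repair controller, and the highest bit the crossed routing. The source does not state this, so it is only an observation about the listed values.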
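As an optional embodiment, the logic controller 4 is further configured to determine that the backup storage module 12 has a startup abnormality if, after connecting the backup storage module 12 to the boot controller 2, it detects that the boot controller 2 has failed to reboot the system;
correspondingly, the repair controller 3 is specifically configured to automatically repair the main storage module 11 when the main storage module 11 has a startup abnormality and the backup storage module 12 works normally, and to repair the main storage module 11 and the backup storage module 12 in sequence when both have startup abnormalities.
Further, after the logic controller 4 connects the backup storage module 12 to the boot controller 2 so that the boot controller 2 reboots the system, if it detects that the reboot has failed, it determines that the backup storage module 12 also has a startup abnormality. The repair controller 3 repairs whichever storage module has a startup abnormality.
Based on the above embodiments, and more specifically: in the initial state, the control module 45 connects the main storage module 11 to the boot controller 2 by default. The control module 45 sends the power button signal to the boot controller 2 via the detection module 41. The boot controller 2, when connected to the main storage module 11, reads the ME firmware program from it so that the ME runs and, after receiving the power button signal, uses the ME to generate the power-on status signal. The detection module 41 determines whether the boot controller 2 issues the power-on status signal within the preset time; if not, it determines that the signal has timed out and sends a status-signal timeout result to the control module 45. After receiving the timeout result, the control module 45 determines that the main storage module 11 has a startup abnormality and records the abnormality and its cause (such as status-signal timeout) in the status register 44. The control module 45 then checks the status register 44 for an abnormality record on the other storage module. If there is none, it controls the gating module 42 to connect the other storage module to the boot controller 2 and reboot the system; if there is one, both storage modules have startup abnormalities.
If the repair controller 3 is working normally, it polls the status register 44 of the logic controller 4. When it finds a storage module startup abnormality in the status register 44, it reads the recorded cause; if the cause is a status-signal timeout, it performs fault location and attempts a repair. If only the main storage module 11 is recorded as abnormal, the backup storage module 12 is currently working normally. To locate and repair the problem with the main storage module 11 without affecting system operation, the repair controller 3 switches the gating module 42 from the mode in which the backup storage module 12 is connected to the boot controller 2 to the reverse full connection mode (backup storage module 12 connected to the boot controller 2, main storage module 11 connected to the repair controller 3; the connection between the backup storage module 12 and the boot controller 2 is not interrupted during the switch, so communication is unaffected), and then repairs the main storage module 11. After the repair, the system is not rebooted, to avoid disrupting services. If the status register 44 shows that both storage modules have startup abnormalities, the repair controller 3 repairs them one by one; if the problem is in the firmware, it attempts a reboot after the repair to verify that the repair succeeded.
If the repair controller 3 is not working normally, the logic controller 4 takes over: it retries each storage module in turn several times (for example, 3 times) in an attempt to power on.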
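The control module's reaction to a status-signal timeout, as described above, amounts to a small decision over the status register. The sketch below models the register as a dictionary; the module names, cause string, and return values are hypothetical labels chosen for illustration.

```python
def on_status_timeout(status_register: dict, active: str) -> str:
    """Record a timeout for the active module and decide what to do next.

    status_register maps a module name ("main" or "backup") to the recorded
    cause of its startup abnormality; a missing key means no abnormality.
    """
    status_register[active] = "status_signal_timeout"
    other = "backup" if active == "main" else "main"
    if other not in status_register:
        return f"reboot_from_{other}"      # fail over to the healthy module
    return "both_modules_abnormal"         # leave both to the repair path


def modules_needing_repair(status_register: dict) -> list:
    """What the repair controller would find when polling the register."""
    return sorted(status_register)
```

For example, starting from an empty register, a timeout on the main module leads to a reboot from the backup module, and a second timeout on the backup module reports that both modules are abnormal.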
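As an optional embodiment, the process of repairing the main storage module comprises:
determining whether the supply voltage of the main storage module is normal;
if the supply voltage is abnormal, determining that the peripheral circuit of the main storage module is abnormal;
if the supply voltage is normal, determining that the peripheral circuit of the main storage module is normal, and determining whether the main storage module can be accessed normally;
if it cannot be accessed normally, determining that the main storage module is damaged;
if it can be accessed normally, determining that the main storage module is normal, and determining whether the ME firmware program in the main storage module can be read;
if it cannot be read, determining that the ME firmware program is missing from the main storage module, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and switching the main storage module back to the boot controller;
if it can be read, verifying the ME firmware program in the main storage module; if verification fails, determining that the ME firmware program in the main storage module is corrupted, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and switching the main storage module back to the boot controller;
if the main storage module still has a startup abnormality, determining that the ME itself is faulty.
Specifically, abnormal operation of the boot controller's ME (equivalent to a storage module startup abnormality) may be caused by an abnormal storage module, an abnormal peripheral circuit of the storage module, an ME firmware program missing from the storage module, a corrupted ME firmware program in the storage module, a fault in the ME itself, and so on. In each case the boot controller cannot issue the power-on status signal, so the system cannot enter power-on mode. On this basis, the process of repairing a main storage module that has a startup abnormality is as follows. 1) Read whether the supply voltage of the main storage module is normal. If it is, the peripheral circuit of the main storage module is determined to be normal; if it is not, the peripheral circuit is determined to be abnormal, the fault is classified as a server mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard (firmware cannot be recovered automatically from a hardware fault). 2) If the supply voltage is normal, access the main storage module. If access succeeds, the main storage module is determined to be normal; if it cannot be accessed, the main storage module is determined to be damaged, the fault is classified as a mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard. 3) If the main storage module is normal, determine whether the ME firmware program in it can be read. If it cannot, record the cause of this power-on abnormality as a missing ME firmware program, re-flash the backed-up ME firmware program to the main storage module for automatic repair, and write to the status register of the logic controller that the main storage module has been repaired. The logic controller clears the abnormality records of the main storage module and switches the main storage module back to the boot controller. No reboot is performed, to avoid disrupting services; the event is reported to the O&M server, and O&M personnel decide whether to return to booting from the main storage module at the next reboot. 4) If the ME firmware program can be read, verify it: compute its check value and compare it with the check value of the ME firmware program backed up by the system. If the values match, the comparison passes and the ME firmware program in the main storage module is judged intact. If they differ, the comparison fails and the ME firmware program is judged corrupted; the backed-up ME firmware program is re-flashed to the main storage module for automatic repair, and the status register of the logic controller is updated to show that the main storage module has been repaired. The logic controller clears the abnormality records of the main storage module and switches the main storage module back to the boot controller. No reboot is performed, to avoid disrupting services; the event is reported to the O&M server, and O&M personnel decide whether to return to booting from the main storage module at the next reboot. 5) If the main storage module still has a startup abnormality, the fault is classified as a fault in the ME itself, and an alarm is reported to replace the relevant ME components and/or the mainboard.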
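The diagnosis order in steps 1) to 5) can be sketched as a single function that returns a diagnosis and the follow-up action. Several things here are assumptions rather than details from the source: the function name, the label strings, the use of SHA-256 as the check value (the text only says a check value is computed and compared), and the simplification that an intact image which still fails to start is attributed directly to the ME.

```python
import hashlib
from typing import Optional


def diagnose_main_storage(voltage_ok: bool, accessible: bool,
                          firmware: Optional[bytes], backup: bytes) -> tuple:
    """Return (diagnosis, action) for a main storage module that failed to start.

    firmware is None when no ME firmware can be read from the module.
    """
    if not voltage_ok:
        return ("peripheral_circuit_abnormal", "alarm_replace_mainboard")
    if not accessible:
        return ("storage_module_damaged", "alarm_replace_mainboard")
    if firmware is None:
        return ("me_firmware_missing", "reflash_from_backup")
    # SHA-256 is an assumption; the text only says "check value".
    if hashlib.sha256(firmware).digest() != hashlib.sha256(backup).digest():
        return ("me_firmware_corrupted", "reflash_from_backup")
    # Firmware matches the backup yet startup failed: the ME is the suspect.
    return ("me_fault", "alarm_replace_me_or_mainboard")
```

Checking hardware conditions before touching the firmware mirrors the text's point that a hardware fault cannot be fixed by re-flashing.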
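As an optional embodiment, the process of repairing the main storage module and the backup storage module in sequence comprises:
determining whether the supply voltage of the main storage module is normal;
if the supply voltage is abnormal, determining that the peripheral circuit of the main storage module is abnormal;
if the supply voltage is normal, determining that the peripheral circuit of the main storage module is normal, and determining whether the main storage module can be accessed normally;
if it cannot be accessed normally, determining that the main storage module is damaged;
if it can be accessed normally, determining that the main storage module is normal, and determining whether the ME firmware program in the main storage module can be read;
if it cannot be read, determining that the ME firmware program is missing from the main storage module, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, after flashing completes directing the logic controller to switch the main storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, re-flashing the backed-up ME firmware program directly to the backup storage module, after flashing completes directing the logic controller to switch the backup storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, determining that the main storage module and the backup storage module have been repaired;
if it can be read, verifying the ME firmware program in the main storage module; if verification fails, determining that the ME firmware program in the main storage module is corrupted, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, after flashing completes directing the logic controller to switch the main storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, re-flashing the backed-up ME firmware program directly to the backup storage module, after flashing completes directing the logic controller to switch the backup storage module back to the boot controller, directing the boot controller to reboot the system, and determining whether the system reboots normally; if it does, determining that the main storage module and the backup storage module have been repaired;
if the system reboots abnormally, determining that the ME itself is faulty.
Specifically, the process of repairing the main storage module and the backup storage module in sequence is as follows. 1) Read whether the supply voltage of the main storage module is normal. If it is, the peripheral circuit of the main storage module is determined to be normal; if it is not, the peripheral circuit is determined to be abnormal, the fault is classified as a server mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard. 2) If the supply voltage is normal, access the main storage module. If access succeeds, the main storage module is determined to be normal; if it cannot be accessed, the main storage module is determined to be damaged, the fault is classified as a mainboard fault, and an alarm is reported that explicitly calls for replacing the mainboard. 3) If the main storage module is normal, determine whether the ME firmware program in it can be read. If it cannot, record the cause of this power-on abnormality as a missing ME firmware program and re-flash the backed-up ME firmware program to the main storage module for automatic repair. After flashing completes, direct the logic controller to switch the main storage module back to the boot controller, direct the boot controller to reboot the system, and determine whether the system reboots normally. If it does, re-flash the backed-up ME firmware program directly to the backup storage module; after flashing completes, direct the logic controller to switch the backup storage module back to the boot controller, direct the boot controller to reboot the system, and determine whether the system reboots normally. If it does, the main storage module and the backup storage module are determined to have been repaired, and this is written to the status register of the logic controller. The logic controller clears the abnormality records of both storage modules, switches the main storage module back to the boot controller, reboots the system, and reports to the O&M server, completing automatic firmware recovery. 4) If the ME firmware program can be read, verify it: compute its check value and compare it with the check value of the ME firmware program backed up by the system. If the values match, the comparison passes and the ME firmware program in the main storage module is judged intact. If they differ, the comparison fails and the ME firmware program is judged corrupted; the backed-up ME firmware program is re-flashed to the main storage module for automatic repair. After flashing completes, direct the logic controller to switch the main storage module back to the boot controller, direct the boot controller to reboot the system, and determine whether the system reboots normally. If it does, re-flash the backed-up ME firmware program directly to the backup storage module; after flashing completes, direct the logic controller to switch the backup storage module back to the boot controller, direct the boot controller to reboot the system, and determine whether the system reboots normally. If it does, the main storage module and the backup storage module are determined to have been repaired, and this is written to the status register of the logic controller. The logic controller clears the abnormality records of both storage modules, switches the main storage module back to the boot controller, reboots the system, and reports to the O&M server, completing automatic firmware recovery. 5) If the system reboots abnormally, the fault is classified as a fault in the ME itself, and an alarm is reported to replace the relevant ME components and/or the mainboard.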
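The flash-then-verify-by-reboot sequence for the two modules can be sketched as a short loop. This covers only the re-flash and reboot checks of steps 3) to 5); the supply-voltage and access checks, the status register update, and the final switch back to the main module are left out. The callables `flash` and `reboot_ok` and the return labels are hypothetical stand-ins for the repair controller's real operations.

```python
from typing import Callable


def repair_both(flash: Callable[[str], None],
                reboot_ok: Callable[[str], bool]) -> str:
    """Re-flash main, verify by reboot, then do the same for backup.

    flash(module) writes the backed-up ME firmware to the module.
    reboot_ok(module) switches the module to the boot controller, reboots,
    and reports whether the system came up normally.
    """
    for module in ("main", "backup"):
        flash(module)
        if not reboot_ok(module):
            return "me_fault"          # reboot still abnormal after re-flash
    return "both_repaired"
```

The backup module is only re-flashed once the main module has been shown to boot, which matches the order given in the text.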
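As an optional embodiment, the system firmware program further includes a BIOS firmware program, and the boot controller 2 is connected to the repair controller 3;
the boot controller 2 is further configured to read the BIOS firmware program from the storage module to start the BIOS after system power-on completes;
the repair controller 3 is further configured to check for BIOS startup abnormalities while the boot controller 2 starts the BIOS, and to automatically perform BIOS firmware repair on the storage module when the BIOS fails to start because of a problem with the BIOS firmware program.
Further, the system firmware program of this application also includes a BIOS (Basic Input Output System) firmware program, and the boot controller 2 is also connected to the repair controller 3. The working principle is:
The repair controller 3 checks for BIOS startup abnormalities while the boot controller 2 starts the BIOS and, when the BIOS fails to start because of a problem with the BIOS firmware program, automatically performs BIOS firmware repair on the storage module (the principle is similar to that of ME firmware repair and is not repeated here).
As an optional embodiment, the boot controller 2 is a CPU, the repair controller 3 is a BMC, and the logic controller 4 is a CPLD.
Specifically, older server versions include a PCH (Platform Controller Hub, integrated south bridge), with the ME integrated in the PCH. Newer server versions remove the PCH and integrate its functions into the CPU (central processing unit), so the ME is also in the CPU.
In this application, the boot controller 2 is a CPU, the repair controller 3 is a BMC (Baseboard Management Controller), and the logic controller 4 is a CPLD (Complex Programmable Logic Device).
In addition, this application can display on a web (World Wide Web) page the current gating state of the gating module, which storage module the system is currently booting from, the status of the two storage modules, and so on, and lets the user manually change which storage module is configured for booting and manually switch the gating state of the gating module. Through the BMC, the user can also remotely issue API (Application Programming Interface) commands that query the current gating state of the gating module, which storage module the system is currently booting from, and the status of the two storage modules, as well as API commands that manually change which storage module is configured for booting and manually switch the gating state of the gating module.
This application further provides a server comprising any of the server firmware self-recovery systems described above.
For a description of the server provided by this application, refer to the embodiments of the firmware self-recovery system above; it is not repeated here.
It should also be noted that in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.
The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. This application is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.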

Claims (11)

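  1. A server firmware self-recovery system, characterized in that it comprises:
    a storage module, configured to store a system firmware program;
    a boot controller, configured to read the system firmware program from the storage module to boot the system when the boot controller is connected to the storage module;
    a repair controller, configured to automatically perform abnormality repair on the storage module when the storage module has a startup abnormality and is connected to the repair controller;
    a logic controller connected to the storage module, the boot controller, and the repair controller respectively, configured to connect the storage module to the boot controller by default in the initial state and, if it detects that the boot controller has failed to boot the system, to determine that the storage module has a startup abnormality and switch the storage module over to the repair controller.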
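  2. The server firmware self-recovery system according to claim 1, characterized in that the boot controller contains an ME, and the system firmware program includes an ME firmware program;
    correspondingly, the boot controller is specifically configured to, when connected to the storage module, read the ME firmware program from the storage module so that the ME runs and, after receiving a power button signal, use the ME to generate a power-on status signal;
    the logic controller is specifically configured to receive the system's power button signal and send it to the boot controller, and to determine whether the power-on status signal returned by the boot controller is received within a preset time; if so, it powers on the system hardware to boot the system; if not, it determines that the storage module has a startup abnormality and switches the storage module over to the repair controller.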
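  3. The server firmware self-recovery system according to claim 2, characterized in that the logic controller comprises:
    a detection module connected to the startup controller, configured to send a status signal timeout result to a control module if, after the power button signal has been sent to the startup controller, the power-on startup status signal returned by the startup controller is not received within the preset time;
    a gating module whose first gating terminal is connected to the storage module, whose first transmission terminal is connected to the startup controller, and whose second transmission terminal is connected to the repair controller;
    a second communication module connected to a first communication module in the repair controller;
    a status register connected to the second communication module;
    the control module, connected to the detection module, the second communication module, the status register and the gating module respectively, configured to control the gating module to connect the first gating terminal to the first transmission terminal by default in the initial condition; and, after receiving the status signal timeout result, to determine that the storage module has a startup abnormality and record the startup abnormality of the storage module and its cause in the status register;
    correspondingly, the repair controller is further configured to poll the status register through the communication modules and, upon finding a startup abnormality of the storage module, to control the gating module, through communication with the control module, to switch the first gating terminal to the second transmission terminal, so as to automatically perform abnormality repair processing on the storage module based on the queried cause of the startup abnormality.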
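  4. The server firmware self-recovery system according to claim 3, characterized in that the logic controller further comprises:
    a state memory module connected to the control module, configured to store the most recent gating state of the gating module; wherein the initial default gating state held by the state memory module is: the first gating terminal connected to the first transmission terminal;
    the control module is further configured to read the previous gating state of the gating module from the state memory module after the system is AC powered on and the control module itself starts working, and to control the gating module to keep the previous gating state.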
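  5. The server firmware self-recovery system according to any one of claims 2 to 4, characterized in that the storage module comprises a main storage module and a standby storage module, each storing the system firmware program;
    correspondingly, the logic controller is specifically configured to connect the main storage module to the startup controller for communication by default in the initial condition; and, if it detects that the startup controller has failed to start the system, to determine that the main storage module has a startup abnormality, switch the main storage module to a communication connection with the repair controller, and connect the standby storage module to the startup controller for communication so that the startup controller restarts the system.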
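  6. The server firmware self-recovery system according to claim 5, characterized in that the logic controller is further configured to determine that the standby storage module has a startup abnormality if, after connecting the standby storage module to the startup controller for communication, it detects that the startup controller has failed to restart the system;
    correspondingly, the repair controller is specifically configured to automatically perform abnormality repair processing on the main storage module when the main storage module has a startup abnormality and the standby storage module works normally; and to perform abnormality repair processing on the main storage module and the standby storage module in sequence when both the main storage module and the standby storage module have startup abnormalities.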
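  7. The server firmware self-recovery system according to claim 6, characterized in that the process of performing abnormality repair processing on the main storage module comprises:
    determining whether the supply voltage of the main storage module is normal;
    if the supply voltage is abnormal, determining that the peripheral circuitry of the main storage module is abnormal;
    if the supply voltage is normal, determining that the peripheral circuitry of the main storage module is normal, and determining whether the main storage module can be accessed normally;
    if it cannot be accessed normally, determining that the main storage module is damaged;
    if it can be accessed normally, determining that the main storage module is normal, and determining whether the ME firmware program in the main storage module can be read;
    if it cannot be read, determining that the main storage module lacks the ME firmware program, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and switching the main storage module back to re-establish a communication connection with the startup controller;
    if it can be read, verifying the ME firmware program in the main storage module, and, if the verification fails, determining that the ME firmware program in the main storage module is corrupted, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and switching the main storage module back to re-establish a communication connection with the startup controller;
    if the main storage module still has a startup abnormality, determining that the ME itself is faulty.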
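  8. The server firmware self-recovery system according to claim 6, characterized in that the process of performing abnormality repair processing on the main storage module and the standby storage module in sequence comprises:
    determining whether the supply voltage of the main storage module is normal;
    if the supply voltage is abnormal, determining that the peripheral circuitry of the main storage module is abnormal;
    if the supply voltage is normal, determining that the peripheral circuitry of the main storage module is normal, and determining whether the main storage module can be accessed normally;
    if it cannot be accessed normally, determining that the main storage module is damaged;
    if it can be accessed normally, determining that the main storage module is normal, and determining whether the ME firmware program in the main storage module can be read;
    if it cannot be read, determining that the main storage module lacks the ME firmware program, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and, after flashing completes, controlling the logic controller to switch the main storage module back to a communication connection with the startup controller, controlling the startup controller to restart the system, and determining whether the system can restart normally; if the system can restart normally, directly re-flashing the backed-up ME firmware program to the standby storage module, and, after flashing completes, controlling the logic controller to switch the standby storage module back to a communication connection with the startup controller, controlling the startup controller to restart the system, and determining whether the system can restart normally; if the system can restart normally, determining that the main storage module and the standby storage module have been repaired;
    if it can be read, verifying the ME firmware program in the main storage module, and, if the verification fails, determining that the ME firmware program in the main storage module is corrupted, re-flashing the backed-up ME firmware program to the main storage module for automatic repair, and, after flashing completes, controlling the logic controller to switch the main storage module back to a communication connection with the startup controller, controlling the startup controller to restart the system, and determining whether the system can restart normally; if the system can restart normally, directly re-flashing the backed-up ME firmware program to the standby storage module, and, after flashing completes, controlling the logic controller to switch the standby storage module back to a communication connection with the startup controller, controlling the startup controller to restart the system, and determining whether the system can restart normally; if the system can restart normally, determining that the main storage module and the standby storage module have been repaired;
    if the system restart is abnormal, determining that the ME itself is faulty.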
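  9. The server firmware self-recovery system according to claim 1, characterized in that the system firmware program further includes a BIOS firmware program, and the startup controller is connected to the repair controller;
    the startup controller is further configured to read the BIOS firmware program from the storage module to start the BIOS after system power-on is complete;
    the repair controller is further configured to judge BIOS startup abnormalities while the startup controller is starting the BIOS, and to automatically perform BIOS firmware program abnormality repair processing on the storage module when the BIOS starts abnormally because of a problem with the BIOS firmware program.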
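  10. The server firmware self-recovery system according to claim 1, characterized in that the startup controller is a CPU, the repair controller is a BMC, and the logic controller is a CPLD.
  11. A server, characterized by comprising the server firmware self-recovery system according to any one of claims 1 to 10.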
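The logic controller behaviour recited in claims 2 to 4 (timeout detection, status register, gating module, state memory) can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the claimed hardware: the class and function names are hypothetical, the 5-second timeout is an arbitrary stand-in for the "preset time", and the status register is modelled as a plain dictionary.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    STARTUP = "first gating terminal <-> first transmission terminal (startup controller)"
    REPAIR = "first gating terminal <-> second transmission terminal (repair controller)"


class LogicController:
    """Hypothetical software model of the logic controller's modules."""

    def __init__(self, saved_route: Route = Route.STARTUP, timeout_s: float = 5.0):
        self.timeout_s = timeout_s      # assumed value for the "preset time"
        self.saved_route = saved_route  # state memory module, default per claim 4
        self.route = saved_route        # after AC power-on, keep the previous gating state
        self.status_register = {}       # abnormality flag and cause

    def on_power_button(self, status_delay_s: Optional[float]) -> bool:
        """Detection module: status_delay_s is the time until the startup controller
        returned the power-on startup status signal, or None if it never did.
        Returns True when the system hardware may be powered on."""
        if status_delay_s is not None and status_delay_s <= self.timeout_s:
            return True
        # Control module: record the startup abnormality and its cause.
        self.status_register["startup_abnormal"] = True
        self.status_register["cause"] = "power-on startup status signal timed out"
        return False

    def set_route(self, route: Route) -> None:
        """Control module: drive the gating module and remember the new state."""
        self.route = route
        self.saved_route = route


def repair_controller_poll(logic: LogicController) -> Optional[str]:
    """Repair controller: poll the status register; on an abnormality, ask the
    control module to route the storage module to the repair controller."""
    if not logic.status_register.get("startup_abnormal"):
        return None
    logic.set_route(Route.REPAIR)
    return logic.status_register["cause"]


if __name__ == "__main__":
    logic = LogicController()
    logic.on_power_button(None)  # the status signal never arrives
    print(repair_controller_poll(logic), "|", logic.route.name)
```

Keeping the saved gating state separate from the live one mirrors the claim 4 idea that the gating state should survive an AC power cycle.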
PCT/CN2021/121423 2021-03-26 2021-09-28 Server firmware self-recovery system and server WO2022198973A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/024,809 US20230333621A1 (en) 2021-03-26 2021-09-28 Server firmware self-recovery system and server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110326283.9 2021-03-26
CN202110326283.9A CN113064757B (zh) 2021-03-26 2021-03-26 Server firmware self-recovery system and server

Publications (1)

Publication Number Publication Date
WO2022198973A1 true WO2022198973A1 (zh) 2022-09-29

Family

ID=76564071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121423 WO2022198973A1 (zh) 2021-03-26 2021-09-28 Server firmware self-recovery system and server

Country Status (3)

Country Link
US (1) US20230333621A1 (zh)
CN (1) CN113064757B (zh)
WO (1) WO2022198973A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
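CN115576747A (zh) * 2022-11-21 2023-01-06 苏州浪潮智能科技有限公司 Baseboard management controller firmware fault recovery method, system, device and medium
CN116302011A (zh) * 2023-05-24 2023-06-23 广东电网有限责任公司佛山供电局 Firmware upgrade method for cable monitoring equipment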

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
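CN113064757B (zh) * 2021-03-26 2023-02-28 山东英信计算机技术有限公司 Server firmware self-recovery system and server
CN113672306B (zh) * 2021-10-20 2022-02-18 苏州浪潮智能科技有限公司 Server component self-test abnormality recovery method, apparatus, system and medium
CN116501409B (zh) * 2023-04-27 2024-05-07 合芯科技(苏州)有限公司 Dual-Flash-based server startup method, computer device and storage medium
CN117880061B (zh) * 2024-03-11 2024-05-17 雅安数字经济运营有限公司 Data center operation and maintenance monitoring system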

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200933A (zh) * 2010-03-23 2011-09-28 深圳华北工控股份有限公司 Automatic system BIOS repair method based on dual SPI Flash
US8104031B2 (en) * 2007-01-30 2012-01-24 Fujitsu Limited Storage system, storage unit, and method for hot swapping of firmware
CN102722423A (zh) * 2011-03-29 2012-10-10 比亚迪股份有限公司 Portable terminal and self-repair method thereof
CN101916216B (zh) * 2010-09-08 2012-11-07 神州数码网络(北京)有限公司 Automatic repair apparatus and control method for bootrom in an embedded operating system
CN109992316A (zh) * 2019-04-10 2019-07-09 苏州浪潮智能科技有限公司 Dual-BIOS control system and control method, apparatus, device and medium thereof
CN113064757A (zh) * 2021-03-26 2021-07-02 山东英信计算机技术有限公司 Server firmware self-recovery system and server

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
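JP3961478B2 (ja) * 2002-12-27 2007-08-22 オムロン株式会社 Unit for programmable controller and automatic memory recovery method
CN103246579A (zh) * 2012-02-06 2013-08-14 鸿富锦精密工业(深圳)有限公司 Baseboard management controller system
CN103246583A (zh) * 2012-02-09 2013-08-14 鸿富锦精密工业(深圳)有限公司 Electronic device with BMC firmware repair function and repair method
CN108304299A (zh) * 2018-03-02 2018-07-20 郑州云海信息技术有限公司 Server power-on state monitoring system and method, computer memory and device
CN109446012A (zh) * 2018-11-01 2019-03-08 郑州云海信息技术有限公司 Power-on startup method, apparatus and device for a target component
CN111143132B (zh) * 2019-12-30 2022-06-10 山东英信计算机技术有限公司 BIOS recovery method, apparatus, device and readable storage medium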

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8104031B2 (en) * 2007-01-30 2012-01-24 Fujitsu Limited Storage system, storage unit, and method for hot swapping of firmware
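CN102200933A (zh) * 2010-03-23 2011-09-28 深圳华北工控股份有限公司 Automatic system BIOS repair method based on dual SPI Flash
CN101916216B (zh) * 2010-09-08 2012-11-07 神州数码网络(北京)有限公司 Automatic repair apparatus and control method for bootrom in an embedded operating system
CN102722423A (zh) * 2011-03-29 2012-10-10 比亚迪股份有限公司 Portable terminal and self-repair method thereof
CN109992316A (zh) * 2019-04-10 2019-07-09 苏州浪潮智能科技有限公司 Dual-BIOS control system and control method, apparatus, device and medium thereof
CN113064757A (zh) * 2021-03-26 2021-07-02 山东英信计算机技术有限公司 Server firmware self-recovery system and server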

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
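CN115576747A (zh) * 2022-11-21 2023-01-06 苏州浪潮智能科技有限公司 Baseboard management controller firmware fault recovery method, system, device and medium
CN116302011A (zh) * 2023-05-24 2023-06-23 广东电网有限责任公司佛山供电局 Firmware upgrade method for cable monitoring equipment
CN116302011B (zh) * 2023-05-24 2023-08-18 广东电网有限责任公司佛山供电局 Firmware upgrade method for cable monitoring equipment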

Also Published As

Publication number Publication date
CN113064757B (zh) 2023-02-28
US20230333621A1 (en) 2023-10-19
CN113064757A (zh) 2021-07-02

Similar Documents

Publication Publication Date Title
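WO2022198973A1 (zh) Server firmware self-recovery system and server
WO2022198972A1 (zh) Fault locating method, system and apparatus for the server startup process
US7941658B2 (en) Computer system and method for updating program code
CN111158599B (zh) Data writing method, apparatus, device and storage medium
JP5183542B2 (ja) Computer system and setting management method
CN113360347B (zh) Server and control method thereof
TW201520895A (zh) BIOS automatic recovery system and method
CN114116280B (zh) Interactive BMC self-recovery method, system, terminal and storage medium
CN111143132A (zh) BIOS recovery method, apparatus, device and readable storage medium
US11809295B2 (en) Node mode adjustment method for when storage cluster BBU fails and related component
TWI786871B (zh) Computer and system startup method
WO2020192669A1 (zh) Gas meter intelligent controller and firmware upgrade startup method thereof
US6363493B1 (en) Method and apparatus for automatically reintegrating a module into a computer system
CN111158963A (zh) Server firmware redundant startup method and server
CN110928726A (zh) Watchdog- and PXE-based embedded system self-recovery method and system
JPH07219860A (ja) Memory rewrite control method and control device
CN113835971A (zh) Method for monitoring abnormal lighting of a server backplane and related components
JP3231561B2 (ja) Backup memory control system
CN110865906B (zh) Motor initial position angle storage method, apparatus, vehicle and storage medium
JP6911591B2 (ja) Information processing device, control device, and control method of information processing device
JP2009025967A (ja) Backup system and method for duplexed firmware, and operating system
TW201604781A (zh) Circuit and method for writing basic input/output system code
CN113868181B (zh) Storage device PCIe link negotiation method, system, device and medium
CN112350888B (zh) Startup state detection system and method thereof
CN114218010B (zh) Data backup and recovery method, system, terminal device and storage medium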

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21932582

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21932582

Country of ref document: EP

Kind code of ref document: A1