WO2022037014A1 - Startup repair method for ARM server and related apparatus - Google Patents

Startup repair method for ARM server and related apparatus

Info

Publication number
WO2022037014A1
Authority
WO
WIPO (PCT)
Prior art keywords
value
error counter
startup
boot
preset threshold
Prior art date
Application number
PCT/CN2021/073359
Other languages
English (en)
Chinese (zh)
Inventor
孙秀强
黄家明
乔英良
李道童
王兵
李勋堂
张炳会
孙良勇
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2022037014A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1438Restarting or rejuvenating

Definitions

  • the present application relates to the technical field of servers, and in particular to a startup repair method for an ARM server, and also to a startup repair apparatus, device, and computer-readable storage medium for an ARM server.
  • the Quicksilver processor is a processor chip with 80 64-bit ARM (Advanced RISC Machines) processor cores, independently designed by Ampere using the ARMv8 architecture licensed from ARM.
  • The SCP (System Control Processor) firmware includes the SMpro (System Management Program) and PMpro (Power Management Program) microcontroller management programs. The SMpro microcontroller provides management of the entire system, including the secure boot processing mechanism, management of processor clocks and restarts, system boot, power failure detection, and error handling.
  • The PMpro microcontroller provides power management functions, including ACPI (Advanced Configuration and Power Interface) P-state (power state) control, dynamic regulation of voltage and frequency, dynamic power consumption evaluation, and over-temperature protection mechanisms.
  • The Quicksilver processor secure boot solution involves multiple modules, such as SMpro, PMpro, ATF (ARM Trusted Firmware), and UEFI (Unified Extensible Firmware Interface).
  • The SMpro boot program follows the TBBR (Trusted Board Boot Requirements) standard protocol and performs SLM (Source Library Maintenance) header security verification on the SEC phase and image files of SMpro; only after this verification succeeds can booting continue down to PMpro. PMpro likewise follows the TBBR protocol specification of the ARM platform and performs SLM header security verification on the key and content of PMpro. After the secure boot verification of the SCP firmware is completed, the ATF program can be started.
  • The ATF firmware includes BL1 (Boot Loader stage 1), BL2 (Boot Loader stage 2), BL31 (Boot Loader stage 3-1), BL32 (Boot Loader stage 3-2), BL33 (Boot Loader stage 3-3), and other stages. Each stage also follows the ARM platform TBBR protocol for security verification. When the above security verification is completed, the BL33 stage can jump to the UEFI (Unified Extensible Firmware Interface) firmware to boot normally. Incorrect settings of the processor core voltage, memory voltage, memory speed, or memory working mode will cause the system to fail to start normally, which poses a great challenge to the batch deployment and application maintenance of ARM servers.
  • the purpose of this application is to provide a startup repair method for an ARM server, which can automatically repair the wrong settings of the ARM server and ensure that the server can be started normally.
  • Another object of the present application is to provide a startup repair apparatus, device, and computer-readable storage medium for an ARM server, all of which have the above technical effects.
  • the present application provides a startup repair method for an ARM server, including:
  • the configuration information of the ARM processor is restored to the default setting, and after the configuration information of the ARM processor is restored to the default setting, the boot-up phase is restarted.
  • the condition for triggering judgment as to whether the value of the error counter reaches the preset threshold is that any one of the startup boot phases in the startup process of the ARM server fails to execute and the corresponding watchdog timer exceeds the preset value.
  • the boot-up phase includes:
  • the changing the value of the error counter according to a preset rule includes:
  • the value of the error counter is restored to the initial value.
  • restoring the value of the error counter to an initial value including:
  • the value of the error counter is restored to the initial value before the end of the UEFI phase.
  • the present application also provides a device for repairing an ARM server, including:
  • initialization module used to initialize the value of the error counter
  • a judgment module used for judging whether the value of the error counter reaches a preset threshold whenever any one of the startup boot stages in the ARM server startup process fails to execute;
  • a changing module configured to change the value of the error counter according to a preset rule and restart the boot-up phase if the preset threshold is not reached;
  • a restoration module configured to restore the configuration information of the ARM processor to a default setting if the preset threshold is reached, and restart the bootstrap after restoring the configuration information of the ARM processor to a default setting stage.
  • the present application also provides a boot repair device for an ARM server, including:
  • the processor is configured to implement the steps of the ARM server startup repair method according to any one of the above when executing the computer program.
  • the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the ARM server startup repair method described in any of the above are implemented.
  • the startup repair method for an ARM server includes: initializing the value of an error counter; whenever any one of the startup boot phases in the startup process of the ARM server fails to execute, judging whether the value of the error counter reaches a preset threshold; if the preset threshold is not reached, changing the value of the error counter according to the preset rule and restarting the boot-up phase; if the preset threshold is reached, restoring the configuration information of the ARM processor to the default settings and, after the configuration information has been restored to the default settings, restarting the boot-up phase.
  • the boot-repair apparatus, device and computer-readable storage medium for an ARM server provided by the present application all have the above technical effects.
  • FIG. 1 is a schematic flowchart of a startup repair method for an ARM server provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a startup repair device for an ARM server provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a startup repair device for an ARM server provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the core of the present application is to provide a startup repair method for an ARM server, which can automatically repair the wrong settings of the ARM server and ensure that the server can be started normally.
  • Another core of the present application is to provide a boot repair device, equipment and computer-readable storage medium for an ARM server, all of which have the above-mentioned technical effects.
  • FIG. 1 is a schematic flowchart of an ARM server startup repair method provided by an embodiment of the present application. Referring to FIG. 1, the method mainly includes:
  • the present application adds a counter, that is, an error counter, on the SMpro of the SCP firmware, and counts the number of restarts through the error counter.
  • the SMpro of the SCP firmware initializes the value of the error counter, such as initializing the value of the error counter to zero.
  • the booting phase includes a SMpro phase, a PMpro phase, an ATF phase, and a UEFI phase.
  • The SMpro stage, the PMpro stage, the ATF stage, and the UEFI stage are executed sequentially. Specifically, if the SMpro stage executes successfully, booting continues down to the PMpro stage. If the PMpro stage executes successfully, the ATF stage, which includes BL1, BL2, BL31, BL32, BL33 and other stages, is executed. If the ATF stage executes successfully, the BL33 stage jumps to the UEFI stage. If the UEFI stage executes successfully, the system is booted to the OS, that is, the operating system.
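The sequential hand-off between stages can be illustrated with a minimal Python sketch. It is only a model of the ordering described above, not firmware code: `run_boot_chain`, `STAGE_ORDER`, and the stage callables are hypothetical names introduced for illustration, and each callable simply reports whether its stage succeeded.

```python
from typing import Callable, Optional, Sequence, Tuple

# Stage order taken from the description. The ATF stage itself comprises
# BL1, BL2, BL31, BL32 and BL33, which are not modelled separately here.
STAGE_ORDER = ("SMpro", "PMpro", "ATF", "UEFI")

# A stage is a (name, callable) pair; the callable is a hypothetical
# stand-in that returns True when the stage executes successfully.
Stage = Tuple[str, Callable[[], bool]]


def run_boot_chain(stages: Sequence[Stage]) -> Optional[str]:
    """Run the stages in order and stop at the first failure.

    Returns the name of the failed stage, or None if every stage
    succeeded (i.e. the system would be handed over to the OS).
    """
    for name, execute in stages:
        if not execute():
            return name  # later stages are never reached
    return None
```

A return value of `None` corresponds to booting through to the OS; any other value names the stage at which the chain stopped.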
  • Normally, the above boot stages are executed in sequence. However, after the configuration information has been manually modified, one or more of the boot stages may fail to execute normally, so that the ARM server cannot start normally. Therefore, in the present application, whenever any one of the above boot stages fails to execute, the SMpro firmware determines whether the value of the error counter reaches a preset threshold. The preset threshold represents the maximum number of restarts; that is, the SMpro firmware determines whether the number of restarts has reached the preset threshold.
  • the preset threshold is 3.
  • the above-mentioned condition for triggering the judgment on whether the value of the error counter reaches the preset threshold is that any one of the startup boot phases in the startup process of the ARM server fails to execute and the corresponding watchdog timer exceeds the preset value. That is, whenever any one of the startup boot phases in the startup process of the ARM server fails and the corresponding watchdog timer exceeds the preset value, the SMpro firmware determines whether the value of the error counter reaches the preset threshold.
  • If the SMpro stage fails to execute and the watchdog timer of the SCP firmware exceeds the preset value, the SMpro firmware determines whether the value of the error counter reaches the preset threshold.
  • If the PMpro stage fails to execute and the watchdog timer of the SCP firmware exceeds the preset value, the SMpro firmware determines whether the value of the error counter reaches the preset threshold.
  • If the ATF stage fails to execute and the watchdog timer of the SCP firmware exceeds the preset value, the SMpro firmware determines whether the value of the error counter reaches the preset threshold.
  • If the UEFI stage fails to boot to the OS, so that the system cannot start normally, and the BMC's FRB-2 watchdog timer exceeds the preset value, the BMC informs the SMpro firmware that the UEFI stage failed to execute, and the SMpro firmware judges whether the value of the error counter reaches the preset threshold.
  • the above-mentioned preset value can be set to any value greater than the power-on time.
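The trigger condition (a stage has failed and its watchdog timer has exceeded the preset value) can be sketched as follows. This is an illustrative Python model under stated assumptions: the names `WATCHDOG_FOR_STAGE`, `StageStatus`, and `should_check_counter` are hypothetical, and timer values are expressed in seconds purely for the example.

```python
from dataclasses import dataclass

# Which watchdog covers which stage, per the description: the SCP firmware
# watchdog for SMpro, PMpro and ATF, and the BMC FRB-2 watchdog for UEFI.
WATCHDOG_FOR_STAGE = {
    "SMpro": "SCP",
    "PMpro": "SCP",
    "ATF": "SCP",
    "UEFI": "BMC FRB-2",
}


@dataclass
class StageStatus:
    stage: str          # one of the keys of WATCHDOG_FOR_STAGE
    succeeded: bool     # whether the stage completed normally
    elapsed_s: float    # time shown by the stage's watchdog timer (assumed unit)


def should_check_counter(status: StageStatus, preset_value_s: float) -> bool:
    """True only when the stage failed AND its watchdog exceeded the preset value."""
    return (not status.succeeded) and status.elapsed_s > preset_value_s
```

Both conditions must hold: a stage that is merely slow but has not exceeded the preset value, or a stage that succeeded, does not trigger the error counter check.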
  • If the preset threshold is not reached, the value of the error counter is changed according to the preset rule and the boot stage is restarted; specifically, the first boot stage, that is, the SMpro stage, is restarted. If the SMpro stage is successfully executed, the PMpro stage is then executed automatically; if the PMpro stage is successfully executed, the ATF stage is executed, and so on. After the restart, whenever any one of the startup boot phases in the startup process of the ARM server fails and the corresponding watchdog timer exceeds the preset value, the operation of judging whether the value of the error counter reaches the preset threshold is performed again.
  • the above-mentioned changing of the value of the error counter according to the preset rule includes: increasing the value of the error counter by one. Specifically, whenever any one of the startup boot phases in the ARM server startup process fails, it is determined whether the value of the error counter reaches the preset threshold; if not, the value of the error counter is incremented by one and the boot stage is restarted. In this way, the value of the error counter corresponds to the number of restarts, and when the number of restarts reaches the maximum value, the value of the error counter reaches the preset threshold.
  • S104 If the preset threshold is reached, restore the configuration information of the ARM processor to the default setting, and restart the boot stage after restoring the configuration information of the ARM processor to the default setting.
  • the backup NVRAM Non-Volatile Random Access Memory, non-volatile random access memory
  • BIOS Basic Input/Output System
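The decision taken on each detected failure (increment and restart, or restore defaults and restart) can be sketched as a small pure function. This is a minimal Python illustration, not the firmware implementation: `Action` and `on_stage_failure` are hypothetical names, "reaches the threshold" is modelled as `counter >= threshold`, and leaving the counter unchanged after a restore is an assumption, since the description only states that the counter is reset once the server boots normally.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    RESTART = "restart the boot stage from SMpro"
    RESTORE_DEFAULTS_THEN_RESTART = "restore default configuration, then restart"


def on_stage_failure(counter: int, threshold: int = 3) -> Tuple[Action, int]:
    """Return the action to take and the new error counter value.

    Threshold reached: restore the ARM processor configuration to its
    defaults and restart; the counter is left as-is (assumption).
    Threshold not reached: increment the counter by one and restart.
    """
    if counter >= threshold:
        return Action.RESTORE_DEFAULTS_THEN_RESTART, counter
    return Action.RESTART, counter + 1
```

For instance, `on_stage_failure(0)` yields a plain restart with the counter at 1, while `on_stage_failure(3)` yields the restore-defaults action.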
  • Taking the case in which the value of the error counter is incremented by one and the preset threshold equals 3 as an example:
  • If the execution of the SMpro stage fails, judge whether the value of the error counter is 3. If the value of the error counter is 3, restore the configuration information of the ARM processor to the default setting; if the value is not equal to 3, restart the SMpro stage and increment the value of the error counter by one. If the SMpro stage is successfully executed, the PMpro stage is executed next. If the execution of the PMpro stage fails, judge whether the value of the error counter is 3. If the value of the error counter is 3, restore the configuration information of the ARM processor to the default setting; if the value is not equal to 3, restart the SMpro stage and increment the value of the error counter by one. If the PMpro stage is successfully executed, the ATF stage is executed next.
  • If the execution of the ATF stage fails, judge whether the value of the error counter is 3. If the value of the error counter is 3, restore the configuration information of the ARM processor to the default setting; if the value is not equal to 3, restart the SMpro stage and increment the value of the error counter by one. If the ATF stage is successfully executed, the UEFI stage is executed next.
  • When the ARM server starts normally, the value of the error counter is restored to the initial value. Specifically, the value of the error counter may be restored to the initial value before the UEFI stage ends. That is, when the UEFI stage has encountered no problem and before booting to the OS, the value of the error counter is restored to the initial value, for example by clearing it.
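The worked example with a threshold of 3 can be traced with a short simulation. This Python sketch rests on a simplifying assumption that is not in the source: the only fault is a bad user configuration, which restoring the defaults fully clears. The function name `simulate_boot` and the log strings are hypothetical.

```python
from typing import List

THRESHOLD = 3  # example value from the description


def simulate_boot(bad_config: bool, threshold: int = THRESHOLD) -> List[str]:
    """Return an event log for a server whose only fault is a bad configuration."""
    log: List[str] = []
    counter = 0  # S101: initialise the error counter
    while True:
        if not bad_config:
            # All stages succeed; the counter is cleared before UEFI ends.
            counter = 0
            log.append("boot ok, counter reset to 0")
            return log
        # A stage failed and its watchdog expired, so check the counter (S102).
        if counter >= threshold:
            # S104: restore defaults; assumed here to remove the fault.
            bad_config = False
            log.append(f"counter={counter}: restore defaults, restart")
        else:
            # S103: increment the counter and restart from the SMpro stage.
            counter += 1
            log.append(f"counter={counter}: restart")
```

Under these assumptions, `simulate_boot(True)` records three plain restarts (counter 1, 2, 3), then one restore-and-restart, then a successful boot that clears the counter.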
  • the startup repair method for an ARM server includes: initializing the value of an error counter; whenever any one of the startup boot phases in the startup process of the ARM server fails to execute, judging whether the value of the error counter reaches the preset threshold; if the preset threshold is not reached, changing the value of the error counter according to the preset rule and restarting the boot stage; if the preset threshold is reached, restoring the configuration information of the ARM processor to the default setting and, after the configuration information has been restored to the default setting, restarting the boot stage.
  • an error counter is added to count the number of restarts, and when the value of the error counter reaches a preset threshold, the configuration information of the ARM processor is restored to the default setting and restarted. Therefore, by restoring the configuration information of the ARM processor to the default settings, the ARM server can be guaranteed to start normally, and the problem that the ARM server cannot be started normally caused by artificial modification of the configuration information can be effectively solved.
  • FIG. 2 is a schematic diagram of a startup repair apparatus for an ARM server according to an embodiment of the present application.
  • the apparatus includes:
  • an initialization module 10 for initializing the value of the error counter
  • the judgment module 20 is used for judging whether the value of the error counter reaches a preset threshold value whenever any one of the startup boot stages in the ARM server startup process fails to execute;
  • the changing module 30 is used for changing the value of the error counter according to the preset rule and restarting the boot phase if the preset threshold is not reached;
  • the restoring module 40 is configured to restore the configuration information of the ARM processor to the default setting if the preset threshold is reached, and restart the boot stage after restoring the configuration information of the ARM processor to the default setting.
  • the booting phase includes: a SMpro phase, a PMpro phase, an ATF phase, and a UEFI phase.
  • the condition for triggering the judgment of whether the value of the error counter reaches the preset threshold is that any one of the startup boot phases in the startup process of the ARM server fails to execute and the corresponding watchdog timer exceeds the preset value.
  • the judging module 20 includes:
  • the first judgment unit is used for judging whether the value of the error counter reaches the preset threshold when the execution of the SMpro stage fails and the timing of the watchdog timer of the SCP firmware exceeds the preset value;
  • the second judgment unit is used to judge whether the value of the error counter reaches the preset threshold when the execution of the PMpro stage fails and the watchdog timer of the SCP firmware exceeds the preset value;
  • the third judgment unit is used to judge whether the value of the error counter reaches the preset threshold when the execution of the ATF stage fails and the watchdog timer of the SCP firmware exceeds the preset value;
  • the fourth judgment unit is configured to judge whether the value of the error counter reaches the preset threshold when the UEFI stage fails to execute and the FRB-2 watchdog timer of the BMC exceeds the preset value.
  • the changing module 30 is specifically configured to increase the value of the error counter by one if the preset threshold is not reached, and restart the boot phase.
  • the count value restoration module is used to restore the value of the error counter to the initial value after the ARM server starts normally.
  • the count value restoration module is specifically configured to restore the value of the error counter to the initial value before the UEFI stage ends.
  • the present application also provides a boot-repair device for an ARM server.
  • the device includes a memory 1 and a processor 2.
  • the memory 1 is used to store the computer program;
  • the processor 2 is used to execute the computer program to realize the following steps:
  • the present application also provides a computer-readable storage medium 4. As shown in FIG. 4, a computer program 41 is stored on the computer-readable storage medium 4. When the computer program 41 is executed by the processor, the following steps can be implemented:
  • the computer-readable storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • RAM: random access memory
  • ROM: read-only memory
  • EPROM: electrically programmable ROM
  • EEPROM: electrically erasable programmable ROM
  • registers
  • hard disk
  • removable disk
  • CD-ROM: compact disk read-only memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

Startup repair method for an ARM server. The method comprises: initializing the value of an error counter (S101); whenever execution of any boot stage fails during the startup process of an ARM server, determining whether the value of the error counter reaches a preset threshold (S102); if the value does not reach the preset threshold, changing the value of the error counter according to a preset rule and restarting the boot stage (S103); and if the value reaches the preset threshold, restoring configuration information of an ARM processor to a default setting and restarting the boot stage after the configuration information of the ARM processor has been restored to the default setting (S104). With this method, an incorrect setting of an ARM server can be repaired automatically, ensuring that the server starts normally. Also disclosed are a startup repair apparatus for an ARM server, a device, and a computer-readable storage medium, all of which have the said technical effects.
PCT/CN2021/073359 2020-08-21 2021-01-22 Startup repair method for ARM server and related apparatus WO2022037014A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010851698.3A CN112000508A (zh) 2020-08-21 2020-08-21 Startup repair method for ARM server and related apparatus
CN202010851698.3 2020-08-21

Publications (1)

Publication Number Publication Date
WO2022037014A1 (fr) 2022-02-24

Family

ID=73473974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073359 WO2022037014A1 (fr) 2020-08-21 2021-01-22 Startup repair method for ARM server and related apparatus

Country Status (2)

Country Link
CN (1) CN112000508A (fr)
WO (1) WO2022037014A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112000508A (zh) * 2020-08-21 2020-11-27 苏州浪潮智能科技有限公司 Startup repair method for ARM server and related apparatus
CN113032026A (zh) * 2021-03-19 2021-06-25 山东英信计算机技术有限公司 Firmware management method, apparatus, device and medium for a server motherboard

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
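KR20100010390A (ko) * 2008-07-22 2010-02-01 엘지전자 주식회사 Microcomputer and control method of microcomputer
CN107038085A (zh) * 2016-02-03 2017-08-11 阿里巴巴集团控股有限公司 Repair method, apparatus and system for a client application
CN107844330A (zh) * 2017-10-25 2018-03-27 郑州云海信息技术有限公司 Method and system for enhancing the reliability of ARM server boot code
CN107894949A (zh) * 2017-10-11 2018-04-10 五八有限公司 Exception handling method, apparatus and device
CN109783149A (zh) * 2019-01-17 2019-05-21 Oppo广东移动通信有限公司 Power-on control method, apparatus, mobile terminal and storage medium
CN112000508A (zh) * 2020-08-21 2020-11-27 苏州浪潮智能科技有限公司 Startup repair method for ARM server and related apparatus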

Also Published As

Publication number Publication date
CN112000508A (zh) 2020-11-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21857120

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21857120

Country of ref document: EP

Kind code of ref document: A1