CN109491813B - ARM architecture server and management method thereof - Google Patents

ARM architecture server and management method thereof

Info

Publication number
CN109491813B
Authority
CN
China
Prior art keywords
arm
event information
peripheral device
arm processor
architecture server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710810974.XA
Other languages
Chinese (zh)
Other versions
CN109491813A (en)
Inventor
王绍宇
孙佩傑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technical Steel Technology Co ltd
Original Assignee
Giga Byte Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Giga Byte Technology Co Ltd
Priority to CN201710810974.XA
Publication of CN109491813A
Application granted
Publication of CN109491813B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0751 Error or fault detection not based on redundancy

Abstract

The invention provides an ARM architecture server comprising at least one peripheral device, a baseboard management controller (BMC), and an ARM processor, wherein the ARM processor includes ARM Trusted Firmware (ATF). The baseboard management controller monitors the at least one peripheral device and the ARM processor, determines whether any of them is abnormal, and generates event information corresponding to the ARM processor or one of the peripheral devices according to the determination result. The ARM trusted firmware receives the event information from the baseboard management controller and performs an event processing operation on the ARM processor or the peripheral device corresponding to the event information. A management method for the ARM architecture server is also provided.

Description

ARM architecture server and management method thereof
Technical Field
The present invention relates to a server management method, and more particularly, to an ARM architecture server capable of automatically resolving faults and a management method thereof.
Background
A Baseboard Management Controller (BMC) is used to manage a server system. Generally, to monitor whether a computer system is operating normally, a user may use a BMC on the motherboard to check the system. A common approach is to use the BMC remotely to read various sensors (e.g., fan speed or processor temperature) that sense the operation of the components of the computer system. When the user finds that a sensor reading is abnormal, the user must go on site to repair the server (for example, by replacing parts). However, an excessively long reaction time after the server becomes abnormal may cause more serious damage and data loss. Therefore, to keep the server operating normally and providing good service, a long response time after an abnormal sensor reading is not acceptable.
Disclosure of Invention
The invention provides an ARM architecture server and a management method thereof that can automatically perform repairs when the BMC detects that a component is abnormal, so that the server can keep operating normally without interruption.
The invention provides an ARM architecture server, which comprises at least one peripheral device, a baseboard management controller, and an ARM processor. The baseboard management controller is coupled to the at least one peripheral device, monitors the at least one peripheral device and the ARM processor, determines whether any of them is abnormal, and generates event information corresponding to the ARM processor or one of the peripheral devices according to the determination result. The ARM processor is coupled to the at least one peripheral device and the baseboard management controller, and includes ARM Trusted Firmware (ATF). The ARM trusted firmware receives the event information from the baseboard management controller and performs an event processing operation on the ARM processor or the peripheral device corresponding to the event information.
From another perspective, the present invention provides a management method for an ARM architecture server. The ARM architecture server comprises at least one peripheral device, a baseboard management controller, and an ARM processor. The management method comprises the following steps: the baseboard management controller monitors the at least one peripheral device and the ARM processor and determines whether any of them is abnormal; the baseboard management controller generates event information corresponding to the ARM processor or one of the peripheral devices according to the determination result; the baseboard management controller transmits the event information to the ARM processor; and the ARM trusted firmware in the ARM processor performs an event processing operation on the ARM processor or the peripheral device corresponding to the event information.
In an embodiment of the invention, the event information corresponds to an ARM processor, and the event processing operation includes adjusting an operating frequency of the ARM processor.
In an embodiment of the invention, the peripheral device includes a memory device having at least two memory channels, the event information corresponds to one of the memory channels, and the event processing operation includes closing the memory channel corresponding to the event information.
In an embodiment of the invention, the peripheral device includes a PCI-E device, the event information corresponds to the PCI-E device, and the event handling operation includes performing a PCI-E reset.
In an embodiment of the invention, the ARM architecture server includes a plurality of exception levels, wherein an operating system of the ARM architecture server runs at a first exception level, and the ARM trusted firmware runs at a second exception level that is not lower than the first exception level.
Based on the above, in the ARM architecture server and the management method thereof provided by the embodiments of the present invention, the baseboard management controller notifies the ARM trusted firmware of the abnormal event, and the ARM trusted firmware directly handles the abnormal device. Therefore, the ARM architecture server can be repaired in time without the user installing a monitoring program in the operating system, and security is also taken into account.
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a schematic block diagram of an ARM architecture server according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a management method of an ARM architecture server according to an embodiment of the present invention.
Wherein the reference numerals are:
100: ARM architecture server
110: baseboard management controller
120: at least one peripheral device
130: ARM processor
131: ARM trusted firmware
S210-S240: management method steps of ARM architecture server
Detailed Description
Fig. 1 is a schematic block diagram of an ARM architecture server according to an embodiment of the present invention. Referring to fig. 1, an ARM architecture server 100 according to an embodiment of the present invention includes a BMC 110, at least one peripheral device 120, and an ARM processor 130, wherein the BMC 110 and the ARM processor 130 are coupled to each peripheral device 120. In particular, the BMC 110 is also coupled to the ARM processor 130.
In one embodiment, the ARM architecture server 100 uses, for example but not limited to, the ARMv8-A architecture, which includes a plurality of exception levels, where a higher exception level indicates a higher privilege. For example, the ARM architecture server 100 includes four exception levels EL0 through EL3, where EL0 is the unprivileged level, EL1 is the operating system kernel mode (OS kernel mode), EL2 is the hypervisor mode, and EL3 is the TrustZone® monitor mode.
The BMC 110 is connected to each peripheral device 120 through, for example, an Intelligent Platform Management Bus (IPMB) to monitor each peripheral device 120. In one embodiment, the peripheral devices 120 include a sensor for monitoring a fan speed or a processor temperature, a dual-channel Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), and a PCI-E Ethernet card, but the invention is not limited thereto. A person of ordinary skill in the art can obtain sufficient teaching from the prior art on how the BMC 110 monitors each peripheral device 120 of the server, so the details are not repeated here.
The ARM processor 130 is a processor designed on a Reduced Instruction Set Computer (RISC) architecture, such as an ARM Cortex-A, ARM Cortex-M, Cortex-A50 series, or Cortex-A73 processor, but the invention is not limited thereto.
In one embodiment, the ARM processor 130 includes ARM Trusted Firmware (ATF) 131 to provide ATF services. Notably, the ARM trusted firmware 131 runs at an exception level not lower than that of the operating system. For example, the operating system of the ARM architecture server 100 runs at a first exception level (e.g., EL1), and the ARM trusted firmware 131 runs at a second exception level (e.g., EL3) that is not lower than the first exception level. Thus, the ARM trusted firmware 131 can access the ARM processor 130 itself as well as all plug-in or non-plug-in peripheral devices 120 on various interfaces (e.g., SATA, PCI-E, LAN, GPIO, SPI, or I2C). The ARM trusted firmware 131 and the ATF services that it can provide are well known in the art and are not described in detail here.
In particular, when the BMC 110 detects an abnormality of the ARM processor 130 itself or of a peripheral device 120, it notifies the ARM processor 130 of the abnormality. Since the ARM trusted firmware 131 in the ARM processor 130 runs at an exception level that is not lower than that of the operating system, the ARM trusted firmware 131 is able to directly handle or repair the component of the ARM architecture server 100 that is experiencing the abnormality.
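The ordering of exception levels described above can be illustrated with a minimal Python sketch. It is only a model of the "not lower than" relationship; the names `ExceptionLevel` and `runs_not_lower_than` are hypothetical and do not come from the patent or from any ARM software.

```python
from enum import IntEnum


class ExceptionLevel(IntEnum):
    """ARMv8-A exception levels; a larger value means a higher privilege."""
    EL0 = 0  # unprivileged level
    EL1 = 1  # operating system kernel mode
    EL2 = 2  # hypervisor mode
    EL3 = 3  # TrustZone monitor mode


def runs_not_lower_than(firmware_level: ExceptionLevel,
                        os_level: ExceptionLevel) -> bool:
    """Return True when the firmware level is not lower than the OS level."""
    return firmware_level >= os_level
```

With the example levels given in the description, `runs_not_lower_than(ExceptionLevel.EL3, ExceptionLevel.EL1)` evaluates to `True`.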
Fig. 2 is a flowchart illustrating a management method of an ARM architecture server according to an embodiment of the present invention. The management method in the embodiment of fig. 2 is applied to ARM architecture server 100 in fig. 1, and the detailed steps of the method in the embodiment of fig. 2 will be described below with reference to the components of ARM architecture server 100 in fig. 1.
Referring to fig. 2, in step S210, the BMC 110 monitors the at least one peripheral device 120 and the ARM processor 130 and determines whether any of them is abnormal.
For example, but not limited thereto, the BMC 110 may monitor a temperature sensor in the ARM processor 130 to determine whether the ARM processor 130 is overheating; the BMC 110 may monitor the memory device in the ARM architecture server 100 to determine whether it is operating normally; or the BMC 110 may monitor whether the PCI-E devices on the PCI-E bus of the ARM architecture server 100 are functioning properly. A person skilled in the art can obtain sufficient teaching from prior knowledge of BMCs to define the abnormal state of each component and to determine whether each component is abnormal, so the details are omitted here.
If the BMC 110 finds no abnormality, step S210 is repeated. Otherwise, if the BMC 110 determines that the ARM processor 130 or one of the peripheral devices 120 is abnormal, the process proceeds to step S220, in which the BMC 110 generates event information according to the determination result, and then to step S230, in which the BMC 110 transmits the event information to the ARM processor 130.
In detail, the event information generated by the BMC 110 according to the determination result corresponds to the ARM processor 130 or the peripheral device 120 that is abnormal. For example, when the BMC 110 determines that the ARM processor 130 is overheated, it generates event information indicating that the ARM processor 130 is overheated; when the BMC 110 determines that the memory device is not operating normally (e.g., the data contains more bit errors than the error correction code mechanism can correct), it generates event information indicating that the memory device cannot operate normally.
In step S240, the ARM trusted firmware 131 in the ARM processor 130 receives the event information from the BMC 110 and performs an event processing operation on the ARM processor 130 or the peripheral device 120 corresponding to the event information, so that the operation of the ARM architecture server 100 is not interrupted.
In one embodiment, the ARM processor 130 is connected to a peripheral device 120 (e.g., a temperature sensor), and the event information corresponds to the ARM processor 130, for example, indicating that the ARM processor 130 is overheated. After the ARM trusted firmware 131 receives the event information, it performs an event processing operation on the ARM processor 130. For example, the ARM trusted firmware 131 may reduce the operating frequency of the ARM processor 130, or adjust the level of a CPU thermal throttle in the ARM processor 130, to cool the processor down. Thus, even if the ARM processor 130 is abnormal (overheated), the ARM trusted firmware 131 can handle the condition in time and avoid serious damage that could interrupt operation and stop the service of the ARM architecture server 100.
In one embodiment, the ARM architecture server 100 includes a peripheral device 120 (e.g., a dual-channel DDR SDRAM), and the event information corresponds to one of the memory channels of the DDR SDRAM, for example, indicating that the memory on that memory channel is not functioning properly. After the ARM trusted firmware 131 receives the event information, it performs an event processing operation on the memory channel corresponding to the event information. For example, the ARM trusted firmware 131 may close the memory channel corresponding to the event information and keep the memory on the other memory channel in normal operation. As such, the ARM architecture server 100 can continue to operate without interrupting its service.
In one embodiment, the ARM architecture server 100 includes peripheral devices 120 (e.g., PCI-E devices), and the event information corresponds to one of the PCI-E devices or endpoints (e.g., a PCI-E Ethernet card), for example, indicating that the PCI-E device is not operating properly. After the ARM trusted firmware 131 receives the event information, it performs an event processing operation on the PCI-E device. For example, the ARM trusted firmware 131 may perform a PCI-E reset operation in an attempt to repair the PCI-E device. Thus, the PCI-E reset operation can be performed without restarting the ARM architecture server 100, so as to repair the PCI-E device and restore its normal operation. A person skilled in the art can obtain sufficient teaching on PCI-E reset from the related art to carry out the PCI-E reset operation described in this embodiment, so the details are omitted here.
In summary, the ARM architecture server and the management method thereof provided by the embodiments of the present invention use the ARM trusted firmware in the ARM processor to directly handle or repair the abnormal component in the ARM architecture server, which protects the server from serious damage and data loss. In addition, the user does not need to install an additional monitoring program in the operating system, which avoids the risk of confidential data leaking through a backdoor hidden in such a monitoring program and improves the security of the server's service.
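Steps S210 and S220 can be sketched in Python as a function that turns sensor readings into event information. This is an illustration only: the patent does not define a message format or thresholds, so `EventInfo`, `judge`, the dictionary keys, and both limits below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

CPU_TEMP_LIMIT_C = 95.0        # hypothetical overheating threshold
UNCORRECTABLE_ERROR_LIMIT = 0  # hypothetical: any uncorrectable error is abnormal


@dataclass(frozen=True)
class EventInfo:
    target: str  # component the event corresponds to
    reason: str  # why the BMC judged it abnormal


def judge(readings: dict) -> Optional[EventInfo]:
    """Return event information for the first abnormal component, or None."""
    if readings.get("cpu_temp_c", 0.0) > CPU_TEMP_LIMIT_C:
        return EventInfo("arm_processor", "overheated")
    errors_by_channel = readings.get("uncorrectable_errors", {})
    for channel, errors in sorted(errors_by_channel.items()):
        if errors > UNCORRECTABLE_ERROR_LIMIT:
            return EventInfo(f"memory_channel_{channel}", "uncorrectable errors")
    return None  # nothing abnormal: the BMC keeps repeating step S210
```

A return value of `None` models the branch in which step S210 is repeated; any other value models the event information that is transmitted in step S230.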
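Step S240 amounts to dispatching on the component named in the event information. The Python sketch below models that dispatch under stated assumptions: the target names, the handler functions, and the returned strings are hypothetical placeholders, and real ARM trusted firmware would be written in C against hardware registers rather than in Python.

```python
from typing import Callable, Dict


def throttle_processor() -> str:
    return "operating frequency reduced"


def close_memory_channel() -> str:
    return "memory channel closed"


def reset_pcie_device() -> str:
    return "PCI-E reset performed"


# Hypothetical mapping from the kind of abnormal component to its handler.
HANDLERS: Dict[str, Callable[[], str]] = {
    "arm_processor": throttle_processor,
    "memory_channel": close_memory_channel,
    "pcie_device": reset_pcie_device,
}


def handle_event(target_kind: str) -> str:
    """Run the event processing operation registered for the target kind."""
    handler = HANDLERS.get(target_kind)
    if handler is None:
        return "no handler registered"
    return handler()
```

The three handlers correspond to the three embodiments described in the text: frequency adjustment, closing a memory channel, and PCI-E reset.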
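One way to picture the frequency reduction is to step down to the next lower supported operating point. The sketch below assumes a discrete list of supported frequencies, which the patent does not specify; `next_lower_frequency_mhz` is a hypothetical helper.

```python
from typing import Sequence


def next_lower_frequency_mhz(current_mhz: int,
                             supported_mhz: Sequence[int]) -> int:
    """Pick the highest supported frequency below the current one.

    If no lower operating point exists, the current frequency is kept.
    """
    lower = [f for f in supported_mhz if f < current_mhz]
    return max(lower) if lower else current_mhz
```

For example, with supported points of 1200, 1800, and 2400 MHz, a processor running at 2400 MHz would be stepped down to 1800 MHz.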
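Closing one memory channel while keeping the other in service can be modeled as removing the faulty channel from the set of enabled channels. This is a sketch under assumptions: channels are identified by integers, and the refusal to close the last remaining channel is an added safeguard that the patent does not describe.

```python
from typing import Set


def close_channel(enabled: Set[int], faulty: int) -> Set[int]:
    """Return the channels that stay enabled after closing the faulty one."""
    remaining = set(enabled) - {faulty}
    if not remaining:
        # Assumed safeguard: at least one channel must stay in operation.
        raise RuntimeError("cannot close the last memory channel")
    return remaining
```

For a dual-channel memory with channels 0 and 1, closing channel 1 leaves channel 0 in service.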
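Because the reset is described as an attempt at repair, a natural sketch is a bounded retry loop: reset, check the device, and stop once it is healthy. The callbacks `reset` and `is_healthy` and the retry limit are assumptions introduced for illustration; the patent does not say how the result of a reset is verified.

```python
from typing import Callable


def repair_pcie(reset: Callable[[], None],
                is_healthy: Callable[[], bool],
                max_attempts: int = 3) -> bool:
    """Reset the device until it reports healthy or attempts run out."""
    for _ in range(max_attempts):
        reset()
        if is_healthy():
            return True
    return False
```

The function returns `False` when the device never recovers, which would leave the decision about further action (for example, on-site repair) to the operator.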
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. An ARM architecture server, comprising:
at least one peripheral device;
a baseboard management controller coupled to the at least one peripheral device; and
an ARM processor coupled to the at least one peripheral device and the baseboard management controller, wherein the ARM processor includes an ARM trusted firmware,
wherein the baseboard management controller is used for monitoring and determining whether the at least one peripheral device and the ARM processor are abnormal, and generating event information according to a determination result,
wherein the ARM trusted firmware is configured to receive the event information from the baseboard management controller, wherein the event information corresponds to one of the ARM processor or the at least one peripheral device,
the ARM trusted firmware is further used for executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information, and the event processing operation comprises the ARM trusted firmware directly handling or repairing an abnormal component in the ARM architecture server;
the ARM architecture server comprises a plurality of exception levels, wherein the higher the exception level is, the higher the access privilege is, an operating system of the ARM architecture server runs at a first exception level, and the ARM trusted firmware runs at a second exception level which is not lower than the first exception level and can access the ARM processor and all plug-in or non-plug-in peripheral devices of various interfaces.
2. The ARM architecture server of claim 1, wherein the event information corresponds to the ARM processor, wherein the event processing operation comprises adjusting an operating frequency of the ARM processor.
3. The ARM architecture server of claim 1, wherein the at least one peripheral device includes a memory device, the memory device including at least two memory channels,
the event information corresponds to one of the memory channels, and the event processing operation includes closing the memory channel corresponding to the event information.
4. The ARM architecture server of claim 1, wherein the at least one peripheral device comprises a PCI-E device, wherein the event information corresponds to the PCI-E device, and wherein the event handling operation comprises performing a PCI-E reset.
5. A management method for an ARM architecture server, wherein the ARM architecture server includes at least one peripheral device, a baseboard management controller and an ARM processor, the management method includes:
the baseboard management controller monitors and determines whether the at least one peripheral device and the ARM processor are abnormal;
the baseboard management controller generates event information according to a determination result, wherein the event information corresponds to one of the ARM processor or the at least one peripheral device;
the baseboard management controller transmits the event information to the ARM processor;
executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information by using ARM trusted firmware in the ARM processor, wherein the event processing operation comprises the ARM trusted firmware directly handling or repairing the abnormal component in the ARM architecture server; and
the ARM architecture server comprises a plurality of exception levels, wherein an operating system of the ARM architecture server runs at a first exception level, the ARM trusted firmware runs at a second exception level which is not lower than the first exception level, and the ARM trusted firmware can access the ARM processor and all plug-in or non-plug-in peripheral devices of various interfaces.
6. The method of claim 5, wherein the event information corresponds to the ARM processor, and wherein the event processing operation comprises:
adjusting the operating frequency of the ARM processor.
7. The method of claim 5, wherein the at least one peripheral device comprises a memory device, the memory device comprises at least two memory channels, and the event information corresponds to one of the memory channels, wherein the event processing operation comprises:
closing the memory channel corresponding to the event information.
8. The method of claim 5, wherein the at least one peripheral device comprises a PCI-E device, wherein the event information corresponds to the PCI-E device, and the event processing operation comprises:
performing a PCI-E reset.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710810974.XA CN109491813B (en) 2017-09-11 2017-09-11 ARM architecture server and management method thereof

Publications (2)

Publication Number Publication Date
CN109491813A (en) 2019-03-19
CN109491813B (en) 2022-07-08

Family

ID=65687516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710810974.XA Active CN109491813B (en) 2017-09-11 2017-09-11 ARM architecture server and management method thereof

Country Status (1)

Country Link
CN (1) CN109491813B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576842A (en) * 2008-05-07 2009-11-11 英业达股份有限公司 System and method for monitoring baseboard management controller
CN103020545A (en) * 2012-12-21 2013-04-03 浪潮电子信息产业股份有限公司 Over-temperature protection method based on Loongson processor
CN104639380A (en) * 2013-11-07 2015-05-20 英业达科技有限公司 Server monitoring method
CN104699589A (en) * 2013-12-09 2015-06-10 鸿富锦精密工业(深圳)有限公司 Fan error detection system and method
CN105607972A (en) * 2015-12-28 2016-05-25 Tcl集团股份有限公司 Abnormity remedying method and device
CN106598814A (en) * 2016-12-26 2017-04-26 郑州云海信息技术有限公司 Design method for realizing overheating protection on server system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424041A (en) * 2013-08-23 2015-03-18 鸿富锦精密工业(深圳)有限公司 System and method for processing error

Also Published As

Publication number Publication date
CN109491813A (en) 2019-03-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Taiwan Xindian District, New Taipei City Chinese Po Road No. 6

Applicant after: GIGA-BYTE TECHNOLOGY Co.,Ltd.

Address before: Chinese Taiwan Taipei City store Bao Jiang Road No. 6

Applicant before: GIGA-BYTE TECHNOLOGY Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230327

Address after: 7th Floor, No. 6, Baoqiang Road, Xindian District, Xinbei City, Taiwan, China, China

Patentee after: Technical Steel Technology Co.,Ltd.

Address before: Taiwan Xindian District, New Taipei City Chinese Po Road No. 6

Patentee before: GIGA-BYTE TECHNOLOGY Co.,Ltd.
