CN109491813B - ARM architecture server and management method thereof

Info

Publication number: CN109491813B
Application number: CN201710810974.XA
Authority: CN
Inventors: 王绍宇; 孙佩傑
Original assignee: Giga Byte Technology Co Ltd
Current assignee: Technical Steel Technology Co ltd
Priority date: 2017-09-11
Filing date: 2017-09-11
Publication date: 2022-07-08
Anticipated expiration: 2037-09-11
Also published as: CN109491813A

Abstract

The invention provides an ARM architecture server, which comprises at least one peripheral device, a baseboard management controller and an ARM processor, wherein the ARM processor comprises ARM Trusted Firmware (ATF). The baseboard management controller is used for monitoring and judging whether the at least one peripheral device and the ARM processor are abnormal, and for generating event information corresponding to the ARM processor or one of the peripheral devices according to a judgment result. The ARM trusted firmware is used for receiving the event information from the baseboard management controller and executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information. In addition, a management method of the ARM architecture server is also provided.

Description

ARM architecture server and management method thereof

Technical Field

The present invention relates to a server management method, and more particularly, to an ARM architecture server capable of automatic troubleshooting and a management method thereof.

Background

A Baseboard Management Controller (BMC) is used to manage a server system. Generally, to monitor whether a computer system is operating normally, a user may use a BMC arranged on the motherboard to test the computer system. A common approach is to access the BMC remotely to read the values of various sensors (e.g., fan speed or processor temperature) that sense the operation of the components of the computer system. When the user finds that a sensor reading is abnormal, the user must go to the site to repair the server (for example, by replacing parts). However, an excessively long reaction time after the server becomes abnormal may cause more serious damage and data loss. Therefore, to maintain normal operation and good service of the server, a long response time after an abnormal sensor reading cannot be tolerated.

Disclosure of Invention

The invention provides an ARM architecture server and a management method thereof, which can perform repair automatically when a BMC detects that a component is abnormal, so that the server can continue to operate normally without interruption.

The invention provides an ARM architecture server, which comprises at least one peripheral device, a baseboard management controller and an ARM processor. The baseboard management controller is coupled to the at least one peripheral device, and is used for monitoring and judging whether the at least one peripheral device and the ARM processor are abnormal, and for generating event information corresponding to the ARM processor or one of the peripheral devices according to a judgment result. The ARM processor is coupled to the at least one peripheral device and the baseboard management controller, and includes ARM Trusted Firmware (ATF). The ARM trusted firmware is used for receiving the event information from the baseboard management controller and executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information.

From another perspective, the present invention provides a management method for an ARM architecture server. The ARM architecture server comprises at least one peripheral device, a baseboard management controller and an ARM processor. The management method comprises the following steps: the baseboard management controller monitors and judges whether the at least one peripheral device and the ARM processor are abnormal; the baseboard management controller generates event information corresponding to the ARM processor or one of the peripheral devices according to the judgment result; the baseboard management controller transmits the event information to the ARM processor; and an event processing operation is executed on the ARM processor or the peripheral device corresponding to the event information by using ARM trusted firmware in the ARM processor.

In an embodiment of the invention, the event information corresponds to an ARM processor, and the event processing operation includes adjusting an operating frequency of the ARM processor.

In an embodiment of the invention, the peripheral device includes a memory device having at least two memory channels, the event information corresponds to one of the memory channels, and the event processing operation includes closing the memory channel corresponding to the event information.

In an embodiment of the invention, the peripheral device includes a PCI-E device, the event information corresponds to the PCI-E device, and the event handling operation includes performing a PCI-E reset.

In an embodiment of the invention, the ARM architecture server includes a plurality of exception levels, wherein an operating system of the ARM architecture server runs at a first exception level, and the ARM trusted firmware runs at a second exception level that is not lower than the first exception level.

Based on the above, in the ARM architecture server and the management method thereof provided by the embodiments of the present invention, the baseboard management controller notifies the ARM trusted firmware of the abnormal event, and the ARM trusted firmware directly handles the abnormal device. Therefore, the ARM architecture server can be repaired in time without the user installing a monitoring program in the operating system, and security is also taken into account.

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

Drawings

Fig. 1 is a schematic block diagram of an ARM architecture server according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a management method of an ARM architecture server according to an embodiment of the present invention.

Wherein the reference numerals are:

100: ARM architecture server

110: baseboard management controller

120: at least one peripheral device

130: ARM processor

131: ARM trusted firmware

S210-S240: management method steps of ARM architecture server

Detailed Description

Fig. 1 is a schematic block diagram of an ARM architecture server according to an embodiment of the present invention. Referring to Fig. 1, an ARM architecture server 100 according to an embodiment of the present invention includes a BMC 110, at least one peripheral device 120, and an ARM processor 130, wherein the BMC 110 and the ARM processor 130 are coupled to each peripheral device 120. In particular, the BMC 110 is also coupled to the ARM processor 130.

In one embodiment, the ARM architecture server 100 is based on, for example but not limited to, the ARMv8-A architecture, which includes a plurality of exception levels, where a higher exception level indicates a higher access privilege. For example, the ARM architecture server 100 includes four exception levels, EL0 through EL3, where EL0 is the unprivileged level, EL1 is the operating system kernel level (OS kernel mode), EL2 is the hypervisor level (hypervisor mode), and EL3 is the TrustZone® monitor level (TrustZone® monitor mode).
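By way of non-limiting illustration only, the ordering of the exception levels described above may be sketched as follows. The helper name `not_lower_than` and the two constants are hypothetical and are introduced here solely for illustration; they are not part of any ARM software interface.

```python
from enum import IntEnum


class ExceptionLevel(IntEnum):
    """The four ARMv8-A exception levels named in the description."""
    EL0 = 0  # unprivileged level
    EL1 = 1  # operating system kernel level
    EL2 = 2  # hypervisor level
    EL3 = 3  # TrustZone monitor level


def not_lower_than(second: ExceptionLevel, first: ExceptionLevel) -> bool:
    """Hypothetical helper: True when `second` is not lower than `first`."""
    return second >= first


# Example assignment used in the description: OS at EL1, trusted firmware at EL3.
OS_LEVEL = ExceptionLevel.EL1
FIRMWARE_LEVEL = ExceptionLevel.EL3
```

Under these assumptions, `not_lower_than(FIRMWARE_LEVEL, OS_LEVEL)` holds, which models the condition that a higher exception level carries a higher access privilege.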

The BMC 110 is connected to each peripheral device 120 through, for example, an Intelligent Platform Management Bus (IPMB) to monitor each peripheral device 120. In one embodiment, the peripheral devices 120 include a sensor for monitoring a fan speed or a processor temperature, a dual-channel Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), and a PCI-E Ethernet card, but the invention is not limited thereto. A person of ordinary skill in the art can obtain sufficient teaching from the prior art regarding how the BMC 110 monitors each peripheral device 120, so the details are not repeated here.

The ARM processor 130 is a processor designed on a Reduced Instruction Set Computer (RISC) architecture, such as an ARM Cortex-A, ARM Cortex-M, Cortex-A50 series, or Cortex-A73 processor, but the invention is not limited thereto.

In one embodiment, the ARM processor 130 includes ARM Trusted Firmware (ATF) 131 to provide ATF services. Notably, the ARM trusted firmware 131 runs at an exception level not lower than that of the operating system. For example, the operating system of the ARM architecture server 100 runs at a first exception level (e.g., EL1), and the ARM trusted firmware 131 runs at a second exception level (e.g., EL3) that is not lower than the first exception level. Thus, the ARM trusted firmware 131 can access the whole of the ARM processor 130 itself as well as all plug-in or non-plug-in peripheral devices 120 on the various interfaces (e.g., SATA, PCI-E, LAN, GPIO, SPI, or I2C). The ARM trusted firmware 131 and the ATF services it can provide are well known in the art and are not described in detail here.

In particular, when the BMC 110 detects an abnormality of the ARM processor 130 itself or of a peripheral device 120, it notifies the ARM processor 130 of the abnormality. Since the ARM trusted firmware 131 in the ARM processor 130 runs at an exception level that is not lower than that of the operating system, the ARM trusted firmware 131 is able to directly handle or repair the component of the ARM architecture server 100 that is experiencing the abnormality.

Fig. 2 is a flowchart illustrating a management method of an ARM architecture server according to an embodiment of the present invention. The management method in the embodiment of fig. 2 is applied to ARM architecture server 100 in fig. 1, and the detailed steps of the method in the embodiment of fig. 2 will be described below with reference to the components of ARM architecture server 100 in fig. 1.

Referring to Fig. 2, in step S210, the BMC 110 monitors and determines whether at least one of the peripheral devices 120 and the ARM processor 130 is abnormal.

For example, but without limitation, the BMC 110 may monitor a temperature sensor in the ARM processor 130 to determine whether the ARM processor 130 is overheating; the BMC 110 may monitor the memory device in the ARM architecture server 100 to determine whether it is operating normally; or the BMC 110 may monitor whether the PCI-E devices on the PCI-E bus of the ARM architecture server 100 are functioning properly. A person skilled in the art can obtain sufficient teaching from existing knowledge of BMCs to define the abnormal state of each component and to carry out the above determination of whether each component is abnormal, so the details are omitted here.

If the BMC 110 finds no abnormality, step S210 continues to be performed. Otherwise, if the BMC 110 determines that the ARM processor 130 or one of the peripheral devices 120 is abnormal, the process proceeds to step S220, in which the BMC 110 generates event information according to the determination result, and then to step S230, in which the BMC 110 transmits the event information to the ARM processor 130.

In detail, the event information generated by the BMC 110 according to the determination result corresponds to the ARM processor 130 or the peripheral device 120 in which the abnormality occurs. For example, when the BMC 110 determines that the ARM processor 130 is overheated, it generates event information indicating that the ARM processor 130 is overheated; when the BMC 110 determines that the memory device is not operating normally (e.g., the data contains too many error bits for the error correction code mechanism to correct), it generates event information indicating that the memory device cannot operate normally.
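By way of non-limiting illustration only, steps S210 through S230 may be sketched as a single polling pass. The `Event` record, the function names, the component keys and the numeric limits below are all hypothetical; the description does not specify the format of the event information or the criteria for abnormality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event information: which component, and why."""
    target: str
    reason: str


def check_components(readings: Dict[str, float],
                     limits: Dict[str, float]) -> Optional[Event]:
    """S210/S220: return an event for the first reading above its limit."""
    for name, value in readings.items():
        if value > limits[name]:
            return Event(name, f"reading {value} exceeds limit {limits[name]}")
    return None  # no abnormality found; the caller keeps monitoring


def bmc_poll_once(readings: Dict[str, float],
                  limits: Dict[str, float],
                  send: Callable[[Event], None]) -> bool:
    """One monitoring pass; returns True if an event was transmitted."""
    event = check_components(readings, limits)
    if event is None:
        return False
    send(event)  # S230: transmit the event information to the ARM processor
    return True
```

In this sketch, `send` stands in for whatever transport couples the BMC to the ARM processor, and a caller would invoke `bmc_poll_once` repeatedly to model the return to step S210.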

In step S240, the ARM trusted firmware 131 in the ARM processor 130 receives the event information from the BMC 110 and performs an event processing operation on the ARM processor 130 or the peripheral device 120 corresponding to the event information, so as not to interrupt the operation of the ARM architecture server 100.
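By way of non-limiting illustration only, the selection of an event processing operation in step S240 may be modeled as a lookup from the kind of event to a handler. The event kind strings, the handler table and `make_dispatcher` are hypothetical names chosen for this sketch; the handlers merely return a description of the action instead of touching hardware.

```python
from typing import Callable, Dict


def make_dispatcher(
    handlers: Dict[str, Callable[[str], str]]
) -> Callable[[str, str], str]:
    """Build a function that routes an event kind to its handler."""
    def dispatch(kind: str, component: str) -> str:
        handler = handlers.get(kind)
        if handler is None:
            # Unknown events are reported and the component is not touched.
            return f"no handler for {kind}; {component} left unchanged"
        return handler(component)
    return dispatch


# Hypothetical handler table mirroring the three embodiments described.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "processor_overheat": lambda c: f"lower operating frequency of {c}",
    "memory_channel_fault": lambda c: f"close {c}",
    "pcie_fault": lambda c: f"PCI-E reset {c}",
}

dispatch = make_dispatcher(HANDLERS)
```

A table-driven design is only one possible choice; it keeps each event processing operation independent so that further operations could be added without changing the dispatch logic.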

In one embodiment, the ARM processor 130 is connected to the peripheral device 120 (e.g., a temperature sensor), and the event information corresponds to the ARM processor 130, for example, indicating that the ARM processor 130 is overheated. After the ARM trusted firmware 131 receives the event information, it performs an event processing operation on the ARM processor 130. For example, the ARM trusted firmware 131 may reduce the operating frequency of the ARM processor 130, or adjust the CPU throttle level in the ARM processor 130, to achieve a cooling effect. Thus, even if the ARM processor 130 is abnormal (overheated), the ARM trusted firmware 131 can handle the abnormality in time, avoiding serious damage that could interrupt operation and cause the ARM architecture server 100 to stop providing service.
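By way of non-limiting illustration only, one possible frequency adjustment policy is a fixed step down with a lower bound. The step size, the floor and the temperature limit below are assumptions made for this sketch; the description states only that the operating frequency may be reduced.

```python
def next_frequency_mhz(current_mhz: int,
                       temp_c: float,
                       limit_c: float,
                       step_mhz: int = 200,
                       floor_mhz: int = 800) -> int:
    """Return the operating frequency to apply after an overheat check.

    All numeric defaults are hypothetical. The frequency is left unchanged
    while the temperature is within the limit, and otherwise lowered by one
    step without going below the floor.
    """
    if temp_c <= limit_c:
        return current_mhz
    return max(floor_mhz, current_mhz - step_mhz)
```

For instance, under these assumed values a processor at 2400 MHz reporting 95 °C against a 90 °C limit would be set to 2200 MHz, and repeated events would lower it further until the floor is reached.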

In one embodiment, the ARM architecture server 100 includes a peripheral device 120 (e.g., a dual-channel DDR SDRAM), and the event information corresponds to one of the memory channels of the DDR SDRAM, for example, indicating that the memory on that memory channel is not functioning properly. After the ARM trusted firmware 131 receives the event information, it performs an event processing operation on the memory channel corresponding to the event information. For example, the ARM trusted firmware 131 may close the memory channel corresponding to the event information and keep the memory of the other memory channel operating normally. As such, the ARM architecture server 100 can continue to operate without interruption of service.
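By way of non-limiting illustration only, closing a faulty memory channel while keeping the remaining channel in service may be modeled on a set of active channel names. The function name and channel names are hypothetical, and the refusal to close the last active channel is an added assumption of this sketch that the description does not address.

```python
from typing import FrozenSet


def close_faulty_channel(active: FrozenSet[str], faulty: str) -> FrozenSet[str]:
    """Return the channels that stay active after closing `faulty`.

    Assumption of this sketch: the last active channel is never closed,
    since the server could not keep running without any memory.
    """
    remaining = active - {faulty}
    if not remaining:
        raise RuntimeError("refusing to close the last active memory channel")
    return remaining
```

With a dual-channel memory modeled as `frozenset({"channel0", "channel1"})`, closing `"channel1"` leaves `"channel0"` active.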

In one embodiment, the ARM architecture server 100 includes peripheral devices 120 (e.g., PCI-E devices), and the event information corresponds to one of the PCI-E devices or endpoints (e.g., a PCI-E Ethernet card), for example, indicating that the PCI-E device is not operating properly. After the ARM trusted firmware 131 receives the event information, it performs an event processing operation on the PCI-E device. For example, the ARM trusted firmware 131 may perform a PCI-E reset operation in an attempt to repair the PCI-E device. Thus, the PCI-E reset operation can be performed without restarting the ARM architecture server 100, so as to repair the PCI-E device and restore its normal operation. Those skilled in the art can obtain sufficient teaching on PCI-E reset from the related art to carry out the PCI-E reset operation described in this embodiment, so the details are omitted here.
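By way of non-limiting illustration only, a reset attempt followed by a health check may be sketched as below. The callables `reset` and `is_healthy` stand in for platform-specific operations that the description leaves to the related art, and retrying up to a bounded number of attempts is an assumption of this sketch.

```python
from typing import Callable


def recover_pcie_device(reset: Callable[[], None],
                        is_healthy: Callable[[], bool],
                        max_attempts: int = 3) -> int:
    """Reset the device until it reports healthy or attempts run out.

    Returns the number of resets that were needed, or 0 if the device is
    still unhealthy after `max_attempts` resets. The retry bound is a
    hypothetical policy, not taken from the description.
    """
    for attempt in range(1, max_attempts + 1):
        reset()
        if is_healthy():
            return attempt
    return 0
```

Because only the affected device is reset, this models the point made above that the repair is attempted without restarting the server as a whole.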

In summary, the ARM architecture server and the management method thereof provided by the embodiments of the present invention use the ARM trusted firmware in the ARM processor to directly handle or repair an abnormal component in the ARM architecture server, preventing serious damage to the server and loss of data. In addition, the user does not need to install an additional monitoring program in the operating system, which avoids the risk of confidential data leaking through a backdoor hidden in such a monitoring program and thereby improves the security of the server's service.

Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. An ARM architecture server, comprising:

at least one peripheral device;

a baseboard management controller coupled to the at least one peripheral device; and

an ARM processor coupled to the at least one peripheral device and the baseboard management controller, wherein the ARM processor includes an ARM trusted firmware,

wherein the baseboard management controller is used for monitoring and judging whether the at least one peripheral device and the ARM processor are abnormal or not, and generating event information according to a judgment result,

wherein the ARM trusted firmware is configured to receive the event information from the baseboard management controller, wherein the event information corresponds to one of the ARM processor or the at least one peripheral device,

the ARM trusted firmware is further used for executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information, and the event processing operation comprises that the ARM trusted firmware can directly process or repair an abnormal component in the ARM architecture server;

the ARM architecture server comprises a plurality of exception levels, wherein the higher the exception level is, the higher the access authority is, an operating system of the ARM architecture server runs at a first exception level, and the ARM trusted firmware runs at a second exception level which is not lower than the first exception level, and can access all the ARM processor and plug-in or non-plug-in peripheral devices of various interfaces.

2. The ARM architecture server of claim 1, wherein the event information corresponds to the ARM processor, wherein the event processing operation comprises adjusting an operating frequency of the ARM processor.

3. The ARM architecture server of claim 1, wherein the at least one peripheral device includes a memory device, the memory device including at least two memory channels,

the event information corresponds to one of the memory channels, and the event processing operation includes closing the memory channel corresponding to the event information.

4. The ARM architecture server of claim 1, wherein the at least one peripheral device comprises a PCI-E device, wherein the event information corresponds to the PCI-E device, and wherein the event handling operation comprises performing a PCI-E reset.

5. A management method for an ARM architecture server, wherein the ARM architecture server includes at least one peripheral device, a baseboard management controller and an ARM processor, the management method comprising:

the baseboard management controller monitors and judges whether the at least one peripheral device and the ARM processor are abnormal or not;

the baseboard management controller generates event information according to a judgment result, wherein the event information corresponds to one of the ARM processor or the at least one peripheral device;

the baseboard management controller transmits the event information to the ARM processor;

executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information by using ARM trusted firmware in the ARM processor, wherein the event processing operation comprises that the ARM trusted firmware can directly process or repair the abnormal component in the ARM architecture server; and

the ARM architecture server comprises a plurality of exception levels, wherein an operating system of the ARM architecture server runs at a first exception level, the ARM trusted firmware runs at a second exception level which is not lower than the first exception level, and the ARM trusted firmware can access all the ARM processor and plug-in or non-plug-in peripheral devices of various interfaces.

6. The method of claim 5, wherein the event information corresponds to the ARM processor, and wherein the event handling operations comprise:

adjusting the operating frequency of the ARM processor.

7. The method of claim 5, wherein the at least one peripheral device comprises a memory device, the memory device comprises at least two memory channels, and the event information corresponds to one of the memory channels, wherein the event processing operation comprises:

closing the memory channel corresponding to the event information.

8. The method as claimed in claim 5, wherein the at least one peripheral device comprises a PCI-E device, wherein the event information corresponds to the PCI-E device, and the event processing operation comprises:

performing a PCI-E reset.