Disclosure of Invention
The invention provides an ARM architecture server and a management method thereof, which can automatically perform repair when a baseboard management controller (BMC) detects that a component is abnormal, so that the server can continue to operate normally without interruption.
The invention provides an ARM architecture server, which comprises at least one peripheral device, a baseboard management controller and an ARM processor. The baseboard management controller is coupled to the at least one peripheral device, and is used for monitoring and judging whether the at least one peripheral device and the ARM processor are abnormal, and for generating event information corresponding to the ARM processor or one of the peripheral devices according to a judgment result. The ARM processor is coupled to the at least one peripheral device and the baseboard management controller, and includes ARM Trusted Firmware (ATF). The ARM trusted firmware is used for receiving the event information from the baseboard management controller and executing an event processing operation on the ARM processor or the peripheral device corresponding to the event information.
From another perspective, the present invention provides a management method for an ARM architecture server. The ARM architecture server comprises at least one peripheral device, a baseboard management controller and an ARM processor. The management method comprises the following steps: the baseboard management controller monitors and judges whether the at least one peripheral device and the ARM processor are abnormal; the baseboard management controller generates event information corresponding to the ARM processor or one of the peripheral devices according to the judgment result; the baseboard management controller transmits the event information to the ARM processor; and the ARM trusted firmware in the ARM processor executes an event processing operation on the ARM processor or the peripheral device corresponding to the event information.
In an embodiment of the invention, the event information corresponds to an ARM processor, and the event processing operation includes adjusting an operating frequency of the ARM processor.
In an embodiment of the invention, the peripheral device includes a memory device having at least two memory channels, the event information corresponds to one of the memory channels, and the event processing operation includes closing the memory channel corresponding to the event information.
In an embodiment of the invention, the peripheral device includes a PCI-E device, the event information corresponds to the PCI-E device, and the event processing operation includes performing a PCI-E reset.
In an embodiment of the invention, the ARM architecture server includes a plurality of exception levels, wherein an operating system of the ARM architecture server runs at a first exception level, and the ARM trusted firmware runs at a second exception level that is not lower than the first exception level.
Based on the above, in the ARM architecture server and the management method thereof provided by the embodiments of the present invention, the baseboard management controller notifies the ARM trusted firmware of the abnormal event, and the ARM trusted firmware directly processes the abnormal device. Therefore, the ARM architecture server can be repaired in time without the user installing a monitoring program in the operating system, and security is also taken into account.
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
Detailed Description
Fig. 1 is a schematic block diagram of an ARM architecture server according to an embodiment of the present invention. Referring to fig. 1, an ARM architecture server 100 according to an embodiment of the present invention includes a baseboard management controller (BMC) 110, at least one peripheral device 120, and an ARM processor 130, wherein the BMC 110 and the ARM processor 130 are each coupled to each peripheral device 120. In particular, the BMC 110 is also coupled to the ARM processor 130.
In one embodiment, the ARM architecture server 100 is, for example but not limited to, an ARMv8-A architecture, which includes a plurality of exception levels, where a higher exception level indicates a higher privilege. For example, ARM architecture server 100 includes four exception levels EL0 through EL3, where EL0 is the unprivileged level, EL1 is the operating system kernel level (OS kernel mode), EL2 is the hypervisor level (hypervisor mode), and EL3 is the TrustZone® monitor level (TrustZone® monitor mode).
The BMC 110 is connected to each peripheral device 120 through, for example, an Intelligent Platform Management Bus (IPMB) to monitor each peripheral device 120. In one embodiment, the peripheral device 120 includes a sensor for monitoring a fan speed or a processor temperature, a dual-channel Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), and a PCI-E Ethernet card, but the invention is not limited thereto. A person of ordinary skill in the art can obtain sufficient teaching from the prior art regarding how the BMC 110 monitors each peripheral device 120 of the server, and therefore the details are not described herein again.
ARM processor 130 is a processor designed on a Reduced Instruction Set Computer (RISC) architecture, such as an ARM Cortex-A, ARM Cortex-M, Cortex-A50 series, or Cortex-A73 processor, but the invention is not limited thereto.
In one embodiment, ARM processor 130 includes ARM Trusted Firmware (ATF) 131 to provide ATF services. It is worth mentioning that ARM trusted firmware 131 runs at an exception level not lower than that of the operating system. For example, the operating system of ARM architecture server 100 runs at a first exception level (e.g., EL1), and ARM trusted firmware 131 runs at a second exception level (e.g., EL3) that is not lower than the first exception level. Thus, ARM trusted firmware 131 may access the ARM processor 130 itself as well as all peripheral devices 120, whether external or not, over various interfaces (e.g., SATA, PCI-E, LAN, GPIO, SPI, or I2C). ARM trusted firmware 131 and the ATF services that it is capable of providing are well known in the art and will not be described herein in detail.
In particular, when the BMC 110 detects an abnormality of the ARM processor 130 itself or of a peripheral device 120, it notifies the ARM processor 130 of the abnormality. Since ARM trusted firmware 131 in ARM processor 130 runs at an exception level that is not lower than that of the operating system, ARM trusted firmware 131 is able to directly handle or repair the component of ARM architecture server 100 in which the abnormality occurs.
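The relationship between the exception levels described above can be illustrated with a minimal sketch. It only models the ordering of EL0 through EL3 and the condition that the firmware runs at a level not lower than the operating system; the names ExceptionLevel and runs_not_lower_than are hypothetical and are not part of any ARM or ATF interface.

```python
from enum import IntEnum


class ExceptionLevel(IntEnum):
    """ARMv8-A exception levels; a larger value means a higher privilege."""
    EL0 = 0  # unprivileged level
    EL1 = 1  # operating system kernel level
    EL2 = 2  # hypervisor level
    EL3 = 3  # TrustZone monitor level


def runs_not_lower_than(firmware_el, os_el):
    """Return True if the firmware level is not lower than the OS level."""
    return firmware_el >= os_el
```

With the example levels of this embodiment, runs_not_lower_than(ExceptionLevel.EL3, ExceptionLevel.EL1) evaluates to True.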
Fig. 2 is a flowchart illustrating a management method of an ARM architecture server according to an embodiment of the present invention. The management method in the embodiment of fig. 2 is applied to ARM architecture server 100 in fig. 1, and the detailed steps of the method in the embodiment of fig. 2 will be described below with reference to the components of ARM architecture server 100 in fig. 1.
Referring to fig. 2, in step S210, the BMC 110 monitors and determines whether at least one of the peripheral devices 120 and the ARM processor 130 is abnormal.
For example, the BMC 110 may monitor a temperature sensor in the ARM processor 130 to determine whether the ARM processor 130 is overheating; the BMC 110 may monitor the memory device in the ARM architecture server 100 to determine whether it is operating normally; or the BMC 110 may monitor whether the PCI-E devices on the PCI-E bus of the ARM architecture server 100 are functioning properly, but the invention is not limited thereto. A person skilled in the art can obtain sufficient teaching from existing knowledge of baseboard management controllers to define the abnormal state of each component and to determine whether each component is abnormal, and therefore the detailed description thereof is omitted here.
If the BMC 110 finds no abnormality, step S210 continues to be performed. Otherwise, if the BMC 110 determines that the ARM processor 130 or one of the peripheral devices 120 is abnormal, the process proceeds to step S220, in which the BMC 110 generates event information according to the determination result, and then to step S230, in which the BMC 110 transmits the event information to the ARM processor 130.
In detail, the event information generated by the BMC 110 according to the determination result corresponds to the ARM processor 130 or the peripheral device 120 in which the abnormality occurs. For example, when the BMC 110 determines that the ARM processor 130 is overheated, it generates event information indicating that the ARM processor 130 is overheated; when the BMC 110 determines that the memory device is not operating normally (e.g., the data contain too many bit errors for the error correction code mechanism to correct), it generates event information indicating that the memory device cannot operate normally.
In step S240, the ARM trusted firmware 131 in the ARM processor 130 receives the event information from the BMC 110 and performs an event processing operation on the ARM processor 130 or the peripheral device 120 corresponding to the event information, so that the operation of the ARM architecture server 100 is not interrupted.
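Steps S210 through S230 can be sketched as follows. This is only an illustration under stated assumptions: the EventInfo record, the reading keys cpu_temp_c and uncorrectable_errors, the target names, and the 95 °C threshold are all hypothetical and are not specified by this embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CPU_TEMP_LIMIT_C = 95.0  # assumed overheating threshold


@dataclass(frozen=True)
class EventInfo:
    target: str  # hypothetical component name, e.g. "cpu" or "mem_ch1"
    reason: str  # hypothetical abnormality code


def judge(readings: dict) -> Optional[EventInfo]:
    """S210/S220: return event information for the first abnormal component."""
    if readings.get("cpu_temp_c", 0.0) > CPU_TEMP_LIMIT_C:
        return EventInfo("cpu", "overheat")
    errors = readings.get("uncorrectable_errors", {})
    for channel in sorted(errors):
        if errors[channel] > 0:
            return EventInfo(f"mem_ch{channel}", "uncorrectable_ecc")
    return None  # nothing abnormal: the caller keeps monitoring


def monitor_step(readings: dict,
                 send: Callable[[EventInfo], None]) -> Optional[EventInfo]:
    """Run one monitoring pass and transmit any event information (S230)."""
    event = judge(readings)
    if event is not None:
        send(event)
    return event
```

Here send stands in for whatever channel carries the event information from the BMC to the ARM processor; the embodiment does not fix that channel, so it is left as a callback.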
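As an illustration of step S240, the sketch below selects an event processing operation according to the component named in the event information. The dictionary keys, the handler names, and the returned action strings are hypothetical; a real firmware handler would act on hardware rather than return a string.

```python
def handle_cpu(reason):
    # Corresponds to adjusting the operating frequency of the processor.
    return "adjust_cpu_frequency"


def handle_memory_channel(reason):
    # Corresponds to closing the memory channel named in the event.
    return "close_memory_channel"


def handle_pcie(reason):
    # Corresponds to performing a PCI-E reset.
    return "pcie_reset"


# Hypothetical mapping from component kind to event processing operation.
HANDLERS = {
    "cpu": handle_cpu,
    "memory_channel": handle_memory_channel,
    "pcie": handle_pcie,
}


def process_event(event):
    """Run the handler for the component named in the event information."""
    handler = HANDLERS.get(event["kind"])
    if handler is None:
        return "no_action"  # unknown component: leave the system untouched
    return handler(event.get("reason", ""))
```

A table of handlers keeps each operation independent, so a further component type could be supported by adding one entry.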
In one embodiment, the ARM processor 130 is connected to the peripheral device 120 (e.g., a temperature sensor), and the event information corresponds to the ARM processor 130, for example, indicating that the ARM processor 130 is overheated. After ARM trusted firmware 131 receives the event information, it performs an event processing operation on ARM processor 130. For example, ARM trusted firmware 131 may reduce the operating frequency of ARM processor 130, or adjust the throttle level of a CPU temperature regulator in ARM processor 130, to achieve a cooling effect. Thus, even if the ARM processor 130 is abnormal (overheated), the abnormality can still be handled in time by ARM trusted firmware 131, avoiding serious damage that could interrupt operation and cause the ARM architecture server 100 to stop service.
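The frequency reduction described above can be sketched as stepping down through a table of operating points. The table FREQ_STEPS_MHZ and the function next_frequency are assumptions made for illustration only; the embodiment does not specify particular frequencies or a stepping policy.

```python
FREQ_STEPS_MHZ = (2400, 2000, 1600, 1200)  # assumed operating points


def next_frequency(current_mhz, overheated):
    """Step down one operating point while overheated, else keep the frequency."""
    if not overheated:
        return current_mhz
    lower = [f for f in FREQ_STEPS_MHZ if f < current_mhz]
    # Stay at the lowest operating point if there is nothing lower.
    return max(lower) if lower else current_mhz
```

Calling next_frequency once per overheating event lowers the frequency gradually instead of dropping straight to the minimum, which is one possible policy among many.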
In one embodiment, ARM architecture server 100 includes a peripheral device 120 (e.g., a dual-channel DDR SDRAM), and the event information corresponds to one of the memory channels of the DDR SDRAM, for example, indicating that the memory of that memory channel is not functioning properly. After ARM trusted firmware 131 receives the event information, it performs an event processing operation on the memory channel corresponding to the event information. For example, ARM trusted firmware 131 may close the memory channel corresponding to the event information and keep the memory of the other memory channel in normal operation. As such, ARM architecture server 100 can continue to operate without interruption of service.
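Closing one memory channel while keeping the other in operation can be sketched as below. The function close_channel and the integer channel numbers are hypothetical, and the sketch only tracks which channels remain enabled; it does not model what happens to data already stored in the closed channel.

```python
def close_channel(active_channels, faulty):
    """Return the set of channels left enabled after closing the faulty one."""
    remaining = set(active_channels) - {faulty}
    if not remaining:
        # Closing the last working channel would leave the server without memory.
        raise RuntimeError("cannot close the last working memory channel")
    return remaining
```

For a dual-channel memory with channels 0 and 1, close_channel({0, 1}, 1) leaves channel 0 enabled.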
In one embodiment, ARM architecture server 100 includes peripheral devices 120 (e.g., PCI-E devices), and the event information corresponds to one of the PCI-E devices or endpoints (e.g., a PCI-E Ethernet card), for example, indicating that the PCI-E device is not operating properly. After ARM trusted firmware 131 receives the event information, it performs an event processing operation on the PCI-E device. For example, ARM trusted firmware 131 may perform a PCI-E reset operation in an attempt to repair the PCI-E device. Thus, the PCI-E reset operation can be performed without restarting the ARM architecture server 100, so as to repair the PCI-E device and restore its normal operation. A person skilled in the art can obtain sufficient teaching from the related art to accomplish the PCI-E reset operation described in this embodiment, and thus the detailed description thereof is omitted here.
In summary, the ARM architecture server and the management method thereof provided by the embodiments of the present invention utilize the ARM trusted firmware in the ARM processor to directly process or repair the abnormal component in the ARM architecture server, so as to prevent the server from being seriously damaged and losing data. In addition, the user does not need to install an additional monitoring program in the operating system, which avoids the risk of confidential data leaking through a backdoor hidden in the monitoring program and improves the security of the service provided by the server.
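The attempt to repair a PCI-E device by reset can be sketched with a simulated device. The class FakePcieDevice, the function try_repair, and the limit of three attempts are assumptions for illustration; the embodiment describes performing a PCI-E reset but does not specify a retry policy or a health check.

```python
class FakePcieDevice:
    """Simulated device that becomes healthy after a given number of resets."""

    def __init__(self, resets_needed):
        self._resets_needed = resets_needed
        self.reset_count = 0

    def reset(self):
        self.reset_count += 1

    def is_healthy(self):
        return self.reset_count >= self._resets_needed


def try_repair(device, max_attempts=3):
    """Reset the device until it reports healthy or the attempts run out."""
    for _ in range(max_attempts):
        if device.is_healthy():
            return True
        device.reset()
    return device.is_healthy()
```

Bounding the number of attempts keeps the handler from looping forever on a device that a reset cannot repair.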
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.