WO2023179684A1 - Central processor status monitoring method, device, equipment, and storage medium - Google Patents

Central processor status monitoring method, device, equipment, and storage medium

Info

Publication number
WO2023179684A1
Authority
WO
WIPO (PCT)
Prior art keywords
status information
temperature
status
central processor
processing unit
Prior art date
Application number
PCT/CN2023/083130
Other languages
English (en)
French (fr)
Inventor
梅飞
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023179684A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations

Definitions

  • The present application relates to the technical field of server management software, and in particular to a central processor status monitoring method, device, equipment, and storage medium.
  • CPU: Central Processing Unit
  • CPU Prochot: processor overheating
  • CPU Error: processor error
  • The CPU Prochot signal is triggered when the CPU temperature reaches a preset high-temperature threshold.
  • CPLD: Complex Programmable Logic Device
  • VR: Voltage Regulator
  • The purpose of this application is to provide a central processor status monitoring method, device, equipment, and storage medium that can accurately monitor the central processor status and accurately raise abnormal status alarms, which helps operation and maintenance personnel adjust the cooling strategy or troubleshoot faults in time.
  • The specific solution is as follows:
  • This application discloses a central processor status monitoring method, applied to a baseboard management controller, including:
  • reading, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in a preset register inside the central processor, and saving the current status information locally;
  • determining whether the current status information is consistent with the locally saved previous status information of the central processor;
  • if the current status information is inconsistent with the previous status information, raising the corresponding abnormal status alarm or clearing the abnormal status alarm according to preset abnormal status alarm rules.
  • In some embodiments, reading the current status information through the dedicated single-wire bus includes: reading, through a platform environment control interface over which a communication connection with the central processor has been established in advance, the current temperature status information of the central processor recorded in the preset register inside the central processor, and saving the current temperature status information locally.
  • In some embodiments, determining whether the current status information is consistent with the locally saved previous status information of the central processor includes:
  • if the current temperature status information is consistent with the locally saved previous temperature status information, neither raising nor clearing an abnormal status alarm, and jumping back to the step of reading the current temperature status information through the platform environment control interface and saving it locally.
  • In some embodiments, raising the corresponding abnormal status alarm according to the preset abnormal status alarm rules includes:
  • if the current temperature status information is inconsistent with the locally saved previous temperature status information and the current temperature status information indicates an abnormal temperature, triggering a temperature status abnormality reporting instruction, and, through the baseboard management controller, recording the alarm log generated by the abnormal temperature status and raising the corresponding temperature abnormal status alarm.
  • In some embodiments, clearing the abnormal status alarm according to the preset abnormal status alarm rules includes:
  • detecting and recording the server's system time each time the central processor is in an abnormal temperature state;
  • if the current temperature status information is inconsistent with the locally saved previous temperature status information and the current temperature status information indicates a normal temperature, calculating the time difference between the current system time of the server and the system time recorded for the last abnormal temperature status of the central processor;
  • deciding whether to clear the abnormal status alarm based on the time difference and on the central processor temperature status information detected by the temperature sensor built into the voltage regulator.
  • In some embodiments, this decision includes: when the time difference is less than the preset time difference, not clearing the abnormal status alarm, and jumping back to the step of reading the current temperature status information through the platform environment control interface and saving it locally.
  • In some embodiments, this decision includes: if the time difference is greater than the preset time difference and the temperature status information detected by the temperature sensor built into the voltage regulator indicates a normal temperature, recording, through the baseboard management controller, the log generated by the normal temperature status and clearing the abnormal status alarm.
  • This application discloses a central processor status monitoring device, which includes:
  • an information reading module, used to read, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in the preset register inside the central processor, and to save the current status information locally;
  • an information judgment module, used to judge whether the current status information is consistent with the locally saved previous status information of the central processor;
  • a status monitoring module, used to raise the corresponding abnormal status alarm or clear the abnormal status alarm according to the preset abnormal status alarm rules if the current status information is inconsistent with the previous status information.
  • This application discloses an electronic device, including:
  • a memory, used to store a computer program;
  • a processor, used to execute the computer program to implement the steps of the central processor status monitoring method disclosed above.
  • The present application discloses a non-volatile readable storage medium for storing a computer program; when the computer program is executed by a processor, the steps of the central processor status monitoring method disclosed above are implemented.
  • In summary, this application discloses a central processor status monitoring method, applied to a baseboard management controller, including: reading, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in the preset register inside the central processor, and saving the current status information locally; determining whether the current status information is consistent with the locally saved previous status information of the central processor; and, if the two are inconsistent, raising the corresponding abnormal status alarm or clearing the abnormal status alarm according to the preset abnormal status alarm rules.
  • Because this application obtains the current status information of the central processor directly through the dedicated single-wire bus, the status information it obtains is accurate. This helps maintain good central processor performance and extend its service life, and helps avoid problems such as server downtime caused by high central processor temperature, which brings tangible economic benefits. Raising or clearing the abnormal status alarm according to the preset abnormal status alarm rules also effectively prevents alarms from being cleared by mistake.
  • Figure 1 is a flow chart of a central processor status monitoring method disclosed in this application.
  • Figure 2 is a flow chart of a specific central processor status monitoring method disclosed in this application.
  • Figure 3 is a flow chart of a specific central processor status monitoring method disclosed in this application.
  • Figure 4 is a schematic structural diagram of a central processor status monitoring device disclosed in this application.
  • Figure 5 is a structural diagram of an electronic device disclosed in this application.
  • On the EGS platform, because the CPU Prochot pin is designed as a one-way input pin, the CPLD can only obtain the ambient temperature near the CPU as detected by the VR chip, and then decides whether to trigger the CPU Prochot signal based on that ambient temperature. Because the ambient temperature near the CPU detected by the VR chip lags behind the CPU core temperature, the BMC cannot obtain the CPU Prochot status through the CPLD in time and therefore cannot trigger the alarm promptly.
  • To address this, this application provides a central processor status monitoring solution that monitors the central processor status accurately and raises abnormal status alarms accurately, which helps operation and maintenance personnel adjust the cooling strategy or troubleshoot faults in time.
  • Referring to Figure 1, an embodiment of the present application discloses a central processor status monitoring method, applied to a baseboard management controller (BMC), specifically including:
  • Step S11: Read, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in the preset register inside the central processor, and save the current status information locally.
  • In some embodiments, the current temperature status information of the central processor recorded in the preset register inside the central processor is read through a platform environment control interface over which a communication connection with the central processor has been established in advance, and the current temperature status information is saved locally. It can be understood that the BMC periodically reads, through PECI, the value of bit 0 in the CPU Package Thermal Status register. Bit 0 of this register represents the CPU Prochot status: 1 means the CPU is in the Prochot state, and 0 means it is in the normal state. The temperature status information is then saved locally.
  • PECI: Platform Environment Control Interface
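  • As an illustration only (not part of the application), the bit test described in Step S11 can be sketched in Python. The function `read_package_thermal_status` is a hypothetical stand-in that returns a simulated register value; a real BMC would obtain the value through its own PECI driver, which is not specified here.

```python
PROCHOT_BIT = 0  # bit 0 of the Package Thermal Status register, per the description


def prochot_asserted(register_value: int) -> bool:
    """Return True when bit 0 of the register value is 1 (Prochot state)."""
    return bool((register_value >> PROCHOT_BIT) & 1)


def read_package_thermal_status() -> int:
    """Hypothetical stand-in for the PECI read performed by the BMC."""
    return 0x0000_0001  # simulated register value with bit 0 set


if __name__ == "__main__":
    raw = read_package_thermal_status()
    state = "Prochot" if prochot_asserted(raw) else "normal"
    print(f"register=0x{raw:08x} state={state}")
```

  • Masking a single bit keeps the check independent of whatever the other bits of the register carry.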
  • Step S12: Determine whether the current status information is consistent with the locally saved previous status information of the central processor.
  • In some embodiments, if the current temperature status information is consistent with the locally saved previous temperature status information of the central processor, neither an abnormal status alarm nor an alarm clearing is performed, and execution jumps back to the step of reading, through the platform environment control interface, the current temperature status information recorded in the preset register inside the central processor and saving it locally.
  • It can be understood that the current temperature status information is compared with the previous temperature status information saved locally. For example, the value of bit 0 in the CPU Package Thermal Status register is read, the previous temperature status information is fetched from local storage, and the bit 0 value read this time is compared with the bit 0 value read last time. If both values are 0, the comparison is consistent, indicating that the internal CPU temperature was normal on both reads, and the BMC does not need to report anything. If both values are 1, the comparison is also consistent, indicating that the internal CPU temperature was abnormal on both reads; the CPU is still in the temperature alarm state, its temperature status has not changed, and the BMC does not need to report anything.
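  • The comparison in Step S12 can be sketched as a loop that keeps the last bit 0 value locally and reports only the reads that differ from it. This is a minimal sketch under assumptions: the function name, the sample list, and the initial saved value of 0 are illustrative and do not come from the application.

```python
from typing import Iterable, List, Tuple


def find_state_changes(samples: Iterable[int], initial: int = 0) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) for each sample that differs from the saved value."""
    changes = []
    last_value = initial  # locally saved previous status (assumed to start as normal)
    for index, now_value in enumerate(samples):
        if now_value != last_value:
            # Inconsistent with the saved value: Step S13 would handle this read.
            changes.append((index, last_value, now_value))
        # Consistent reads need no report; either way the current value is saved locally.
        last_value = now_value
    return changes


if __name__ == "__main__":
    for change in find_state_changes([0, 0, 1, 1, 0]):
        print(change)
```

  • Reads of 0 after 0, or 1 after 1, produce no entry, matching the "BMC does not need to report" cases in the text.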
  • Step S13: If the current status information is inconsistent with the previous status information, raise the corresponding abnormal status alarm or clear the abnormal status alarm according to the preset abnormal status alarm rules.
  • In some embodiments, if the current temperature status information is inconsistent with the locally saved previous temperature status information and the current temperature status information indicates an abnormal temperature, the temperature status abnormality reporting instruction is triggered, and the baseboard management controller records the alarm log generated by the temperature abnormality and raises the corresponding temperature abnormality alarm. It can be understood that if the value of bit 0 in the CPU Package Thermal Status register is currently 1 and the locally saved previous value of bit 0 is 0, the previously detected CPU temperature status was normal and the currently detected status is abnormal: the two readings are inconsistent and the current temperature status is abnormal. Note that once the CPU has completed a full Prochot trigger/release cycle, the temperature status abnormality reporting instruction is triggered, and the BMC needs to record the alarm log and raise the corresponding abnormal status alarm.
  • In some embodiments, if the current temperature status information is inconsistent with the locally saved previous temperature status information and the current temperature status information indicates a normal temperature, the abnormal status alarm is cleared according to the abnormal status alarm rules. It can be understood that if the value of bit 0 in the CPU Package Thermal Status register is currently 0 and the locally saved previous value of bit 0 is 1, the previously detected CPU temperature status was abnormal and the currently detected status is normal: the two readings are inconsistent and the current temperature status is normal. Because bit 0 of the CPU Package Thermal Status register may be oscillating at this point, the abnormal status alarm cannot be cleared immediately; whether to clear it must be further determined from the abnormal status alarm rules.
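  • The two inconsistent cases in Step S13 can be summarised as a small transition table. The sketch below is illustrative only; the `Action` names are hypothetical labels, not terms from the application.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                      # readings consistent, nothing to report
    RAISE_ALARM = "raise_alarm"        # 0 -> 1: log and raise the temperature alarm
    CONSIDER_CLEAR = "consider_clear"  # 1 -> 0: apply the alarm rules before clearing


def classify(last_value: int, now_value: int) -> Action:
    """Map the saved and current bit 0 values to the action described in Step S13."""
    if now_value == last_value:
        return Action.NONE
    if now_value == 1:
        return Action.RAISE_ALARM
    # A 1 -> 0 change is only a candidate: bit 0 may still be oscillating.
    return Action.CONSIDER_CLEAR


if __name__ == "__main__":
    for last, now in [(0, 0), (0, 1), (1, 1), (1, 0)]:
        print(last, now, classify(last, now).value)
```

  • Keeping "consider clear" separate from an unconditional clear reflects the point that a 1 to 0 change alone is not sufficient to release the alarm.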
  • The CPU Error status can include, but is not limited to: IERR (internal error), Processor Disabled (processor damaged), UCE (Uncorrectable Machine Check Exception, an unrecoverable processor error), and CE (Correctable Machine Check Error, a recoverable processor error).
  • An embodiment of the present application discloses a specific central processor status monitoring method, which specifically includes:
  • Step S21: Read, through the platform environment control interface over which a communication connection with the central processor has been established in advance, the current temperature status information of the central processor recorded in the preset register inside the central processor, and save the current temperature status information locally.
  • In some embodiments, the current temperature status information of the central processor recorded in the preset register inside the central processor is read through PECI and saved locally. It can be understood that reading directly through PECI yields the temperature status recorded in the preset register inside the CPU, rather than the ambient temperature near the CPU detected by the VR chip and obtained through the CPLD. The BMC can therefore obtain the CPU Prochot status promptly and accurately through PECI.
  • Step S22: Determine whether the current temperature status information is consistent with the locally saved previous temperature status information of the central processor.
  • Step S23: Detect and record the system time of the server each time the central processor is in an abnormal temperature state; if the current temperature status information is inconsistent with the locally saved previous temperature status information of the central processor and the current temperature status information indicates a normal temperature, calculate the time difference between the current system time of the server and the system time of the server recorded for the last abnormal temperature status of the central processor.
  • In some embodiments, the system time of the server is detected and recorded each time the central processor is in an abnormal temperature state; for example, when the value of bit 0 of the CPU Package Thermal Status register is detected to be 1, the system time of the server is recorded and saved. Suppose the current temperature status information is inconsistent with the locally saved previous temperature status information, i.e. now_value ≠ last_value, and the value of bit 0 of the CPU Package Thermal Status register is detected to be 0, i.e. the current temperature status is normal. Because the CPU core temperature may have only just risen to the Prochot threshold, bit 0 may be oscillating, with its value jumping repeatedly between 0 and 1. To avoid clearing the abnormal status alarm by mistake, the time difference between the current system time of the server and the system time recorded for the last abnormal temperature status of the CPU is calculated and used to decide whether to clear the alarm.
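  • The bookkeeping in Step S23 can be sketched as follows. This is an illustrative sketch, not the application's implementation: the class name and the use of plain floating-point seconds for the server system time are assumptions; `time_now` and `time_last` mirror the names used in the text.

```python
from typing import Optional


class AbnormalTimeTracker:
    """Remember when bit 0 was last 1 and report the elapsed time on a 1 -> 0 change."""

    def __init__(self) -> None:
        self.time_last: Optional[float] = None  # system time of the last abnormal read
        self.last_value = 0                     # locally saved previous bit 0 value

    def update(self, now_value: int, time_now: float) -> Optional[float]:
        """Return time_now - time_last on a 1 -> 0 transition, otherwise None."""
        elapsed = None
        if now_value == 1:
            # Record the system time every time the CPU is in the abnormal state.
            self.time_last = time_now
        elif self.last_value == 1 and self.time_last is not None:
            # now_value != last_value and the current state is normal.
            elapsed = time_now - self.time_last
        self.last_value = now_value
        return elapsed


if __name__ == "__main__":
    tracker = AbnormalTimeTracker()
    print(tracker.update(1, 100.0))
    print(tracker.update(0, 105.0))
```

  • The returned time difference is the input to the clearing decision in the next step; the tracker itself never clears anything.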
  • Step S24: Decide whether to clear the abnormal status alarm based on the time difference and on the central processor temperature status information detected by the temperature sensor built into the voltage regulator.
  • In some embodiments, the difference between the current system time of the server and the last recorded time at which bit 0 of the CPU Package Thermal Status register was 1 is compared with the preset time difference. When the time difference is less than the preset time difference, the abnormal status alarm is not cleared, and execution jumps back to the step of reading, through the platform environment control interface, the current temperature status information recorded in the preset register inside the central processor and saving it locally. It can be understood that, with a preset time difference of 20 s, time_now - time_last < 20 s means that bit 0 of the CPU Package Thermal Status register is still oscillating, so the abnormal status alarm cannot be cleared and execution jumps back to the PECI read step.
  • In some embodiments, if the time difference is greater than the preset time difference and the temperature status information detected by the temperature sensor built into the voltage regulator indicates a normal temperature, the baseboard management controller records the log generated by the normal temperature status and clears the abnormal status alarm. It can be understood that, with a preset time difference of 20 s, a time difference of 26 s gives time_now - time_last > 20 s, which is greater than the preset time difference. At this point the ambient temperature state near the central processor must be checked using the temperature sensor built into the VR chip.
  • If the VR chip detects that the ambient temperature near the central processor is also normal, the log generated by the normal temperature state is recorded through the BMC and the abnormal status alarm is cleared; if the VR chip detects that the ambient temperature near the central processor is abnormal, the abnormal status alarm is not cleared.
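  • The clearing rule in Step S24 combines two conditions, which the following sketch makes explicit. It is a minimal illustration under assumptions: the function and parameter names are hypothetical, and the text does not say what happens when the time difference equals the preset value exactly, so this sketch treats that case as "do not clear".

```python
PRESET_TIME_DIFF_S = 20.0  # preset time difference used in the description's example


def should_clear_alarm(time_now: float, time_last: float,
                       vr_temperature_normal: bool,
                       preset: float = PRESET_TIME_DIFF_S) -> bool:
    """Decide whether the abnormal status alarm may be cleared."""
    if time_now - time_last <= preset:
        # Bit 0 may still be oscillating: keep the alarm and read again.
        # (Treating the exactly-equal case as "keep" is an assumption.)
        return False
    # Long enough since the last abnormal read: the VR sensor must also agree.
    return vr_temperature_normal


if __name__ == "__main__":
    print(should_clear_alarm(126.0, 100.0, vr_temperature_normal=True))
    print(should_clear_alarm(110.0, 100.0, vr_temperature_normal=True))
    print(should_clear_alarm(126.0, 100.0, vr_temperature_normal=False))
```

  • Requiring both the elapsed time and the VR reading guards against clearing the alarm while the register bit is still flipping between 0 and 1.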
  • The embodiment of this application reads the value of bit 0 of the CPU Package Thermal Status register through PECI, so that the internal Prochot status of the CPU can be monitored accurately and in real time.
  • This method solves the problem that the BMC on the EGS platform cannot monitor high-temperature alarms for the server CPU core temperature. It is faster than the original approach of reading the Prochot pin signal passed through the CPLD, and raising or clearing abnormal status alarms promptly and accurately lets the administrator be notified sooner, which helps operation and maintenance personnel adjust cooling strategies or troubleshoot faults in time. It also helps maintain good CPU performance and extend service life, and helps avoid problems such as server downtime caused by high CPU temperature, which brings tangible economic benefits.
  • An embodiment of the present application also discloses a central processor status monitoring device, which includes:
  • the information reading module 11, used to read, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in the preset register inside the central processor, and to save the current status information locally;
  • the information judgment module 12, used to judge whether the current status information is consistent with the locally saved previous status information of the central processor;
  • the status monitoring module 13, used to raise the corresponding abnormal status alarm or clear the abnormal status alarm according to the preset abnormal status alarm rules if the current status information is inconsistent with the previous status information.
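  • One way to picture how modules 11, 12, and 13 cooperate is a single polling object whose method performs the three roles in order. This is a hedged sketch only: the class, the injected `read_status` callable, and the event strings are hypothetical, and the clearing rules (time difference and VR check) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CpuStatusMonitor:
    read_status: Callable[[], int]  # hypothetical reader standing in for the bus read
    last_status: int = 0            # locally saved previous status
    events: List[str] = field(default_factory=list)

    def poll_once(self) -> None:
        current = self.read_status()           # role of information reading module 11
        changed = current != self.last_status  # role of information judgment module 12
        self.last_status = current             # save the current status locally
        if changed:                            # role of status monitoring module 13
            self.events.append("alarm" if current == 1 else "clear-candidate")


if __name__ == "__main__":
    samples = iter([0, 1, 1, 0])
    monitor = CpuStatusMonitor(read_status=lambda: next(samples))
    for _ in range(4):
        monitor.poll_once()
    print(monitor.events)
```

  • Injecting the reader as a callable keeps the sketch testable without hardware; a real device would bind it to the actual bus access.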
  • Figure 5 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application; the content of the figure shall not be considered as any limitation on the scope of use of the present application.
  • The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26.
  • The memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the central processor status monitoring method disclosed in any of the foregoing embodiments.
  • The electronic device 20 in this embodiment may specifically be an electronic computer.
  • The power supply 23 is used to provide working voltage for each hardware device on the electronic device 20. The communication interface 24 can create a data transmission channel between the electronic device 20 and external devices; the communication protocol it follows may be any communication protocol applicable to the technical solution of this application and is not specifically limited here. The input-output interface 25 is used to obtain external input data or to output data externally; its specific interface type can be selected according to the needs of the application and is not specifically limited here.
  • The memory 22, as a carrier for resource storage, can be a read-only memory, a random access memory, a magnetic disk, an optical disk, etc. The resources stored on it can include an operating system 221, a computer program 222, etc., and the storage can be temporary or permanent.
  • The operating system 221 is used to manage and control each hardware device and the computer program 222 on the electronic device 20, and can be Windows Server, Netware, Unix, Linux, etc.
  • The computer program 222 may further include computer programs that can be used to complete other specific tasks.
  • This application also discloses a non-volatile readable storage medium for storing a computer program; when the computer program is executed by a processor, the central processor status monitoring method disclosed above is implemented. For the specific steps of this method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
  • Software modules may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

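This application discloses a central processor status monitoring method, device, equipment, and storage medium, including: reading, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in a preset register inside the central processor, and saving the current status information locally; determining whether the current status information is consistent with the locally saved previous status information of the central processor; and, if the current status information is inconsistent with the previous status information, raising the corresponding abnormal status alarm or clearing the abnormal status alarm according to preset abnormal status alarm rules. With this application, accurate current status information of the central processor can be obtained and reported to the administrator in time, which helps maintain good central processor performance and extend its service life, helps avoid problems such as server downtime caused by high central processor temperature, and effectively prevents alarms from being cleared by mistake.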

Description

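Central processor status monitoring method, device, equipment, and storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the China Patent Office on March 25, 2022, with application number 202210302352.7 and titled "Central processor status monitoring method, device, equipment, and storage medium", the entire contents of which are incorporated into this application by reference.
Technical field
This application relates to the technical field of server management software, and in particular to a central processor status monitoring method, device, equipment, and storage medium.
Background
At present, the CPU (Central Processing Unit) is the core computing and control component of a server system, and its status must be monitored during use to prevent processor overheating (CPU Prochot) or processor errors (CPU Error). The CPU Prochot signal is triggered when the CPU temperature reaches a preset high-temperature threshold.
Currently, on the EGS (Eagle Stream) platform, because the CPU Prochot pin is designed as a one-way input pin, the CPLD (Complex Programmable Logic Device) can only obtain the ambient temperature near the CPU as detected by the VR (Voltage Regulator) chip, and then decides whether to trigger the CPU Prochot signal based on that ambient temperature. Because the ambient temperature near the CPU detected by the VR chip lags behind the CPU core temperature, the BMC (Baseboard Management Controller) cannot obtain the CPU Prochot status through the CPLD in time and therefore cannot trigger the alarm promptly.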
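Summary of the invention
In view of this, the purpose of this application is to provide a central processor status monitoring method, device, equipment, and storage medium that can accurately monitor the central processor status and accurately raise abnormal status alarms, which helps operation and maintenance personnel adjust the cooling strategy or troubleshoot faults in time. The specific solution is as follows:
This application discloses a central processor status monitoring method, applied to a baseboard management controller, including:
reading, through a dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in a preset register inside the central processor, and saving the current status information locally;
determining whether the current status information is consistent with the locally saved previous status information of the central processor;
if the current status information is inconsistent with the previous status information, raising the corresponding abnormal status alarm or clearing the abnormal status alarm according to preset abnormal status alarm rules.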
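In some embodiments of this application, reading, through the dedicated single-wire bus over which a communication connection with the central processor has been established in advance, the current status information of the central processor recorded in the preset register inside the central processor, and saving the current status information locally, includes:
reading, through a platform environment control interface over which a communication connection with the central processor has been established in advance, the current temperature status information of the central processor recorded in the preset register inside the central processor, and saving the current temperature status information locally.
In some embodiments of this application, determining whether the current status information is consistent with the locally saved previous status information of the central processor includes:
if the current temperature status information is consistent with the locally saved previous temperature status information of the central processor, neither raising nor clearing an abnormal status alarm, and jumping back to the step of reading, through the platform environment control interface over which a communication connection with the central processor has been established in advance, the current temperature status information of the central processor recorded in the preset register inside the central processor, and saving the current temperature status information locally.
In some embodiments of this application, if the current status information is inconsistent with the locally saved previous status information of the central processor, raising the corresponding abnormal status alarm according to the preset abnormal status alarm rules includes:
if the current temperature status information is inconsistent with the locally saved previous temperature status information of the central processor and the current temperature status information indicates an abnormal temperature, triggering a temperature status abnormality reporting instruction, and, through the baseboard management controller, recording the alarm log generated by the abnormal temperature status and raising the corresponding temperature abnormal status alarm.
In some embodiments of this application, if the current status information is inconsistent with the locally saved previous status information of the central processor, clearing the abnormal status alarm according to the preset abnormal status alarm rules includes:
detecting and recording the system time of the server each time the central processor is in an abnormal temperature state;
if the current temperature status information is inconsistent with the locally saved previous temperature status information of the central processor and the current temperature status information indicates a normal temperature, calculating the time difference between the current system time of the server and the system time of the server recorded for the last abnormal temperature status of the central processor;
deciding whether to clear the abnormal status alarm based on the time difference and on the central processor temperature status information detected by the temperature sensor built into the voltage regulator.
In some embodiments of this application, deciding whether to clear the abnormal status alarm based on the time difference and on the central processor temperature status information detected by the temperature sensor built into the voltage regulator includes:
when the time difference is less than the preset time difference, not clearing the abnormal status alarm, and jumping back to the step of reading, through the platform environment control interface over which a communication connection with the central processor has been established in advance, the current temperature status information of the central processor recorded in the preset register inside the central processor, and saving the current temperature status information locally.
In some embodiments of this application, deciding whether to clear the abnormal status alarm based on the time difference and on the central processor temperature status information detected by the temperature sensor built into the voltage regulator includes:
if the time difference is greater than the preset time difference and the central processor temperature status information detected by the temperature sensor built into the voltage regulator indicates a normal temperature, recording, through the baseboard management controller, the log generated by the normal temperature status and clearing the abnormal status alarm.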
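The present application discloses a CPU status monitoring apparatus, including:
an information reading module, configured to read, through a dedicated single-wire bus over which a communication connection with the CPU has been established in advance, the current status information of the CPU recorded in a preset register inside the CPU, and to save the current status information locally;
an information judging module, configured to determine whether the current status information is consistent with the locally saved previous status information of the CPU;
a status monitoring module, configured to raise a corresponding abnormal-status alarm or clear the abnormal-status alarm according to a preset abnormal-status alarm rule if the current status information is inconsistent with the previous status information.
The present application discloses an electronic device, including:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the steps of the CPU status monitoring method disclosed above.
The present application discloses a non-volatile readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the steps of the CPU status monitoring method disclosed above.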
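It can be seen that the present application discloses a CPU status monitoring method, applied to a baseboard management controller, including: reading, through a dedicated single-wire bus over which a communication connection with the CPU has been established in advance, the current status information of the CPU recorded in a preset register inside the CPU, and saving the current status information locally; determining whether the current status information is consistent with the locally saved previous status information of the CPU; and, if the two are inconsistent, raising a corresponding abnormal-status alarm or clearing the abnormal-status alarm according to a preset abnormal-status alarm rule. The present application thus obtains the current status information of the CPU directly over the dedicated single-wire bus, which yields accurate status information, helps maintain good CPU performance and extend its service life, and as far as possible avoids problems such as server downtime caused by CPU overheating, bringing considerable economic benefits. Raising or clearing the abnormal-status alarm according to the preset abnormal-status alarm rule then effectively prevents alarms from being cleared by mistake.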
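Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a CPU status monitoring method disclosed in the present application;
FIG. 2 is a flowchart of a specific CPU status monitoring method disclosed in the present application;
FIG. 3 is a flowchart of a specific CPU status monitoring method disclosed in the present application;
FIG. 4 is a schematic structural diagram of a CPU status monitoring apparatus disclosed in the present application;
FIG. 5 is a structural diagram of an electronic device disclosed in the present application.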
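Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
At present, on the EGS platform, the CPU Prochot pin is designed as a unidirectional input pin, so the CPLD can only obtain the ambient temperature near the CPU as detected by the VR chip and decide, based on that ambient temperature, whether to trigger the CPU Prochot signal. Because the ambient temperature detected by the VR chip lags behind the CPU core temperature, the BMC cannot obtain the CPU Prochot status through the CPLD in time and therefore cannot raise an alarm promptly.
To this end, the present application provides a CPU status monitoring solution that monitors CPU status accurately and raises abnormal-status alarms accurately, which in turn helps operation and maintenance personnel adjust the cooling strategy or troubleshoot faults in time.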
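Referring to FIG. 1, an embodiment of the present application discloses a CPU status monitoring method, applied to a baseboard management controller, which specifically includes:
Step S11: read, through a dedicated single-wire bus over which a communication connection with the CPU has been established in advance, the current status information of the CPU recorded in a preset register inside the CPU, and save the current status information locally.
In some embodiments of the present application, the current temperature status information of the CPU recorded in the preset register inside the CPU is read through a Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, and is saved locally. Specifically, the BMC periodically reads, through PECI (Platform Environment Control Interface), the value of bit 0 of the CPU Package Thermal Status register. Bit 0 of this register indicates the CPU Prochot status: 1 means the CPU is in the Prochot state and 0 means it is in the normal state. The BMC saves this temperature status information locally.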
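The bit test in step S11 can be sketched as follows. This is a minimal Python illustration that assumes the raw register value has already been fetched over PECI by some other routine; the function name `prochot_asserted` and the sample values are hypothetical and do not come from the application.

```python
PROCHOT_BIT = 0  # bit 0 of the Package Thermal Status register, per the description


def prochot_asserted(register_value: int) -> bool:
    """Return True when bit 0 is 1 (Prochot state), False when it is 0 (normal)."""
    return ((register_value >> PROCHOT_BIT) & 1) == 1


# Hypothetical raw values: only bit 0 matters for the Prochot status.
print(prochot_asserted(0b0001))  # bit 0 set
print(prochot_asserted(0b0010))  # bit 0 clear
```

Masking a single bit keeps the stored status to one value per poll, which is all the later comparison step needs.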
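Step S12: determine whether the current status information is consistent with the locally saved previous status information of the CPU.
In some embodiments of the present application, it is determined whether the current status information is consistent with the locally saved previous status information of the CPU. If the current temperature status information is consistent with the locally saved previous temperature status information, no abnormal-status alarm is raised or cleared, and the process returns to the step of reading the current temperature status information through the Platform Environment Control Interface and saving it locally. Specifically, the current temperature status information just read is compared with the locally saved previous temperature status information. For example, the BMC reads the current value of bit 0 of the CPU Package Thermal Status register, retrieves the previous temperature status information from local storage, and compares the bit 0 value read this time with the bit 0 value read last time. If the current value is 0 and the previous value is 0, the results are consistent, which indicates that the CPU internal temperature was normal at both reads, and the BMC does not need to report anything. If the current value is 1 and the previous value is 1, the results are consistent, which indicates that the CPU internal temperature was abnormal at both reads; the temperature alarm is still in effect and the CPU temperature status has not changed, so the BMC does not need to report anything.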
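Step S13: if the current status information is inconsistent with the previous status information, raise a corresponding abnormal-status alarm or clear the abnormal-status alarm according to a preset abnormal-status alarm rule.
In some embodiments of the present application, if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates an abnormal temperature, a temperature-abnormality report instruction is triggered, and the baseboard management controller records an alarm log for the abnormal temperature status and raises the corresponding temperature-abnormality alarm. Specifically, if the current value of bit 0 of the CPU Package Thermal Status register is 1 and the locally saved previous value of bit 0 is 0, the CPU temperature status was normal at the previous read and is abnormal at the current read. The two readings are inconsistent and the current status is abnormal. This indicates that the CPU's previous complete Prochot trigger/release cycle has ended, so the temperature-abnormality report instruction is triggered, and the BMC records an alarm log and raises the corresponding abnormal-status alarm.
In some embodiments of the present application, if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates a normal temperature, the abnormal-status alarm is cleared according to the abnormal-status alarm rule. Specifically, if the current value of bit 0 of the CPU Package Thermal Status register is 0 and the locally saved previous value of bit 0 is 1, the CPU temperature status was abnormal at the previous read and is normal at the current read. The two readings are inconsistent and the current status is normal. Because bit 0 of the CPU Package Thermal Status register is in an oscillating state at this point, the abnormal-status alarm cannot be cleared immediately; whether to clear it must be further determined according to the abnormal-status alarm rule.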
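The four combinations of previous and current bit 0 values described in steps S12 and S13 can be summarized in a small classifier. The sketch below is illustrative only; the `Action` names and the `classify` function are hypothetical and chosen for this example.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                # readings match: nothing to report
    RAISE_ALARM = "raise_alarm"  # 0 -> 1: log and raise the alarm
    CHECK_CLEAR = "check_clear"  # 1 -> 0: apply the clearing rule before clearing


def classify(previous_bit: int, current_bit: int) -> Action:
    """Map a (previous, current) pair of bit 0 readings to the next action."""
    if current_bit == previous_bit:
        return Action.NONE
    if current_bit == 1:
        return Action.RAISE_ALARM
    return Action.CHECK_CLEAR


for prev, cur in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(prev, cur, classify(prev, cur).value)
```

Note that the 1 to 0 transition does not clear the alarm directly; it only hands the decision to the alarm rule, matching the oscillation concern stated above.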
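Further, the present application can also monitor the CPU ERROR status. The CPU ERROR status may specifically include, but is not limited to: IERR (internal error), Processor Disabled (processor damaged), UCE (Uncorrectable Machine Check Exception, an unrecoverable processor error), and CE (Correctable Machine Check Error, a recoverable processor error).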
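For reference, the error categories listed above could be represented as an enumeration. This is only a naming sketch; the class name `CpuErrorState` is hypothetical, and the application does not describe how these errors are detected or encoded.

```python
from enum import Enum


class CpuErrorState(Enum):
    # The source lists these as examples and says the set is not limited to them.
    IERR = "internal error"
    PROCESSOR_DISABLED = "processor disabled"
    UCE = "uncorrectable machine check exception"
    CE = "correctable machine check error"


for state in CpuErrorState:
    print(state.name, "-", state.value)
```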
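It can be seen that the present application discloses a CPU status monitoring method, applied to a baseboard management controller, including: reading, through a dedicated single-wire bus over which a communication connection with the CPU has been established in advance, the current status information of the CPU recorded in a preset register inside the CPU, and saving the current status information locally; determining whether the current status information is consistent with the locally saved previous status information of the CPU; and, if the two are inconsistent, raising a corresponding abnormal-status alarm or clearing the abnormal-status alarm according to a preset abnormal-status alarm rule. The present application thus obtains the current status information of the CPU directly over the dedicated single-wire bus, which yields accurate status information, helps maintain good CPU performance and extend its service life, and as far as possible avoids problems such as server downtime caused by CPU overheating, bringing considerable economic benefits. Raising or clearing the abnormal-status alarm according to the preset abnormal-status alarm rule then effectively prevents alarms from being cleared by mistake.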
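Referring to FIG. 2 and FIG. 3, an embodiment of the present application discloses a specific CPU status monitoring method. Specifically:
Step S21: read, through a Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, the current temperature status information of the CPU recorded in a preset register inside the CPU, and save the current temperature status information locally.
In some embodiments of the present application, the current temperature status information of the CPU recorded in the preset register inside the CPU is read through PECI and saved locally. Because PECI reads the CPU's internal temperature status directly from the preset register inside the CPU, rather than relying on the ambient temperature near the CPU that the VR chip detects and the CPLD relays, the BMC can obtain the CPU Prochot status promptly and accurately.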
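Step S22: determine whether the current temperature status information is consistent with the locally saved previous temperature status information of the CPU.
Step S23: detect and record the server system time each time the CPU is in the abnormal temperature state; if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates a normal temperature, calculate the time difference between the current server system time and the server system time recorded for the CPU's most recent abnormal temperature status.
In some embodiments of the present application, the server system time is detected and recorded each time the CPU is in the abnormal temperature state. For example, when the value of bit 0 of the CPU Package Thermal Status register is detected to be 1, the server system time is recorded and saved. If the current temperature status information is inconsistent with the locally saved previous temperature status information (for example, now_value ≠ last_value) and bit 0 of the CPU Package Thermal Status register is detected to be 0, that is, the current temperature status information indicates a normal temperature, then bit 0 is in an oscillating state because the CPU core temperature has only just risen to the Prochot threshold, meaning that its value flips repeatedly between 0 and 1. To avoid clearing the abnormal-status alarm by mistake, the time difference between the current server system time and the server system time of the CPU's most recent abnormal temperature status is calculated and used to decide whether to clear the alarm.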
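The timestamp bookkeeping in step S23 can be sketched as below. The class `AbnormalTimeRecorder` and its method names are hypothetical; times are passed in as plain seconds so the example does not depend on a real clock.

```python
from typing import Optional


class AbnormalTimeRecorder:
    """Remembers the system time of the most recent abnormal (bit 0 == 1) reading."""

    def __init__(self) -> None:
        self.last_abnormal: Optional[float] = None

    def observe(self, bit: int, now: float) -> None:
        # Record the time every time the CPU is seen in the abnormal state.
        if bit == 1:
            self.last_abnormal = now

    def elapsed_since_abnormal(self, now: float) -> Optional[float]:
        # Time difference used by the clearing rule; None if never abnormal.
        if self.last_abnormal is None:
            return None
        return now - self.last_abnormal


recorder = AbnormalTimeRecorder()
recorder.observe(1, now=100.0)   # abnormal reading at t = 100 s
recorder.observe(0, now=113.0)   # normal reading at t = 113 s
print(recorder.elapsed_since_abnormal(113.0))
```

Because every abnormal reading refreshes the timestamp, an oscillating bit keeps the elapsed time short, which is what lets the preset time difference act as a debounce window.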
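Step S24: decide whether to clear the abnormal-status alarm according to the time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator.
In some embodiments of the present application, the time difference between the current server system time and the recorded time at which bit 0 of the CPU Package Thermal Status register was last 1 is compared with the preset time difference. In some embodiments, when the time difference is less than the preset time difference, the abnormal-status alarm is not cleared, and the process returns to the step of reading the current temperature status information of the CPU through the Platform Environment Control Interface and saving it locally. For example, with a preset time difference of 20 s, a time difference of 13 s satisfies time_now - time_last < 20 s. This indicates that bit 0 of the CPU Package Thermal Status register is still oscillating, so the abnormal-status alarm cannot be cleared, and the process returns to the reading step.
In some embodiments of the present application, if the time difference is greater than the preset time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator indicates a normal temperature, the baseboard management controller records a log for the normal temperature status and clears the abnormal-status alarm. For example, with a preset time difference of 20 s, a time difference of 26 s satisfies time_now - time_last > 20 s. In this case the ambient temperature status near the CPU, as detected by the temperature sensor built into the VR chip, is used as an auxiliary check. If the VR chip detects that the ambient temperature near the CPU is also normal, the BMC records a log for the normal temperature status and clears the abnormal-status alarm; if the VR chip detects that the ambient temperature near the CPU is abnormal, the abnormal-status alarm is not cleared.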
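The clearing rule in step S24 reduces to a two-condition check, sketched below with the 20 s example value from the description. The function name `should_clear_alarm` is hypothetical. The source does not say what happens when the time difference exactly equals the preset value; this sketch assumes the alarm is kept in that case.

```python
PRESET_TIME_DIFF_S = 20.0  # example value used in the description


def should_clear_alarm(time_diff_s: float,
                       vr_temperature_normal: bool,
                       preset_s: float = PRESET_TIME_DIFF_S) -> bool:
    """Decide whether the abnormal-status alarm may be cleared."""
    # Within the preset window the bit is treated as still oscillating.
    # Assumption: an exactly equal time difference also keeps the alarm.
    if time_diff_s <= preset_s:
        return False
    # Past the window, the VR chip's ambient reading is the auxiliary check.
    return vr_temperature_normal


print(should_clear_alarm(13.0, vr_temperature_normal=True))   # inside the window
print(should_clear_alarm(26.0, vr_temperature_normal=True))   # past it, VR normal
print(should_clear_alarm(26.0, vr_temperature_normal=False))  # past it, VR abnormal
```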
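It can be seen that, by reading the value of bit 0 of the CPU Package Thermal Status register through PECI, the embodiments of the present application can monitor the CPU's internal Prochot status accurately and in real time. This solves the problem that the BMC on the EGS platform cannot monitor high-temperature alarms for the server CPU core temperature, and it is faster than the original approach of reading the Prochot pin signal relayed through the CPLD. Raising abnormal-status alarms, and clearing them promptly and accurately, allows administrators to be notified sooner, which helps operation and maintenance personnel adjust the cooling strategy or troubleshoot faults in time, helps maintain good CPU performance and extend its service life, and as far as possible avoids problems such as server downtime caused by CPU overheating, bringing considerable economic benefits.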
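Referring to FIG. 4, an embodiment of the present application discloses a CPU status monitoring apparatus, including:
an information reading module 11, configured to read, through a dedicated single-wire bus over which a communication connection with the CPU has been established in advance, the current status information of the CPU recorded in a preset register inside the CPU, and to save the current status information locally;
an information judging module 12, configured to determine whether the current status information is consistent with the locally saved previous status information of the CPU;
a status monitoring module 13, configured to raise a corresponding abnormal-status alarm or clear the abnormal-status alarm according to a preset abnormal-status alarm rule if the current status information is inconsistent with the previous status information.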
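One way to picture how the three modules cooperate is the sketch below. All class and function names are hypothetical, the bus read and the clearing rule are injected as plain callables rather than real PECI or VR accesses, and the stored status is assumed to start as normal (0), which the source does not specify.

```python
from typing import Callable, List


class InfoReader:
    """Information reading module: fetches the status and keeps the last one."""

    def __init__(self, read_status: Callable[[], int]) -> None:
        self._read_status = read_status  # hypothetical stand-in for the bus read
        self.previous = 0                # assumption: start in the normal state
        self.current = 0

    def poll(self) -> None:
        self.previous = self.current
        self.current = self._read_status()


class InfoJudge:
    """Information judging module: compares current and previous status."""

    @staticmethod
    def consistent(previous: int, current: int) -> bool:
        return previous == current


class StatusMonitor:
    """Status monitoring module: applies the alarm rule when the status changes."""

    def __init__(self, clear_allowed: Callable[[], bool]) -> None:
        self._clear_allowed = clear_allowed  # hypothetical clearing rule
        self.alarm_active = False
        self.log: List[str] = []

    def on_change(self, current: int) -> None:
        if current == 1 and not self.alarm_active:
            self.alarm_active = True
            self.log.append("alarm raised")
        elif current == 0 and self.alarm_active and self._clear_allowed():
            self.alarm_active = False
            self.log.append("alarm cleared")


def run_once(reader: InfoReader, monitor: StatusMonitor) -> None:
    """One polling cycle: read, compare, and act only on a change."""
    reader.poll()
    if not InfoJudge.consistent(reader.previous, reader.current):
        monitor.on_change(reader.current)


readings = iter([0, 1, 1, 0])
reader = InfoReader(lambda: next(readings))
monitor = StatusMonitor(clear_allowed=lambda: True)
for _ in range(4):
    run_once(reader, monitor)
print(monitor.log)
```

The source does not state how a deferred clear is re-evaluated once the stored status already reads normal, so this sketch leaves that case open.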
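It can be seen that the present application discloses a CPU status monitoring method, applied to a baseboard management controller, including: reading, through a dedicated single-wire bus over which a communication connection with the CPU has been established in advance, the current status information of the CPU recorded in a preset register inside the CPU, and saving the current status information locally; determining whether the current status information is consistent with the locally saved previous status information of the CPU; and, if the two are inconsistent, raising a corresponding abnormal-status alarm or clearing the abnormal-status alarm according to a preset abnormal-status alarm rule. The present application thus obtains the current status information of the CPU directly over the dedicated single-wire bus, which yields accurate status information, helps maintain good CPU performance and extend its service life, and as far as possible avoids problems such as server downtime caused by CPU overheating, bringing considerable economic benefits. Raising or clearing the abnormal-status alarm according to the preset abnormal-status alarm rule then effectively prevents alarms from being cleared by mistake.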
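Further, an embodiment of the present application also discloses an electronic device. FIG. 5 is an exemplary structural diagram of the electronic device 20; its content should not be regarded as limiting the scope of use of the present application in any way.
FIG. 5 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application. The electronic device 20 may specifically include at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the CPU status monitoring method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be an electronic computer.
In some embodiments of the present application, the power supply 23 provides the operating voltage for the hardware components of the electronic device 20; the communication interface 24 creates a data transmission channel between the electronic device 20 and external devices, and may follow any communication protocol applicable to the technical solution of the present application, which is not specifically limited here; the input/output interface 25 obtains external input data or outputs data externally, and its specific interface type may be selected according to application needs, which is not specifically limited here.
In addition, the memory 22, as the carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like. The resources stored on it may include an operating system 221, a computer program 222, and so on, and the storage may be temporary or permanent.
The operating system 221 manages and controls the hardware components of the electronic device 20 and the computer program 222, and may be Windows Server, Netware, Unix, Linux, or the like. In addition to the computer program that carries out the CPU status monitoring method performed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks.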
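Further, the present application also discloses a non-volatile readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the CPU status monitoring method disclosed above. For the specific steps of the method, refer to the corresponding content in the foregoing embodiments; they are not repeated here.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts can be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The CPU status monitoring method, apparatus, device, and storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.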

Claims (20)

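  1. A central processing unit (CPU) status monitoring method, characterized in that it is applied to a baseboard management controller and comprises:
    reading, through a dedicated single-wire bus over which a communication connection with a CPU has been established in advance, current status information of the CPU recorded in a preset register inside the CPU, and saving the current status information locally;
    determining whether the current status information is consistent with locally saved previous status information of the CPU;
    if the current status information is inconsistent with the previous status information, raising a corresponding abnormal-status alarm or clearing an abnormal-status alarm according to a preset abnormal-status alarm rule.
  2. The CPU status monitoring method according to claim 1, characterized in that the reading, through the dedicated single-wire bus over which a communication connection with the CPU has been established in advance, of the current status information of the CPU recorded in the preset register inside the CPU, and the saving of the current status information locally, comprises:
    reading, through a Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, current temperature status information of the CPU recorded in the preset register inside the CPU, and saving the current temperature status information locally.
  3. The CPU status monitoring method according to claim 2, characterized in that the reading, through the Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, of the current temperature status information of the CPU recorded in the preset register inside the CPU comprises:
    reading, through the Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, the value of the bit in the preset register inside the CPU that indicates the CPU temperature status.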
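  4. The CPU status monitoring method according to claim 2, characterized in that the determining whether the current status information is consistent with the locally saved previous status information of the CPU comprises:
    if the current temperature status information is consistent with locally saved previous temperature status information of the CPU, neither raising nor clearing an abnormal-status alarm, and returning to the step of reading, through the Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, the current temperature status information of the CPU recorded in the preset register inside the CPU and saving the current temperature status information locally.
  5. The CPU status monitoring method according to claim 4, characterized in that the determining whether the current status information is consistent with the locally saved previous status information of the CPU comprises:
    if the currently read value of the bit indicating the CPU temperature status is 0 and the locally saved, previously read value of the bit indicating the CPU temperature status is 0, determining that the current temperature status information is consistent with the locally saved previous temperature status information of the CPU, and that the current status information and the locally saved previous status information of the CPU both indicate a normal temperature.
  6. The CPU status monitoring method according to claim 4, characterized in that the determining whether the current status information is consistent with the locally saved previous status information of the CPU comprises:
    if the currently read value of the bit indicating the CPU temperature status is 1 and the locally saved, previously read value of the bit indicating the CPU temperature status is 1, determining that the current temperature status information is consistent with the locally saved previous temperature status information of the CPU, and that the current status information and the locally saved previous status information of the CPU both indicate an abnormal temperature.
  7. The CPU status monitoring method according to claim 4, characterized in that the determining whether the current status information is consistent with the locally saved previous status information of the CPU comprises:
    if the currently read value of the bit indicating the CPU temperature status is 1 and the locally saved, previously read value of the bit indicating the CPU temperature status is 0, determining that the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU, that the current status information indicates an abnormal temperature, and that the locally saved previous temperature status information of the CPU indicates a normal temperature.
  8. The CPU status monitoring method according to claim 4, characterized in that the determining whether the current status information is consistent with the locally saved previous status information of the CPU comprises:
    if the currently read value of the bit indicating the CPU temperature status is 0 and the locally saved, previously read value of the bit indicating the CPU temperature status is 1, determining that the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU, that the current status information indicates a normal temperature, and that the locally saved previous temperature status information of the CPU indicates an abnormal temperature.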
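  9. The CPU status monitoring method according to claim 2, characterized in that the raising of a corresponding abnormal-status alarm according to the preset abnormal-status alarm rule if the current status information is inconsistent with the locally saved previous status information of the CPU comprises:
    if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates an abnormal temperature, triggering a temperature-abnormality report instruction, recording through the baseboard management controller an alarm log for the abnormal temperature status, and raising the corresponding temperature-abnormality alarm.
  10. The CPU status monitoring method according to claim 2, characterized in that the clearing of the abnormal-status alarm according to the preset abnormal-status alarm rule if the current status information is inconsistent with the locally saved previous status information of the CPU comprises:
    if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates a normal temperature, clearing the abnormal-status alarm according to the preset abnormal-status alarm rule.
  11. The CPU status monitoring method according to claim 10, characterized in that the clearing of the abnormal-status alarm according to the preset abnormal-status alarm rule if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates a normal temperature comprises:
    detecting and recording the server system time each time the CPU is in the abnormal temperature state;
    if the current temperature status information is inconsistent with the locally saved previous temperature status information of the CPU and the current temperature status information indicates a normal temperature, calculating the time difference between the current server system time and the server system time recorded for the most recent abnormal temperature status of the CPU;
    deciding whether to clear the abnormal-status alarm according to the time difference and the CPU temperature status information detected by a temperature sensor built into a voltage regulator.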
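  12. The CPU status monitoring method according to claim 11, characterized in that the deciding whether to clear the abnormal-status alarm according to the time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator comprises:
    when the time difference is less than a preset time difference, not clearing the abnormal-status alarm, and returning to the step of reading, through the Platform Environment Control Interface over which a communication connection with the CPU has been established in advance, the current temperature status information of the CPU recorded in the preset register inside the CPU and saving the current temperature status information locally.
  13. The CPU status monitoring method according to claim 12, characterized in that the not clearing of the abnormal-status alarm when the time difference is less than the preset time difference comprises:
    when the time difference is less than the preset time difference, determining that the bit in the preset register inside the CPU that indicates the CPU temperature status is in an oscillating state, and not clearing the abnormal-status alarm.
  14. The CPU status monitoring method according to claim 11, characterized in that the deciding whether to clear the abnormal-status alarm according to the time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator comprises:
    if the time difference is greater than a preset time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator indicates a normal temperature, recording through the baseboard management controller a log for the normal temperature status and clearing the abnormal-status alarm.
  15. The CPU status monitoring method according to claim 14, characterized in that the recording through the baseboard management controller of a log for the normal temperature status and the clearing of the abnormal-status alarm, if the time difference is greater than the preset time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator indicates a normal temperature, comprises:
    if the time difference is greater than the preset time difference and the ambient temperature near the CPU detected by the temperature sensor built into the voltage regulator (VR) chip is in a normal temperature state, recording through the baseboard management controller a log for the normal temperature status and clearing the abnormal-status alarm.
  16. The CPU status monitoring method according to claim 11, characterized in that the deciding whether to clear the abnormal-status alarm according to the time difference and the CPU temperature status information detected by the temperature sensor built into the voltage regulator comprises:
    if the time difference is greater than a preset time difference and the ambient temperature near the CPU detected by the temperature sensor built into the voltage regulator (VR) chip is in an abnormal temperature state, not clearing the abnormal-status alarm.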
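  17. The CPU status monitoring method according to claim 1, characterized in that the method further comprises:
    monitoring error status information of the CPU, wherein the error status information includes, but is not limited to: an internal error, processor damage, an unrecoverable processor error, and a recoverable processor error.
  18. A CPU status monitoring apparatus, characterized in that it comprises:
    an information reading module, configured to read, through a dedicated single-wire bus over which a communication connection with a CPU has been established in advance, current status information of the CPU recorded in a preset register inside the CPU, and to save the current status information locally;
    an information judging module, configured to determine whether the current status information is consistent with locally saved previous status information of the CPU;
    a status monitoring module, configured to raise a corresponding abnormal-status alarm or clear an abnormal-status alarm according to a preset abnormal-status alarm rule if the current status information is inconsistent with the previous status information.
  19. An electronic device, characterized in that it comprises:
    a memory, configured to store a computer program;
    a processor, configured to execute the computer program to implement the steps of the CPU status monitoring method according to any one of claims 1 to 17.
  20. A non-volatile readable storage medium, characterized in that it is used for storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the CPU status monitoring method according to any one of claims 1 to 17.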
PCT/CN2023/083130 2022-03-25 2023-03-22 Central processing unit status monitoring method, apparatus, device, and storage medium WO2023179684A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210302352.7 2022-03-25
CN202210302352.7A CN114676019B (zh) 2022-03-25 2022-03-25 Central processing unit status monitoring method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023179684A1 true WO2023179684A1 (zh) 2023-09-28

Family

ID=82076556

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083130 WO2023179684A1 (zh) 2022-03-25 2023-03-22 Central processing unit status monitoring method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114676019B (zh)
WO (1) WO2023179684A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
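CN114676019B (zh) * 2022-03-25 2024-06-28 苏州浪潮智能科技有限公司 Central processing unit status monitoring method, apparatus, device, and storage medium
CN115129524A (zh) * 2022-06-29 2022-09-30 苏州浪潮智能科技有限公司 Method, system, device, and storage medium for detecting abnormal VR chip status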

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
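JPS60195649A (ja) * 1984-03-16 1985-10-04 Nec Corp Error reporting system in a microprogram-controlled data processing apparatus
CN108089964A (zh) * 2017-12-07 2018-05-29 郑州云海信息技术有限公司 Apparatus and method for monitoring server CPLD status through a BMC
CN108268360A (zh) * 2018-01-19 2018-07-10 郑州云海信息技术有限公司 Method, system, apparatus, and storage medium for a BMC to obtain memory temperature
CN109656767A (zh) * 2018-12-21 2019-04-19 广东浪潮大数据研究有限公司 Method, system, and related components for obtaining CPLD status information
CN111767184A (zh) * 2020-09-01 2020-10-13 苏州浪潮智能科技有限公司 Fault diagnosis method and apparatus, electronic device, and storage medium
CN114676019A (zh) * 2022-03-25 2022-06-28 苏州浪潮智能科技有限公司 Central processing unit status monitoring method, apparatus, device, and storage medium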

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
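CN108376107A (zh) * 2018-03-01 2018-08-07 郑州云海信息技术有限公司 Server fault detection method, apparatus, device, and storage medium
CN108304299A (zh) * 2018-03-02 2018-07-20 郑州云海信息技术有限公司 Server power-on status monitoring system and method, computer memory, and device
CN113708986B (zh) * 2020-05-21 2023-02-03 富联精密电子(天津)有限公司 Server monitoring apparatus, method, and computer-readable storage medium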

Also Published As

Publication number Publication date
CN114676019B (zh) 2024-06-28
CN114676019A (zh) 2022-06-28

Similar Documents

Publication Publication Date Title
WO2023179684A1 (zh) Central processing unit status monitoring method, apparatus, device, and storage medium
TWI229796B (en) Method and system to implement a system event log for system manageability
US7738366B2 (en) Methods and structure for detecting SAS link errors with minimal impact on SAS initiator and link bandwidth
US12014791B2 (en) Memory fault handling method and apparatus, device, and storage medium
US20070088988A1 (en) System and method for logging recoverable errors
US20170068607A1 (en) Systems and methods for detecting memory faults in real-time via smi tests
US20050188263A1 (en) Detecting and correcting a failure sequence in a computer system before a failure occurs
US11853150B2 (en) Method and device for detecting memory downgrade error
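CN112380089A (zh) Data center monitoring and early warning method and system
CN110704228A (zh) Solid-state drive exception handling method and system
TW201523239A (zh) Fan error detection system and method
CN116820820A (zh) Server fault monitoring method and system
CN111625386A (zh) Monitoring method and apparatus for power-on timeout of system devices
CN116225812B (zh) Baseboard management controller system operation method, apparatus, device, and storage medium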
US20210334153A1 (en) Remote error detection method adapted for a remote computer device to detect errors that occur in a service computer device
US7664797B1 (en) Method and apparatus for using statistical process control within a storage management system
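CN110795276A (zh) Storage medium repair method, computer device, and storage medium
CN117607595A (zh) Device improvement method, apparatus, device, storage medium, and program product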
US20090249031A1 (en) Information processing apparatus and error processing
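JP5689783B2 (ja) Computer, computer system, and fault information management method
CN114564334B (zh) MRPC data processing method, system, and related components
CN114936135A (zh) Anomaly detection method, apparatus, and readable storage medium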
US20190042125A1 (en) External indicators for adaptive in-field recalibration
US7992047B2 (en) Context sensitive detection of failing I/O devices
TWI494754B (zh) Server monitoring device and operating method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23773931

Country of ref document: EP

Kind code of ref document: A1