CN115437819A

CN115437819A - Error reporting method and device for server, computer equipment and storage medium

Info

Publication number: CN115437819A
Application number: CN202210968319.8A
Authority: CN
Inventors: 张国奇
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-08-12
Filing date: 2022-08-12
Publication date: 2022-12-06

Abstract

The invention discloses an error reporting method and apparatus for a server, a computer device, and a storage medium. The method comprises, after the server is powered on and in response to the baseboard management controller reporting an error in the form of a fatal error, executing the following steps based on the basic input output system: acquiring the operating state of the baseboard management controller; in response to the operating state of the baseboard management controller being an initialization operating state, performing exception repair on the baseboard management controller; after the exception repair, acquiring operating information of the baseboard management controller from the central processing unit; determining, based on the operating information, whether the baseboard management controller is abnormal; and, in response to the baseboard management controller being abnormal, determining the exception type and sending the determination result to the central processing unit. The scheme avoids fatal-error reports caused by communication timeouts between the baseboard management controller and the central processing unit during startup, so that the server can keep running efficiently and reliably.

Description

Error reporting method and device for server, computer equipment and storage medium

Technical Field

The present invention relates to the field of server technologies, and in particular, to an error reporting method and apparatus for a server, a computer device, and a storage medium.

Background

Servers are one of the main carriers of back-end network operation. Besides maintaining network operation and data storage, a server can host many network services and devices. A server can be regarded as the basic carrier of IT information services, and each server combines basic hardware and software. Different servers have different CPU (Central Processing Unit) architectures, and problems may occur when a hardware design is ported from one platform to another. Intel CPUs are the most mature in the server market, but various problems arise when a server with an AMD (Advanced Micro Devices) architecture CPU reuses parts of an Intel hardware design.

The invention mainly addresses abnormal error reports on a server with an AMD architecture that adopts an Intel CPU hardware design scheme. In such a server the BMC (Baseboard Management Controller) main control chip is a PCIE (Peripheral Component Interconnect Express) peripheral device of the CPU, and if the link between the BMC and the CPU is abnormal, serious errors are often generated, which inconveniences the operator and can even cause serious problems. For example, while the server is in use, and especially while it is restarting, the error log is prone to falsely reporting a fatal error of the BMC device. For such a fatal error, the operator cannot distinguish a communication timeout caused by abnormal communication from a fault of the device itself, which greatly affects the operator's use of the server.

Disclosure of Invention

In view of this, the present invention provides an error reporting method and apparatus for a server, a computer device, and a storage medium, which can identify an error reporting problem occurring in the form of a fatal error during a server startup process, and ensure that the server can maintain efficient and reliable operation.

Based on the above object, an aspect of the embodiments of the present invention provides an error reporting method for a server, which specifically includes: after the server is powered on, in response to the occurrence of an error report in the form of a fatal error by the baseboard management controller, executing the following steps based on the basic input output system:

acquiring the operating state of the baseboard management controller;

in response to the operating state of the baseboard management controller being an initialization operating state, performing exception repair on the baseboard management controller;

after the exception repair of the baseboard management controller, acquiring operating information of the baseboard management controller from the central processing unit;

determining, based on the operating information, whether the baseboard management controller is abnormal;

and in response to the baseboard management controller being abnormal, determining the exception type and sending the determination result to the central processing unit.

In some embodiments, performing exception repair on the baseboard management controller in response to the operating state of the baseboard management controller being an initialization operating state includes:

in response to the operating state of the baseboard management controller being an initialization operating state, performing exception repair on the baseboard management controller based on the type of the initialization operating state.

In some embodiments, performing exception repair on the baseboard management controller based on the type of the initialization operating state includes:

setting a corresponding time threshold based on the type of the initialization operating state, and sending information to the baseboard management controller;

and in response to receiving return information from the baseboard management controller within the time threshold, sending a message to the central processing unit indicating that communication with the baseboard management controller is normal.

In some embodiments, the method further includes the following step:

in response to not receiving return information from the baseboard management controller within the time threshold, restarting the baseboard management controller, and sending a message to the central processing unit indicating that the baseboard management controller is restarting.
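The threshold-based repair described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the firmware implementation: the function name `repair_bmc`, the callables `send_probe`, `poll_reply`, `restart_bmc` and `notify_cpu`, the polling interval, and the message strings are all hypothetical stand-ins for whatever the BIOS actually uses to talk to the baseboard management controller and the central processing unit.

```python
import time
from typing import Callable


def repair_bmc(
    send_probe: Callable[[], None],
    poll_reply: Callable[[], bool],
    restart_bmc: Callable[[], None],
    notify_cpu: Callable[[str], None],
    threshold_s: float,
    poll_interval_s: float = 0.01,
) -> bool:
    """Probe the BMC and wait up to threshold_s seconds for its reply.

    Returns True if a reply arrived in time; otherwise restarts the BMC
    and returns False. All callables are hypothetical placeholders.
    """
    send_probe()  # send information to the BMC
    deadline = time.monotonic() + threshold_s
    while time.monotonic() < deadline:
        if poll_reply():
            # Reply received within the threshold: communication is normal.
            notify_cpu("BMC communication normal")
            return True
        time.sleep(poll_interval_s)
    # No reply within the threshold: restart the BMC and tell the CPU.
    restart_bmc()
    notify_cpu("BMC restarting")
    return False
```

Passing the four actions in as callables keeps the sketch independent of any particular bus or mailbox interface, which the text does not specify.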

In some embodiments, sending the determination result to the central processing unit includes:

in response to the exception being a communication timeout between the baseboard management controller and the central processing unit, determining that the exception type is communication timeout, and sending a communication timeout message to the central processing unit;

and in response to the exception being a hardware exception of the baseboard management controller, sending a message to the central processing unit indicating a hardware exception of the baseboard management controller.
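The two exception types above can be illustrated with a small classifier. This is only a sketch: the names `BmcAnomaly`, `classify_anomaly` and `report`, the two boolean inputs, and the message strings are hypothetical, and giving a hardware exception priority when both conditions hold is an assumption that the text does not state.

```python
from enum import Enum
from typing import Callable, Optional


class BmcAnomaly(Enum):
    COMM_TIMEOUT = "BMC-CPU communication timeout"
    HARDWARE = "BMC hardware exception"


def classify_anomaly(link_timed_out: bool, hardware_fault: bool) -> Optional[BmcAnomaly]:
    """Map observed conditions to an exception type, or None if healthy."""
    if hardware_fault:
        # Assumption: a hardware fault outranks a timeout when both are seen.
        return BmcAnomaly.HARDWARE
    if link_timed_out:
        return BmcAnomaly.COMM_TIMEOUT
    return None


def report(anomaly: Optional[BmcAnomaly], notify_cpu: Callable[[str], None]) -> None:
    """Send the determination result to the CPU only when an exception exists."""
    if anomaly is not None:
        notify_cpu(anomaly.value)
```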

In some embodiments, the server is an AMD server, and the initialization operating state of the baseboard management controller includes any one of the following states: a power-on start of the baseboard management controller, a reboot restart of the baseboard management controller, and an automatic restart of the baseboard management controller.

In another aspect, an embodiment of the invention also provides an error reporting apparatus for a server, which is applied to the basic input output system and comprises a first acquisition module, a repair module, a second acquisition module, a judgment module and a sending module;

the first acquisition module is configured to acquire the operating state of the baseboard management controller;

the repair module is configured to perform exception repair on the baseboard management controller in response to the operating state of the baseboard management controller being an initialization operating state;

the second acquisition module is configured to acquire operating information of the baseboard management controller from the central processing unit after the exception repair of the baseboard management controller;

the judgment module is configured to determine, based on the operating information, whether the baseboard management controller is abnormal;

the sending module is configured to, in response to the baseboard management controller being abnormal, determine the exception type and send the determination result to the central processing unit.

In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing a computer program executable on the processor, the computer program when executed by the processor implementing the steps of the method:

acquiring the operating state of the baseboard management controller;

In some embodiments, performing exception repair on the baseboard management controller based on the type of the initialization operating state includes:

setting a corresponding time threshold based on the type of the initialization operating state, and sending information to the baseboard management controller;

In some embodiments, further comprising the steps of:

In some embodiments, the server is an AMD server, and the initialization operating state of the baseboard management controller includes any one of the following states: a power-on start of the baseboard management controller, a reboot restart of the baseboard management controller, and an automatic restart of the baseboard management controller.

In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the above method steps.

The invention has at least the following beneficial technical effects. The operating state of the baseboard management controller is acquired; in response to the operating state being an initialization operating state, exception repair is performed on the baseboard management controller; after the exception repair, operating information of the baseboard management controller is acquired from the central processing unit; whether the baseboard management controller is abnormal is determined based on the operating information; and, in response to the baseboard management controller being abnormal, the exception type is determined and the determination result is sent to the central processing unit. This avoids fatal-error reports caused by communication timeouts between the baseboard management controller and the central processing unit during startup, and ensures that the server can keep running efficiently and reliably.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

FIG. 1 is a block diagram of an embodiment of an error reporting method for a server according to the present invention;

FIG. 2 is a schematic diagram of an embodiment of an error reporting apparatus for a server according to the present invention;

FIG. 3 is a schematic structural diagram of an embodiment of a computer device provided in the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities or two non-identical parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

The BMC serves as firmware of the server and, relative to the CPU, is a PCIE device. The BMC generally starts as soon as the power supply is plugged in. If the server is set to power on (that is, the server starts automatically when power is applied), the power-on sequence is generally that the BMC firmware starts first and the CPU starts immediately afterward. The CPU interacts with the BMC firmware during its startup, and if abnormal communication or a link timeout occurs during this interaction, a fatal error is reported. The operating system cannot tell what lies behind an error reported in the form of a fatal error, which seriously affects the operator's use of the server.

In view of the above object, a first aspect of the embodiments of the present invention provides an embodiment of an error reporting method for a server. As shown in FIG. 1, after the server is powered on, in response to the baseboard management controller reporting an error in the form of a fatal error, the method includes the following steps executed based on the basic input output system:

S10, acquiring the operating state of the baseboard management controller;

S20, in response to the operating state of the baseboard management controller being an initialization operating state, performing exception repair on the baseboard management controller;

S30, after the exception repair of the baseboard management controller, acquiring operating information of the baseboard management controller from the central processing unit;

S40, determining, based on the operating information, whether the baseboard management controller is abnormal;

and S50, in response to the baseboard management controller being abnormal, determining the exception type and sending the determination result to the central processing unit.

In a specific embodiment, error reports have several levels, such as general, serious, and fatal, of which a fatal error is usually the most severe. After the basic input output system (hereinafter referred to as BIOS) obtains an error report in the form of a fatal error from the baseboard management controller (hereinafter referred to as BMC), it acquires the operating state of the BMC to distinguish whether the BMC is in a normal operating state or an initialization state. If the BMC is in a normal operating state, the BIOS reports the error information to the central processing unit (CPU) and the server system. If the BMC is in an initialization operating state, exception repair is performed on the BMC, specifically as follows: the BIOS monitors the communication of the BMC to identify whether a communication timeout occurs while the BMC, as a PCIE device, is initializing, which prevents false alarms caused by an overly long communication time. The BMC initialization process includes the BMC power-on start process, the BMC reboot restart process, and the BMC automatic restart process. After the BIOS has performed exception repair on the BMC, that is, after the BMC has been powered on and started or has restarted, the BIOS monitors the communication between the BMC and the CPU and the communication of the PCIE device formed by the BMC and its peripheral components, obtains the actual operating condition of the BMC, and determines whether the BMC has a communication timeout problem. If it does, the communication timeout problem of the BMC is reported to the CPU, and the fatal-error report that the CPU received earlier is reported as a non-abnormal problem.
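The branching in this embodiment, together with steps S10 to S50, can be sketched as a single handler. This is an illustrative outline under assumptions, not the BIOS code: `BmcState`, `handle_fatal_report`, the injected callables and the returned strings are hypothetical names, and returning "no anomaly" without notifying the CPU when nothing abnormal is found is an assumption, since the text only describes what is sent when an exception exists.

```python
from enum import Enum, auto
from typing import Callable, Optional


class BmcState(Enum):
    NORMAL = auto()
    INITIALIZING = auto()


def handle_fatal_report(
    get_state: Callable[[], BmcState],
    repair: Callable[[], None],
    get_operating_info: Callable[[], dict],
    classify: Callable[[dict], Optional[str]],
    notify_cpu: Callable[[str], None],
) -> str:
    # S10: read the BMC operating state.
    if get_state() is BmcState.NORMAL:
        # Not initializing: pass the original fatal error through unchanged.
        notify_cpu("fatal error")
        return "fatal error"
    # S20: the BMC is still initializing, so attempt exception repair first.
    repair()
    # S30: re-read operating information after the repair.
    info = get_operating_info()
    # S40: decide whether the BMC is abnormal; None means nothing was found.
    anomaly = classify(info)
    if anomaly is None:
        return "no anomaly"
    # S50: send the determined exception type to the CPU.
    notify_cpu(anomaly)
    return anomaly
```

For example, calling the handler with a `get_state` that returns `BmcState.INITIALIZING` and a `classify` that returns `"communication timeout"` runs the repair first and then sends only the timeout message, rather than the original fatal error.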

In a specific embodiment, a corresponding time threshold is set based on the initialization operating state in which the BMC reported the abnormal error; the BMC initialization states include a BMC power-on start state, a BMC reboot restart state, and a BMC automatic restart state. For example, if the BMC is in the power-on start state, the time threshold is set to 2 min; if the BMC is in the reboot restart state, the time threshold is set to 1 min; and if the BMC is in the automatic restart state, the time threshold is set to 50 s. This extends the time allowed for acquiring the BMC communication state, avoids fatal-error reports caused by communication timeouts between the baseboard management controller and the central processing unit during startup, and ensures that the server can keep running efficiently and reliably.
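The example thresholds in this embodiment amount to a lookup table keyed by initialization state. A minimal sketch follows; the names `BmcInitState`, `TIME_THRESHOLD_S` and `time_threshold_s` are hypothetical, while the three values (2 min, 1 min, 50 s) are the examples given in the text, expressed in seconds.

```python
from enum import Enum


class BmcInitState(Enum):
    POWER_ON = "power-on start"
    REBOOT = "reboot restart"
    AUTO_RESTART = "automatic restart"


# Example thresholds from the embodiment, converted to seconds.
TIME_THRESHOLD_S = {
    BmcInitState.POWER_ON: 120,      # 2 min
    BmcInitState.REBOOT: 60,         # 1 min
    BmcInitState.AUTO_RESTART: 50,   # 50 s
}


def time_threshold_s(state: BmcInitState) -> int:
    """Return the wait time, in seconds, for the given initialization state."""
    return TIME_THRESHOLD_S[state]
```

Keeping the values in one table makes it easy to tune them per platform, since the text presents them only as examples.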

In some embodiments, the method further includes the following step:

in response to the exception being a communication timeout between the baseboard management controller and the central processing unit, determining that the exception type is communication timeout, and sending a communication timeout message to the central processing unit.

On the other hand, the embodiment of the present invention further provides an error reporting device for a server, which is applied to a basic input/output system, where the error reporting device 10 includes a first obtaining module 11, a repairing module 12, a second obtaining module 13, a determining module 14, and a sending module 15;

the first obtaining module 11 is configured to obtain the operating state of the baseboard management controller;

the repair module 12 is configured to perform exception repair on the baseboard management controller in response to the operating state of the baseboard management controller being an initialization operating state;

the second obtaining module 13 is configured to obtain operating information of the baseboard management controller from the central processing unit after the exception repair of the baseboard management controller;

the determining module 14 is configured to determine, based on the operating information, whether the baseboard management controller is abnormal;

the sending module 15 is configured to, in response to the baseboard management controller being abnormal, determine the exception type and send the determination result to the central processing unit.

The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.

The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

Based on the same inventive concept, according to another aspect of the present invention, as shown in FIG. 3, an embodiment of the present invention further provides a computer device 30. The computer device 30 comprises a processor 310 and a memory 320; the memory 320 stores a computer program 321 that can run on the processor, and the processor 310 executes the program to perform the steps of the above method.

The memory, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the error reporting method of the server in the embodiments of the present application. The processor executes various functional applications and data processing of the system by running the nonvolatile software program, instructions and modules stored in the memory, that is, the error reporting method of the server of the above method embodiment is realized.

The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the system, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Based on the same inventive concept, according to another aspect of the present invention, as shown in FIG. 4, an embodiment of the present invention further provides a computer-readable storage medium 40, where the computer-readable storage medium 40 stores a computer program 410 which, when executed by a processor, performs the above method.

The computer-readable storage medium (e.g., memory) herein may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.

Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), or a Random Access Memory (RAM). The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments corresponding thereto.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. The numbers of the embodiments disclosed in the above embodiments of the present invention are merely for description, and do not represent the advantages or disadvantages of the embodiments. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. An error reporting method for a server, characterized in that, after the server is powered on, in response to a baseboard management controller reporting an error in the form of a fatal error, the following steps are executed based on a basic input output system:

acquiring the operating state of the baseboard management controller;

2. The method of claim 1, wherein performing exception repair on the baseboard management controller in response to the operating state of the baseboard management controller being an initialization operating state comprises:

3. The method of claim 2, wherein performing exception repair on the baseboard management controller based on the type of the initialization operating state comprises:

4. The method of claim 3, further comprising:

5. The method of claim 1, wherein sending the determination to the central processor comprises:

6. The method of claim 1, wherein sending the determination to the central processor comprises:

7. The method of claim 1, wherein the server is an AMD server, and wherein the initialization operating state of the baseboard management controller comprises any one of the following states: a power-on start of the baseboard management controller, a reboot restart of the baseboard management controller, and an automatic restart of the baseboard management controller.

8. An error reporting apparatus for a server, characterized in that the apparatus is applied to a basic input output system and comprises a first acquisition module, a repair module, a second acquisition module, a judgment module and a sending module;

the second acquisition module is configured to acquire operating information of the baseboard management controller from a central processing unit after the exception repair of the baseboard management controller;

9. A computer device, comprising:

at least one processor; and

memory storing a computer program operable on the processor, wherein the processor executes the program to perform the steps of the method according to any of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.