CN111966521B - Hardware error processing method, processor, controller, electronic device and storage medium

Info

Publication number: CN111966521B
Application number: CN202010828278.3A
Authority: CN
Inventors: 姜莹; 王海洋
Original assignee: Chengdu Haiguang Integrated Circuit Design Co Ltd
Current assignee: Chengdu Haiguang Integrated Circuit Design Co Ltd
Priority date: 2020-08-17
Filing date: 2020-08-17
Publication date: 2023-10-13
Anticipated expiration: 2040-08-17
Also published as: CN111966521A

Abstract

The application provides a hardware error processing method, a processor, a controller, an electronic device and a storage medium. The method is applied to a controller in an electronic device, the controller being connected to monitored hardware and a memory in the electronic device, and an operating system being deployed in the electronic device. The method comprises: when an error is detected in the monitored hardware, generating error information for the error; writing the error information into a preset storage area in the memory; and sending an interrupt notification of the error information to the operating system, wherein the interrupt notification instructs the operating system to read the error information from the storage area. Because the error information is written into a preset storage area in the memory and the operating system is notified by interrupt to read that storage area, the storage area does not need to be polled and is faster to read than polled MSRs, so the error information can be obtained quickly and a fast error response is achieved.

Description

Hardware error processing method, processor, controller, electronic device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method for processing a hardware error, a processor, a controller, an electronic device, and a storage medium.

Background

MCA (Machine Check Architecture) mechanisms are commonly used in existing CPU (Central Processing Unit) architectures to self-test the hardware of an electronic device and to raise an interrupt or exception when a hardware error is found. After receiving the interrupt or exception, the operating system responds to it and takes corresponding actions such as repair, alarm, or other strategies. Through the MCA mechanism, system software can detect hardware errors such as system bus errors, memory ECC (Error Correcting Code) errors, parity errors, cache errors, TLB errors, and the like.

Specifically, the MCA mechanism defines a set of related MSRs (Model Specific Registers). When a controller defined in the MCA detects that an error has occurred in hardware, it records the error information of the faulty hardware in the corresponding MSR and then sends an interrupt to the operating system. In response to the interrupt, the operating system polls and reads the information in each MSR, thereby reading the error information of the hardware in which the error occurred. However, hardware errors are of many types. For example, many pieces of hardware are connected to the north bridge of the processor, so tens of kinds of errors may occur in the north bridge, with tens of corresponding MSRs, and the load on the operating system of polling and reading the MSRs is therefore heavy.

To reduce the burden on the operating system, an extended MCA mode can currently be used: the hardware first interrupts the firmware in the processor, the firmware then performs error detection and collates the error information, and finally the firmware interrupts the operating system to notify it, which simplifies the processing flow of the operating system.

However, in this mode, again because of the large number of MSRs, the firmware takes a long time to poll the MSRs and read the corresponding error information, so the error response is slow.

Disclosure of Invention

An object of an embodiment of the present application is to provide a method, a processor, a controller, an electronic device, and a storage medium for processing a hardware error, so as to implement fast error response.

In a first aspect, an embodiment of the present application provides a method for processing a hardware error, which is applied to a controller in an electronic device, where the controller is connected to monitored hardware and a memory in the electronic device, and an operating system is deployed in the electronic device. The method includes: generating error information for an error when the error is detected in the monitored hardware; writing the error information into a preset storage area in the memory; and sending an interrupt notification of the error information to the operating system, wherein the interrupt notification instructs the operating system to read the error information from the storage area.

In the embodiment of the application, the error information is written into the preset storage area in the memory and the operating system is notified by interrupt to read the storage area. Compared with polling MSRs, the storage area does not need to be polled and is faster to read, so the error information can be obtained quickly and a fast error response is achieved.

With reference to the first aspect, in a first possible implementation manner, writing the error information into a preset storage area in the memory includes: judging whether the residual storage space in the storage area is larger than or equal to the storage space which is required to be occupied by the writing of the error information; and if the residual storage space is larger than or equal to the storage space which is required to be occupied by the writing, writing the error information into the residual storage space.

In the embodiment of the application, by judging whether the residual storage space is larger than or equal to the storage space occupied by the writing of the error information, the error information can be ensured to be correctly written, and errors are avoided.

With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, determining whether the remaining storage space in the storage area is greater than or equal to the storage space that needs to be occupied by the writing of the error information includes: accessing the storage area to obtain the positions pointed by the write head pointer and the write tail pointer in the storage area respectively; determining the remaining storage space according to the position; judging whether the residual storage space is larger than or equal to the storage space which needs to be occupied.

In the embodiment of the application, the positions pointed by the write head pointer and the write tail pointer in the storage area can reflect the storage condition of the storage area, so that the controller can quickly and accurately determine the size of the residual storage space by analyzing the positions pointed by the write head pointer and the write tail pointer.
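The remaining-space check can be illustrated with a minimal sketch. The source does not spell out how an empty area is told apart from a full one, so the convention below is an assumption: pointer positions are treated as slot indices, the distance from head to tail (modulo the capacity) counts the occupied slots, and one slot is kept unused. The names `remaining_slots` and `can_write` are hypothetical.

```python
def remaining_slots(head: int, tail: int, capacity: int) -> int:
    """Free slots in a ring of `capacity` slots.

    Assumption (not from the source): head == tail means nothing is
    pending, and one slot is always left unused so that a full ring
    can be distinguished from an empty one.
    """
    used = (tail - head) % capacity  # wraps correctly when tail < head
    return capacity - 1 - used


def can_write(head: int, tail: int, capacity: int, needed: int) -> bool:
    """True if `needed` slots fit in the remaining space."""
    return remaining_slots(head, tail, capacity) >= needed
```

With a ring of 8 slots, head at slot 6 and tail at slot 2, four slots are occupied, so a write of up to three slots would fit under this convention.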

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, writing the error information into the remaining storage space includes: writing the error information into corresponding address bits in the residual storage space, writing a start address of the error information into corresponding one address bit in the residual storage space, and writing an end address of the error information into corresponding another address bit in the residual storage space, wherein the corresponding one address bit is adjacent to the address bit of the first bit data of the error information, and the corresponding another address bit is adjacent to the address bit of the last bit data of the error information.

In the embodiment of the application, because the error information is written into the storage area in the format 'start address of the error information + content of the error information + end address of the error information', the operating system can conveniently distinguish the individual pieces of error information in the storage area by this format.
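A rough sketch of this layout follows. It assumes each "address bit" is one addressable slot, that the payload is non-empty, and that the record does not wrap around the end of the storage area; the function name `frame_record` is hypothetical.

```python
def frame_record(first_slot: int, payload: list[int]) -> dict[int, int]:
    """Return {address: value} for one framed record.

    Layout (per the format described above):
      first_slot          -> start address of the payload
      first_slot + 1 ...  -> payload words
      slot after payload  -> end address of the payload
    Wrap-around and empty payloads are not modelled in this sketch.
    """
    start_addr = first_slot + 1            # address of the first payload word
    end_addr = first_slot + len(payload)   # address of the last payload word
    image = {first_slot: start_addr}       # leading marker, adjacent to first word
    for offset, word in enumerate(payload):
        image[start_addr + offset] = word
    image[end_addr + 1] = end_addr         # trailing marker, adjacent to last word
    return image
```

For example, framing three payload words at slot 16 occupies slots 16 to 20: slot 16 holds 17, slots 17 to 19 hold the payload, and slot 20 holds 19.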

With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, after writing the error information into a preset storage area in the memory, the method further includes: updating the position pointed to by the write tail pointer so that it points to the address bit holding the end address of the error information.

In the embodiment of the application, updating the write tail pointer to point to the address bit holding the end address of the error information ensures that the storage condition reflected by the pointer matches the actual storage condition in the storage area.

With reference to the first possible implementation manner of the first aspect, in a fifth possible implementation manner, if the remaining storage space is smaller than the storage space that needs to be occupied, the method further includes: sending an interrupt notification of the monitored hardware to the operating system, wherein the interrupt notification of the monitored hardware is used for indicating the operating system to generate the error information; alternatively, the error information is written into the storage area to overwrite historical error information in the storage area.

In the embodiment of the application, under the condition that the residual storage space is smaller than the storage space which is required to be occupied by the writing of the error information, the controller can select a conventional processing mode or can write the error information into the storage area in a covering mode, and the processing mode is relatively flexible, so that the applicability of the method in practice is enhanced.
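The two options for a full storage area can be sketched as a small decision function. The enum `FullPolicy`, the function `plan_write`, and the returned labels are hypothetical names used only for illustration; the source does not say how the controller chooses between the two options.

```python
from enum import Enum


class FullPolicy(Enum):
    FALLBACK_TO_OS = "fallback"      # conventional interrupt, OS builds the info
    OVERWRITE_HISTORY = "overwrite"  # overwrite older error information


def plan_write(free: int, needed: int, policy: FullPolicy) -> str:
    """Decide what the controller does with one piece of error information."""
    if free >= needed:
        return "write_to_remaining_space"
    if policy is FullPolicy.FALLBACK_TO_OS:
        return "interrupt_os_conventionally"
    return "overwrite_historical_info"
```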

With reference to the first possible implementation manner of the first aspect, in a sixth possible implementation manner, accessing the storage area includes: and accessing the storage area according to the start address and the end address of the storage area which are pre-allocated by firmware in the electronic equipment.

In the embodiment of the application, because the firmware pre-allocates the start address and the end address, the storage area can be determined quickly.

With reference to the sixth possible implementation manner of the first aspect, in a seventh possible implementation manner, the start address and the end address are both physical addresses in the memory.

In the embodiment of the application, the starting address and the ending address are physical addresses in the memory, so that the controller can directly access the storage area without address offset, thereby realizing rapid and direct access to the storage area and further improving the efficiency.

With reference to the first aspect, in an eighth possible implementation manner, the controller is further connected to a corresponding register in the electronic device, and after detecting that the monitored hardware has an error and before generating the error information, the method further includes: obtaining, from the register, the error code of an error whose error information needs to be written into the storage area, wherein the error information of an error that needs to be written into the storage area is generated by calculation based on an error parameter, and the error information of an error that does not need to be written into the storage area is the error parameter itself; and determining that the error code of the error of the monitored hardware is the same as the error code obtained from the register.

In the embodiment of the application, the judgment of whether the error information is written or not can be carried out before writing, so that the utilization efficiency of the storage area is further improved.

With reference to the first aspect, in a ninth possible implementation manner, the error information is stored in an area from a position pointed to by the write head pointer to a position pointed to by the write tail pointer in the storage area; sending an interrupt notification of the error information to the operating system, including: and generating the interrupt notification carrying the positions pointed by the write head pointer and the write tail pointer respectively, and sending the interrupt notification to the operating system.

In the embodiment of the application, the respective pointing positions of the write head pointer and the write tail pointer are carried in the interrupt notification, so that the operating system can quickly and directly position error information according to the respective pointing positions of the write head pointer and the write tail pointer, and the error information can be quickly and directly read.

In a second aspect, an embodiment of the present application provides a method for processing a hardware error, which is applied to an operating system in an electronic device, where the operating system is connected to a controller in the electronic device, and the controller is further connected to monitored hardware and a memory in the electronic device, where the method includes: receiving an interrupt notification sent by the controller based on writing error information into a storage area preset in the memory, wherein the error information is generated by the controller based on the fact that the monitored hardware is monitored to have errors; and reading the error information from the storage area according to the interrupt notification.

In the embodiment of the application, the error information is written into the preset storage area in the memory and the operating system is notified by interrupt to read the storage area. Compared with the operating system polling MSRs, the operating system can read the storage area without polling and at a higher speed, so the error information can be obtained quickly and a fast error response is achieved. In addition, the whole process requires only one interrupt, so it occupies fewer processor resources than the two-interrupt approach.

With reference to the second aspect, in a first possible implementation manner, the error information is stored in the area of the storage area from the position pointed to by a write head pointer to the position pointed to by a write tail pointer, and the interrupt notification carries the positions pointed to by the write head pointer and the write tail pointer; reading the error information from the storage area according to the interrupt notification includes: reading the area bounded by the positions pointed to by the write head pointer and the write tail pointer, so as to read out the error information.

In an embodiment of the application, the operating system quickly and directly locates the error information based on the respective pointing positions of the write head pointer and the write tail pointer, thereby quickly and directly reading the error information.

With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, reading out the error information includes: sequentially reading the data in each address bit of the bounded area, thereby reading out the error information.

In the embodiment of the application, reading in units of one address bit prevents any data from being missed.

With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, after reading out the error information, the method further includes: updating the read pointer in the storage area so that it points to the address bit most recently read by the operating system, so that when the controller subsequently accesses the storage area again, it updates the write head pointer, according to the updated read pointer, to point to that same address bit.

In the embodiment of the application, the operating system updates the read pointer and the controller updates the write head pointer accordingly, which ensures that the storage condition reflected by the pointers matches the actual storage condition in the storage area.
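The read side can be sketched as follows, assuming the storage area is modelled as a list of slots, that the range from write head to write tail is inclusive at both ends, and that it may wrap past the end of the list. The function names `os_read` and `controller_sync_head` are hypothetical.

```python
def os_read(slots: list[int], head: int, tail: int) -> tuple[list[int], int]:
    """Read every slot from `head` to `tail` inclusive, wrapping if needed.

    Returns the data read and the new read pointer, which is the slot
    most recently read by the operating system.
    """
    data = []
    pos = head
    while True:
        data.append(slots[pos])         # read one slot at a time
        if pos == tail:
            break
        pos = (pos + 1) % len(slots)    # advance, wrapping at the end
    return data, pos


def controller_sync_head(read_ptr: int) -> int:
    """On its next access the controller moves the write head to the read pointer."""
    return read_ptr
```

Reading slot by slot in this way visits each slot in the range exactly once, including across the wrap point.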

In a third aspect, an embodiment of the present application provides a controller, including: an error monitoring circuit, a storage area management circuit and a write memory control circuit. The error monitoring circuit is connected to the storage area management circuit and is configured to be connected to monitored hardware; it notifies the storage area management circuit when an error is detected in the monitored hardware. The storage area management circuit is connected to the write memory control circuit and is configured to generate error information for the error according to the received indication and to determine whether the error information can be written into a preset storage area in a memory; if it can be written, the storage area management circuit sends the error information to the write memory control circuit, the write memory control circuit writes the error information into the storage area, and after the error information has been written into the storage area, the storage area management circuit sends an interrupt notification of the error information to an operating system.

With reference to the third aspect, in a first possible implementation manner, the controller further includes: a register connected to the error monitoring circuit and configured to store the error codes of errors whose error information needs to be written into the storage area. When the error monitoring circuit detects that an error has occurred, it reads the error code in the register and determines whether the error code of the error of the monitored hardware is the same as the error code in the register; if the error codes are the same, it instructs the storage area management circuit to generate the error information.

In a fourth aspect, an embodiment of the present application provides a processor, including: a processor core and a memory controller connected to the processor core, wherein the memory controller is configured to execute the hardware error processing method according to the first aspect or any possible implementation manner of the first aspect.

In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory and a controller connected with the memory; the controller is configured to perform a method for handling a hardware error according to the first aspect or any one of the possible implementation manners of the first aspect.

With reference to the fifth aspect, in a first possible implementation manner, an operating system is disposed in the electronic device and interacts with the controller, where the operating system is configured to execute the method for processing a hardware error according to the second aspect or any one of the possible implementation manners of the second aspect.

In a sixth aspect, embodiments of the present application provide a non-volatile computer readable storage medium storing program code which, when executed by a computer, performs a method for handling a hardware error according to the second aspect or any one of the possible implementations of the second aspect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a block diagram of an electronic device according to an embodiment of the present application;

FIG. 2 is a first block diagram of a controller according to an embodiment of the present application;

FIG. 3 is a second block diagram of a controller according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for processing a hardware error according to an embodiment of the present application;

FIG. 5 is a first scenario diagram of a storage area according to an embodiment of the present application;

FIG. 6 is a second scenario diagram of a storage area according to an embodiment of the present application;

FIG. 7 is a third scenario diagram of a storage area according to an embodiment of the present application;

FIG. 8 is a fourth scenario diagram of a storage area according to an embodiment of the present application;

FIG. 9 is a fifth scenario diagram of a storage area according to an embodiment of the present application;

FIG. 10 is a sixth scenario diagram of a storage area according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

Referring to fig. 1, an embodiment of the present application provides an electronic device 10, where the electronic device 10 may be a terminal or a server, and the terminal may be a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a Point of Sales (POS), or the like; the server may be a single server or a server group (the server group may be centralized or distributed).

By way of example, the electronic device 10 may include: a memory 11, a controller 12 connected to the memory 11, and an operating system 13 interacting with the controller 12.

In this embodiment, the memory 11 may be DDR (Double Data Rate) memory, SDRAM (Synchronous Dynamic Random-Access Memory), RDRAM (Rambus Dynamic Random-Access Memory), or the like. To store error information for errors occurring in monitored hardware on the electronic device 10 or in monitored hardware on other devices, a storage area for storing the error information is preset in the memory 11.

In this embodiment, the controller 12 may be a processor, a memory controller, or a DMA (Direct Memory Access) controller. The memory controller may be integrated into the processor, for example connected to a processing core in the processor as part of the processor, or may be a separate chip. When the controller 12 detects that an error has occurred in the monitored hardware, it may generate error information for the error and write the error information to the storage area. The controller 12 may also send an interrupt notification of the error information to the operating system 13 so that the operating system 13 can read the error information from the storage area as instructed by the interrupt notification.

It can be understood that, after the monitored hardware is in error, the error information is written into the preset storage area in the memory 11, so that the operating system 13 can acquire the error information at a faster speed when reading.

Specifically, as shown in the block diagram of the controller 12 in FIG. 2, the controller 12 may include: an error monitoring circuit 121, a storage area management circuit 122, and a write memory control circuit 123.

The error monitoring circuit 121 is connected to the monitored hardware and to the storage area management circuit 122 to monitor whether an error occurs in the monitored hardware, and notifies the storage area management circuit 122 when an error is detected.

The storage area management circuit 122 is connected to the write memory control circuit 123. On receiving the indication from the error monitoring circuit 121, the storage area management circuit 122 generates error information and determines whether the error information can be written into the storage area. If it can, the storage area management circuit 122 sends the error information to the write memory control circuit 123, and the write memory control circuit 123 writes the error information to the storage area. After the error information has been written into the storage area, the storage area management circuit 122 sends an interrupt notification of the error information to the operating system.

In a further embodiment, as shown in FIG. 3, the controller 12 may further include a register 124 connected to the error monitoring circuit 121. The register 124 stores the error codes of errors whose error information needs to be written into the storage area. When the error monitoring circuit 121 detects an error, it may read the error code in the register 124 to determine whether the error code of the error of the monitored hardware is the same as the error code in the register 124. If the error codes are the same, the storage area management circuit 122 is instructed to generate error information; if they are not the same, the storage area management circuit 122 is instructed to interrupt the operating system in the conventional manner so that the operating system itself generates the error information.

In the following, by means of the method embodiment, how the controller 12, the operating system 13 and the memory 11 cooperate to implement the reading and writing of error information will be described in detail.

Referring to fig. 4, an embodiment of the present application provides a method for processing a hardware error, where the method may be cooperatively executed by a controller 12 and an operating system 13 in an electronic device 10, and a main flow of the method for processing a hardware error may include:

Step S100: when the controller detects that an error has occurred in the monitored hardware, it generates error information for the error.

Step S200: the controller writes the error information into a preset storage area in the memory.

Step S300: the controller sends an interrupt notification of the error information to the operating system.

Step S400: the operating system receives the interrupt notification sent by the controller.

Step S500: the operating system reads the error information written by the controller from the preset storage area of the memory according to the interrupt notification.
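Steps S100 to S500 can be walked through with a toy simulation. It is only a sketch under simplifying assumptions: the memory is a Python list, the interrupt is a direct method call carrying the head and tail positions, and wrap-around of the storage area is not modelled. The classes `Controller` and `OperatingSystem` and their methods are hypothetical.

```python
class OperatingSystem:
    def __init__(self, memory: list[int]):
        self.memory = memory
        self.seen: list[list[int]] = []

    def on_interrupt(self, head: int, tail: int) -> None:
        # S400: the interrupt notification arrives with the pointer positions.
        # S500: read the error information straight from the storage area.
        self.seen.append(self.memory[head:tail + 1])


class Controller:
    def __init__(self, memory: list[int], os_: OperatingSystem):
        self.memory = memory
        self.os = os_
        self.tail = -1  # nothing written yet

    def on_hardware_error(self, params: list[int]) -> None:
        info = list(params)                        # S100: generate error information
        head = self.tail + 1
        if head + len(info) > len(self.memory):
            raise MemoryError("storage area full (wrap-around not modelled)")
        self.memory[head:head + len(info)] = info  # S200: write to the storage area
        self.tail = head + len(info) - 1
        self.os.on_interrupt(head, self.tail)      # S300: notify the operating system
```

Each error costs the operating system one notification and one bounded read of memory, with no register polling in between.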

The above-mentioned flow will be described in detail with reference to specific application scenarios.

The controller 12 may be connected to monitored hardware to monitor its operation in real time, where the monitored hardware may be conventional hardware such as an input/output system, a motherboard, or the memory 11. Once an error occurs in the monitored hardware, the controller 12 may determine, through interaction with the monitored hardware, the parameters at the time of the error and the error code of the error. The error code indicates the type of the error, and different error types of different hardware correspond to different error codes. For example, a system bus error, an ECC error of the memory 11, a parity error, a cache error, and a TLB error each have a different error code.

It will be appreciated that the way the controller 12 determines the parameters and error codes through interaction with the monitored hardware is conventional in the art and, for brevity, is not described in detail in this embodiment.

After the controller 12 determines the parameters at the time of the error, in one approach the controller 12 may directly generate the corresponding error information from the parameters. Alternatively, the controller 12 may first determine whether it needs to generate the error information, and if so, generate the corresponding error information from the parameters; otherwise, the controller 12 interrupts the operating system 13 in the conventional manner so that the operating system 13 itself generates the error information. The two approaches are described in turn below.

Regarding the manner in which the corresponding error information is generated directly from the parameters:

In this embodiment, error information is generated differently for different types of errors. For some types of errors, the parameters of the error already reflect the specifics of the error and can be used directly as the error information; for example, the parameters at the time a parity error occurs reflect the specifics of that parity error, so they can be used as the error information. For other types of errors, the parameters cannot reflect the specifics of the error, so intermediate operations on the parameters are needed to obtain error information that does; for example, for an ECC error of the memory 11, the parameters of the ECC error do not show in which rank, bank, or row of the memory 11 the error lies, so the controller 12 needs to use the system address in the parameters to calculate the rank, bank, or row address of the error and thereby determine its specific location in the memory 11. The calculated address is the final error information.
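The kind of intermediate operation needed for an ECC error can be illustrated by decoding a system address into rank, bank, and row fields. The bit layout below is purely hypothetical; the source does not give one, and real memory controllers typically use platform-specific interleaving or hashing. The function name `decode_dram_address` is also an assumption.

```python
def decode_dram_address(system_address: int) -> dict[str, int]:
    """Split a system address into DRAM coordinates.

    Hypothetical layout, from the least significant bit upward:
      bits 0-2   byte offset within a 64-bit word
      bits 3-12  column (10 bits)
      bits 13-15 bank   (3 bits)
      bit  16    rank   (1 bit)
      bits 17-32 row    (16 bits)
    """
    return {
        "column": (system_address >> 3) & 0x3FF,
        "bank": (system_address >> 13) & 0x7,
        "rank": (system_address >> 16) & 0x1,
        "row": (system_address >> 17) & 0xFFFF,
    }
```

Under this layout, the decoded fields rather than the raw system address would serve as the error information, whereas for a parity error the raw parameters would be used unchanged.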

Regarding the manner in which the corresponding error information is judged and then generated:

In this embodiment, the error codes of errors whose error information needs to be written into the storage area are stored in advance in a register of the controller 12 itself. Error information that needs to be written into the storage area is error information generated by operating on the parameters of the error; for example, the error information of an ECC error of the memory 11 needs to be written into the storage area. Error information that does not need to be written into the storage area is the error parameter itself; for example, the error information of the parity error described above does not need to be written into the storage area. When the controller 12 acquires the error code of the error of the monitored hardware, it can obtain from the register the error codes of errors whose error information needs to be written into the storage area, and determine whether the error code of the error of the monitored hardware is the same as an error code obtained from the register. If the error codes are the same, the controller 12 generates the corresponding error information and writes it into the storage area. If the error codes are different, the controller 12 interrupts the operating system 13 in the conventional manner: the controller 12 writes the error parameters and error code of the monitored hardware into the MSRs defined by the MCA and then sends a conventional interrupt notification of the monitored hardware to the operating system 13, so that the operating system 13 can poll and read the MSRs according to the interrupt notification, obtain the error parameters and error code, and generate the corresponding error information from them.
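The comparison against the register can be sketched as a lookup. The numeric error codes and the names `ECC_ERROR`, `PARITY_ERROR`, and `route_error` are hypothetical; the source assigns no values to error codes.

```python
# Hypothetical error codes; the source does not define numeric values.
ECC_ERROR = 0x01
PARITY_ERROR = 0x02


def route_error(error_code: int, codes_in_register: frozenset[int]) -> str:
    """Pick the handling path for a detected error."""
    if error_code in codes_in_register:
        # The controller computes the error information and writes it
        # into the preset storage area in memory.
        return "write_to_storage_area"
    # The controller records parameters and code in the MSRs and raises a
    # conventional interrupt; the operating system polls the MSRs itself.
    return "conventional_msr_interrupt"
```

With only the ECC code held in the register, an ECC error takes the storage-area path and a parity error falls back to the conventional MSR path.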

In this embodiment, in order to facilitate the writing of error information by the controller 12, a storage area for storing error information is preset in the memory 11.

For example, the storage area may be a ring buffer allocated by firmware in a processor of the electronic device 10. That is, the firmware determines a start address and an end address of the storage area in the memory 11 according to a preset rule, where the start address and the end address are physical addresses in the memory 11, and the region of the memory 11 from the start address to the end address, which contains a plurality of address bits, is the storage area.

In addition, to facilitate determining the read condition and the storage condition of the storage area, the firmware may further set a write head pointer and a write tail pointer in the storage area for representing the storage condition of the storage area, and set a read pointer for representing the read condition of the storage area; the position pointed by the write head pointer is one address bit occupied by the data written first in the storage area, the position pointed by the write tail pointer is one address bit occupied by the data written last in the storage area, and the position pointed by the read pointer is one address bit read currently in the storage area.

It will be appreciated that, because the storage area has only just been allocated and no data has yet been stored in it, when setting the write head pointer, the write tail pointer, and the read pointer for the storage area, the firmware may select one address bit from the plurality of address bits of the storage area and set the write head pointer, the write tail pointer, and the read pointer all to that address bit.

Finally, the firmware may send the start address and the end address of the storage area to the controller 12, so that the controller 12 can subsequently access the storage area based on the start address and the end address.

It should be noted that, in practical application, the user may adjust the rule in the firmware according to the actual requirement, so as to correspondingly adjust the allocation of the start address and the end address, and further correspondingly adjust the size of the allocated ring buffer area, so that the size of the ring buffer area can meet the actual requirement. For example, if a larger ring buffer is needed, the rule is adjusted to make the distance between the determined start address and the determined end address longer; otherwise, if a smaller ring buffer is needed, the rule is adjusted to make the distance between the determined start address and the determined end address smaller.

The technical solution of the present application will be described in more detail by means of an assumption.

As shown in fig. 5, assume that the start address is AAXX, the end address is BBXX, and 8 address bits lie between the start address AAXX and the end address BBXX, so that the allocated storage area is a ring buffer containing 10 address bits. In fig. 5, a white address bit indicates that no data is stored in that address bit, and a gray address bit indicates that data is stored in it. In fig. 5 all 10 address bits are white, indicating that no data is stored in the newly allocated storage area, so the firmware randomly selects the 5th address bit and sets the write head pointer, the write tail pointer, and the read pointer all to the 5th address bit. Finally, the firmware sends the start address AAXX and the end address BBXX to the controller 12 for the controller 12 to access the storage area.

In this embodiment, after the controller 12 generates the error information of the current error, since the storage area may also store other error information, the controller 12 needs to access the storage area to determine whether the remaining storage space in the storage area is greater than or equal to the storage space that the error information needs to occupy for writing.

As an example of how the controller 12 judges, by accessing the storage area, whether the remaining storage space in the storage area is greater than or equal to the storage space that writing the error information needs to occupy:

the firmware sends the start address and the end address to the controller 12 in advance for accessing the storage area, and the start address and the end address are physical addresses in the memory 11, so the controller 12 can use them to access the storage area directly and quickly without performing address translation. Because the storage area is provided with the write head pointer and the write tail pointer for representing the storage condition of the storage area, the controller 12 can, by accessing the storage area, obtain the positions pointed to by the write head pointer and the write tail pointer, and determine the remaining storage space in the storage area from those positions. Thus, the controller 12 can determine whether the remaining storage space in the storage area is greater than or equal to the storage space that writing the error information needs to occupy.

Referring to fig. 5, the foregoing assumption is continued: when the hardware A generates the error A1, the controller 12 may generate error information XXX of the error A1 and determine that the content of the error information XXX itself needs to occupy 3 address bits. When the error information XXX is stored in the storage area, the start address of the error information XXX must be written into the address bit adjacent to the address bit holding the first data of the error information XXX, and the end address of the error information XXX must be written into the address bit adjacent to the address bit holding the last data of the error information XXX. The storage space that writing the error information XXX needs to occupy is therefore the 3 address bits of content plus the 2 address bits occupied by the start address and the end address, for a total of 5 address bits.

Further, by accessing the storage area, the controller 12 learns that the write head pointer and the write tail pointer in the storage area both point to the 5th address bit. Because both pointers point to the 5th address bit, the controller 12 may determine that the remaining storage space in the storage area is 10 address bits. Thus, by comparing the remaining storage space of 10 address bits with the 5 address bits that writing the error information XXX needs to occupy, the controller 12 determines that the error information XXX can be written into the storage area.
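The space check in this example can be sketched as follows. The convention that equal head and tail positions mean an empty area, and that otherwise every address bit from head through tail is occupied, is an assumption inferred from the worked example; the function names are hypothetical.

```python
def remaining_slots(head: int, tail: int, size: int) -> int:
    """Free address bits given 1-based write head/tail positions.

    Assumed convention: head == tail means the area is empty; otherwise
    the address bits from head through tail (inclusive, wrapping around)
    are occupied.
    """
    if head == tail:
        return size
    occupied = (tail - head) % size + 1
    return size - occupied


def fits(head: int, tail: int, size: int, content_slots: int) -> bool:
    """A record needs its content plus one start and one end address bit."""
    return remaining_slots(head, tail, size) >= content_slots + 2
```

With a 10-address-bit area and both pointers at the 5th address bit, `remaining_slots(5, 5, 10)` gives 10, so a record with 3 address bits of content (5 in total) fits.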

In this embodiment, if the controller 12 determines that the remaining storage space in the storage area is greater than or equal to the storage space that the error information needs to be written into, the controller 12 writes the error information into the remaining storage space, that is, the controller 12 writes the content of the error information into the corresponding address bit in the remaining storage space, writes the start address of the error information into the corresponding one address bit in the remaining storage space, and writes the end address of the error information into the corresponding other address bit in the remaining storage space, where the corresponding one address bit is adjacent to the address bit where the first data of the error information is located, and the corresponding other address bit is adjacent to the address bit where the last data of the error information is located.

It will be appreciated that if there is no other error information stored in the memory area during the writing of the error information, the controller 12 may randomly select an address bit as the start of the writing, and write the error information, or the controller 12 may write the error information by using an address bit pointed by both the write head pointer and the write tail pointer as the start of the writing. If there are other error information stored in the memory area during writing of the error information, the controller 12 may select one address bit adjacent to the other error information stored in the memory area in the remaining memory space as the start of the writing, and write the error information to the memory area in order to realize efficient and orderly utilization of the memory space.

Further, after the controller 12 writes the error information, the controller 12 may update the position pointed to by the write tail pointer in the storage area to the address bit in which the end address of the error information is located, then generate an interrupt notification carrying the positions pointed to by the write head pointer and the write tail pointer, and send the interrupt notification to the operating system 13, so as to instruct the operating system 13 to read the error information in the storage area according to the positions pointed to by the write head pointer and the write tail pointer in the interrupt notification.

Referring to fig. 5 to 7, the above assumption is continued: as described above, the storage area containing 10 address bits has not stored any other error information before the error information XXX is stored, so the controller 12 may take the 5th address bit, to which both the write head pointer and the write tail pointer point, as the start of this write, write the error information XXX, and update the write tail pointer to point to the 9th address bit, thereby obtaining the storage area shown in fig. 6. Finally, the controller 12 generates an interrupt notification carrying the write head pointer pointing to the 5th address bit and the write tail pointer pointing to the 9th address bit, and sends it to the operating system 13.

After that, when the B hardware generates the error B1, the controller 12 may generate error information YYY of the error B1, determine that the content of the error information YYY needs to occupy 3 address bits, and further determine that writing the error information YYY into the storage area needs to occupy 5 address bits. The controller 12 accesses the storage area again, and if the previously written error information XXX has not yet been read out by the operating system 13, the controller 12 finds that the write head pointer points to the 5th address bit and the write tail pointer points to the 9th address bit. The controller 12 can therefore determine that the remaining storage space in the storage area is the 5 address bits from the 10th address bit, wrapping around, to the 4th address bit. By comparing the remaining storage space of 5 address bits with the 5 address bits that writing the error information YYY needs to occupy, the controller 12 determines that the error information YYY can also be written into the storage area. The controller 12 then takes the 10th address bit, which is adjacent to the position pointed to by the write tail pointer, as the start of this write, writes the error information YYY, and updates the write tail pointer to point to the 4th address bit, thereby obtaining the storage area shown in fig. 7. Finally, the controller 12 generates an interrupt notification carrying the write head pointer pointing to the 5th address bit and the write tail pointer pointing to the 4th address bit, and sends it to the operating system 13.
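The two writes in this example can be traced with a small toy model. This is only a sketch under stated assumptions: the class name `ErrorRing` is hypothetical, and the symbols "<" and ">" stand in for the stored start address and end address of each piece of error information.

```python
class ErrorRing:
    """Toy model of the storage area; address bits are numbered 1..size."""

    def __init__(self, size: int, start: int):
        self.size = size
        self.slots = [None] * (size + 1)   # index 0 is unused
        self.head = self.tail = start      # equal pointers: area is empty

    def _next(self, pos: int) -> int:
        """Address bit that follows `pos`, wrapping from `size` back to 1."""
        return pos % self.size + 1

    def remaining(self) -> int:
        if self.head == self.tail:
            return self.size
        return self.size - ((self.tail - self.head) % self.size + 1)

    def write(self, content: str) -> bool:
        """Store start marker + content + end marker; False if no room."""
        record = ["<"] + list(content) + [">"]   # marker symbols are assumed
        if self.remaining() < len(record):
            return False
        empty = self.head == self.tail
        pos = self.head if empty else self._next(self.tail)
        for item in record:
            self.slots[pos] = item
            last = pos
            pos = self._next(pos)
        self.tail = last                   # tail now marks the end marker
        return True
```

Starting from `ErrorRing(10, 5)`, writing "XXX" leaves the pointers at (5, 9) and writing "YYY" leaves them at (5, 4), matching the positions carried in the two interrupt notifications of the example; a further write returns `False` because no address bits remain.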

Returning to the present embodiment, suppose the controller 12 determines that the remaining storage space in the storage area is smaller than the storage space that writing the error information needs to occupy. As one option, the controller 12 again interrupts the operating system 13 in the conventional manner, that is, the controller 12 may write the parameters and error code of the error of the monitored hardware into the MSR defined by the MCA and then send a conventional interrupt notification of the monitored hardware to the operating system 13, so that the operating system 13 polls and reads the MSR according to the interrupt notification, thereby obtaining the parameters and error code of the error and generating the corresponding error information from them. As another option, the controller 12 may write the error information into the storage area and overwrite historical error information previously stored there; for example, the controller 12 may select the historical error information in the storage area in order of storage time, from earliest to latest, until the space occupied by the selected historical error information is greater than or equal to the space that writing the current error information needs to occupy, and then write the current error information over the selected historical error information.

Referring to fig. 7 and 8, the foregoing assumption is continued: after the error B1 occurs in the B hardware, if the error C1 occurs in the C hardware, the controller 12 may generate error information ZZ of the error C1, determine that the content of the error information ZZ needs to occupy 2 address bits, and further determine that writing the error information ZZ into the storage area needs to occupy 4 address bits. The controller 12 accesses the storage area again, and if the previously written error information XXX and error information YYY have not yet been read out by the operating system 13, the controller 12 finds that the write head pointer points to the 5th address bit and the write tail pointer points to the 4th address bit. Since the position pointed to by the write head pointer is adjacent to the position pointed to by the write tail pointer, the controller 12 can determine that the storage area is full. If the error information ZZ still needs to be written, the controller 12 may first select the error information XXX in the storage area, following the order of storage time from earliest to latest. Since the error information XXX occupies 5 address bits, which is more than the 4 address bits that writing the error information ZZ needs to occupy, the controller 12 writes the error information ZZ over the selected error information XXX and updates the write tail pointer to point to the 3rd address bit, thereby obtaining the storage area shown in fig. 8. The controller 12 then generates an interrupt notification carrying the write head pointer pointing to the 5th address bit and the write tail pointer pointing to the 3rd address bit, and sends it to the operating system 13.
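The oldest-first selection used by the overwrite option can be sketched as below. The function name and the list-of-tuples representation of the historical error information are assumptions made for illustration only.

```python
def select_for_overwrite(history, needed_slots):
    """Pick the oldest records until their combined size covers the need.

    `history` is a list of (name, occupied_address_bits), oldest first.
    Returns the names of the records to overwrite, or None if even the
    whole history is too small.
    """
    chosen, freed = [], 0
    for name, occupied in history:
        if freed >= needed_slots:
            break
        chosen.append(name)
        freed += occupied
    return chosen if freed >= needed_slots else None
```

For the example above, `select_for_overwrite([("XXX", 5), ("YYY", 5)], 4)` selects only XXX, because its 5 address bits already cover the 4 address bits that ZZ needs.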

For the operating system 13, the user can set the processing strategy of the operating system 13 on the received interrupt notification sent by the controller 12 according to the actual application requirement; for example, the processing policy may be to instruct the operating system 13 to read corresponding error information in real time based on the interrupt notification every time the interrupt notification is received; for another example, the processing policy may instruct the operating system 13 to determine whether an interrupt notification sent by the controller 12 is received at a set time point of each cycle, and after determining that the interrupt notification is received, read corresponding error information based on the latest received interrupt notification.

Further, when the operating system 13 processes the interrupt notification, the operating system 13 can obtain the positions pointed by the write head pointer and the write tail pointer carried in the interrupt notification through analyzing the interrupt notification. Based on the positions pointed by the write head pointer and the write tail pointer, the operating system 13 can determine that the error information is stored in the storage area from the position pointed by the write head pointer to the area included in the position pointed by the write tail pointer, and read the data in each address bit in the area according to the positions pointed by the write head pointer and the write tail pointer, thereby reading the error information.

It will be appreciated that the area from the position pointed to by the write head pointer to the position pointed to by the write tail pointer may contain more than one piece of error information. Since each piece of error information is written into the storage area in the format "start address of the error information + content of the error information + end address of the error information", the operating system 13 can, when reading the area, distinguish the individual pieces of error information it reads based on this format.
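A minimal sketch of how a reader might separate records framed in this way is shown below. As an assumption for illustration, the symbols "<" and ">" stand in for the stored start address and end address; the function name is hypothetical.

```python
def split_records(slots):
    """Split a sequence of address-bit values into record contents.

    Assumes "<" and ">" stand in for the stored start and end addresses
    that frame each piece of error information.
    """
    records, current = [], None
    for value in slots:
        if value == "<":
            current = []                      # a new record begins
        elif value == ">" and current is not None:
            records.append("".join(current))  # the record is complete
            current = None
        elif current is not None:
            current.append(value)             # content inside a record
    return records
```

Reading the values `<YYY><ZZ>` in order yields the two separate pieces of error information YYY and ZZ.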

The operating system 13 also updates the location in the memory area pointed to by the read pointer to point to an address bit that is currently read by the operating system 13. In this way, when the controller 12 subsequently revisits the storage area, the controller 12 can update the position pointed by the write head pointer to one address bit currently read by the operating system 13 according to the position pointed by the read pointer after updating, thereby realizing synchronous updating of the pointer and avoiding errors.

It should also be understood that the update of the location pointed by the read pointer by the operating system 13 may be that, each time the operating system 13 reads data in one address bit, the location pointed by the read pointer is updated to one address bit of the data currently read by the operating system; alternatively, each time the operating system 13 reads an error message, the position pointed by the read pointer may be updated to an address bit corresponding to the end address of the error message currently read; alternatively, when the operating system 13 reads all the error messages this time, the position pointed by the read pointer may be updated to one address bit corresponding to the end address of the error message that was read last.

Referring to fig. 8 to 10, the above assumption is continued: if the operating system 13 adopts the periodic processing manner, then at the set time point of the current period the operating system 13 determines that it has received 3 interrupt notifications sent in sequence by the controller 12, and processes the most recently received one, that is, the interrupt notification carrying the write head pointer pointing to the 5th address bit and the write tail pointer pointing to the 3rd address bit. Based on the 5th address bit pointed to by the write head pointer and the 3rd address bit pointed to by the write tail pointer, the operating system 13 may read the area of 9 address bits that runs from the 5th address bit to the 3rd address bit. By sequentially reading out the data contained in each of the 9 address bits, the operating system 13 reads out the error information YYY first and then the error information ZZ. If the read pointer is updated in units of error information, then when the operating system 13 has read the error information YYY it may update the read pointer to point to the 9th address bit, thereby obtaining the storage area shown in fig. 9, and when the operating system 13 has read the error information ZZ it may update the read pointer to point to the 3rd address bit, thereby obtaining the storage area shown in fig. 10. When the controller 12 subsequently accesses the storage area again, it also updates the write head pointer to point to the 3rd address bit, so that the position pointed to by the write head pointer again coincides with the position pointed to by the write tail pointer, indicating that no error information is stored in the storage area.
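The wrap-around traversal in this example can be sketched as a helper that lists the address bits from the write head position through the write tail position. The function name is hypothetical, and 1-based numbering of the address bits is assumed to match the example.

```python
def read_region(head: int, tail: int, size: int) -> list:
    """1-based address bits from head through tail, wrapping at `size`."""
    count = (tail - head) % size + 1
    return [(head - 1 + i) % size + 1 for i in range(count)]
```

For a 10-address-bit area with the write head pointer at the 5th address bit and the write tail pointer at the 3rd, the helper returns the 9 address bits 5, 6, 7, 8, 9, 10, 1, 2, 3, which is the order in which the operating system reads them in the example.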

Some embodiments of the present application also provide a non-volatile computer readable storage medium storing computer executable program code. The storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk, and the program code stored on it, when executed by a computer, performs the steps of the method for processing a hardware error according to any of the above embodiments.

The program code product of the method for processing a hardware error provided in the embodiment of the present application includes a computer readable storage medium storing program code, and instructions included in the program code may be used to execute the method in the foregoing method embodiment, and specific implementation may refer to the method embodiment and will not be described herein.

In summary, the error information is written into a preset storage area in the memory and an interrupt notification is sent to the operating system to instruct it to read the storage area. Compared with polling the MSR, the operating system does not need to poll the storage area and can read it at a higher rate, so the error information can be obtained quickly and a fast error response is achieved.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. The method for processing the hardware error is characterized by being applied to a controller in electronic equipment, wherein the controller is connected with monitored hardware and a memory in the electronic equipment, and an operating system is deployed in the electronic equipment, and the method comprises the following steps:

generating error information of the error when the monitored hardware is monitored to generate the error;

writing the error information into a storage space contained in a preset storage area in the memory;

and sending an interrupt notification of the error information to the operating system, wherein the interrupt notification is used for indicating the operating system to read the error information from the storage area.

2. The method for processing a hardware error according to claim 1, wherein writing the error information into a storage space included in a storage area preset in the memory includes:

and determining that the residual storage space in the storage area is larger than or equal to the storage space which is required to be occupied by the writing of the error information, and writing the error information into the residual storage space.

3. The method according to claim 2, wherein determining that the remaining storage space in the storage area is equal to or larger than the storage space that the error information writing is required to occupy, comprises:

Accessing the storage area to obtain the positions pointed by the write head pointer and the write tail pointer in the storage area respectively;

determining the remaining storage space according to the position;

and determining that the remaining storage space is larger than or equal to the storage space needing to be occupied.

4. A method of handling a hardware error according to claim 3, wherein writing the error information to the remaining memory space comprises:

writing the error information into corresponding address bits in the residual storage space, writing a start address of the error information into corresponding one address bit in the residual storage space, and writing an end address of the error information into corresponding another address bit in the residual storage space, wherein the corresponding one address bit is adjacent to the address bit of the first bit data of the error information, and the corresponding another address bit is adjacent to the address bit of the last bit data of the error information.

5. The method for processing a hardware error according to claim 4, wherein after writing the error information into a storage space included in a storage area preset in the memory, the method further comprises:

and updating the position pointed to by the write tail pointer to the address bit in which the end address of the error information is located.

6. The method of claim 2, wherein if the remaining storage space is smaller than the storage space to be occupied, the method further comprises:

sending an interrupt notification of the monitored hardware to the operating system, wherein the interrupt notification of the monitored hardware is used for indicating the operating system to generate the error information; or,

the error information is written into the storage area to overwrite historical error information in the storage area.

7. The method of processing a hardware error according to claim 2, wherein accessing the memory area comprises:

and accessing the storage area according to the start address and the end address of the storage area which are pre-allocated by firmware in the electronic equipment.

8. The method for processing a hardware error according to claim 7, wherein,

the start address and the end address are physical addresses in the memory.

9. The method of claim 1, wherein the controller is further coupled to a corresponding register in the electronic device, wherein after detecting an error in the monitored hardware, and before generating the error information, the method further comprises:

Obtaining an error code of an error, the error information of which needs to be written into the storage area, from the register, wherein the error information of the error, which needs to be written into the storage area, is generated by calculating based on an error parameter, and the error information of the error, which does not need to be written into the storage area, is the error parameter;

determining that the error code of the error of the monitored hardware is the same as the error code obtained from the register.

10. The method according to claim 1, wherein the error information is stored in an area from a position pointed to by a write head pointer to a position pointed to by a write tail pointer in the storage area; sending an interrupt notification of the error information to the operating system, including:

and generating the interrupt notification carrying the positions pointed by the write head pointer and the write tail pointer respectively, and sending the interrupt notification to the operating system.

11. A method for processing a hardware error, the method being applied to an operating system in an electronic device, the operating system being connected to a controller in the electronic device, the controller being further connected to monitored hardware and a memory in the electronic device, the method comprising:

Receiving an interrupt notification sent by the controller based on writing error information into a storage space contained in a storage area preset in the memory, wherein the error information is generated by the controller based on the fact that the monitored hardware is monitored to have errors;

and reading the error information from the storage area according to the interrupt notification.

12. The method according to claim 11, wherein the error information is stored in an area included from a position pointed to by a write head pointer to a position pointed to by a write tail pointer in the storage area, and the interrupt notification carries the positions pointed to by the write head pointer and the write tail pointer, respectively; reading the error information from the storage area according to the interrupt notification, including:

and reading the contained area according to the positions pointed by the write head pointer and the write tail pointer so as to read out the error information.

13. The method of claim 12, wherein the step of reading out the error information comprises:

and sequentially reading the data in each address bit in the contained area, thereby reading out the error information.

14. The method for processing a hardware error according to claim 13, wherein after reading out the error information, the method further comprises:

and updating the position pointed by the read pointer in the storage area to point to one address bit currently read by the operating system, so that when the controller subsequently revisits the storage area, the position pointed by the write head pointer is updated to point to one address bit currently read by the operating system according to the position pointed by the updated read pointer.

15. A controller, comprising: an error monitoring circuit, a memory area management circuit and a write memory control circuit;

the error monitoring circuit is connected with the storage area management circuit and is used for being connected with monitored hardware, and the error monitoring circuit is used for indicating the storage area management circuit when the monitored hardware is monitored to have errors;

the storage area management circuit is connected with the writing memory control circuit and is used for generating error information of the errors according to the received indication and judging whether the error information can be written into a preset storage area in a memory or not; and if the memory area can be written, sending the error information to the writing memory control circuit, writing the error information into a memory space contained in the memory area by the writing memory control circuit, and sending an interrupt notification of the error information to an operating system after the error information is written into the memory space contained in the memory area.

16. The controller of claim 15, wherein the controller further comprises: the register is connected with the error monitoring circuit and is used for storing error codes of errors, the error information of which needs to be written into the storage area;

when the error monitoring circuit detects that an error occurs, the error monitoring circuit is used for reading the error code in the register and judging whether the error code of the error of the monitored hardware is the same as the error code in the register; and if the error codes are the same, indicating the storage area management circuit to generate the error information.

17. A processor, comprising: a processor core, a memory controller coupled to the processor core, the memory controller configured to perform the method of handling hardware errors according to any of claims 1-10.

18. An electronic device, comprising: a memory and a controller connected with the memory; the controller is configured to perform the method for handling hardware errors according to any of claims 1-10.

19. The electronic device of claim 18, wherein an operating system is disposed in the electronic device and interacts with the controller, the operating system configured to perform the method of handling hardware errors of any of claims 11-14.

20. A non-volatile computer readable storage medium, characterized in that a program code is stored, which when run by a computer performs the method of handling hardware errors according to any of claims 11-14.