CN114518972A

CN114518972A - Memory error processing method and device, memory controller and processor

Info

Publication number: CN114518972A
Application number: CN202210132997.0A
Authority: CN
Inventors: 宋明辉; 洪佳华; 曾峰
Original assignee: Haiguang Information Technology Co Ltd
Current assignee: Haiguang Information Technology Co Ltd
Priority date: 2022-02-14
Filing date: 2022-02-14
Publication date: 2022-05-20
Anticipated expiration: 2042-02-14
Also published as: CN114518972B

Abstract

The embodiment of the application provides a memory error processing method, a memory error processing device, a memory controller and a processor, wherein the memory error processing method comprises the steps of obtaining an error signal of a memory error, wherein the error signal comprises type information of the memory error and an error memory identifier of the memory error; when the memory error can be corrected according to the data check code of the memory error or a recovery command corresponding to the memory error is determined in a pre-stored recovery command sequence according to the type information of the error signal, determining the memory in which the memory error occurs according to the error memory identifier and all pre-stored memory information, and correcting the memory or sending the recovery command to the memory according to the data check code. The memory error processing method and the memory error processing device can improve the memory error processing efficiency.

Description

Memory error processing method and device, memory controller and processor

Technical Field

The embodiment of the application relates to the field of computers, in particular to a memory error processing method and device, a memory controller and a processor.

Background

With the increase in the data transmission rate of computer memory and in the internal working frequency of chips, the probability of errors in memory command and data signals during transmission has also increased greatly. A data check code is usually used to detect whether a memory error has occurred. Memory errors are generally divided into recoverable errors and unrecoverable errors according to whether the data check code has an error correction function: for a recoverable error, the erroneous data can be corrected directly into correct data by the error correction function of the data check code (error signal); for an unrecoverable error, the only option is to restart or reset the SoC system.

In the prior art, system software or a baseboard management controller is usually used to locate the memory in which an error occurred and to process the memory error. However, the intervention of the system software or the baseboard management controller slows the correction of recoverable errors, and because restarting and resetting the SoC system takes a long time, unrecoverable errors are also processed slowly. The prior art therefore suffers from low memory error processing efficiency.

Therefore, how to improve the memory error processing efficiency becomes a technical problem that needs to be solved urgently by those skilled in the art.

Disclosure of Invention

In view of the above, embodiments of the present disclosure provide a memory error processing method, a memory error processing apparatus, a memory controller, and a processor, so as to improve memory error processing efficiency.

In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:

in a first aspect, an embodiment of the present application provides a memory error processing method, including:

acquiring an error signal of a memory error, wherein the error signal comprises type information of the memory error and an error memory identifier of the memory error;

when the memory error can be corrected according to the data check code of the memory error or a recovery command corresponding to the memory error is determined in a pre-stored recovery command sequence according to the type information of the error signal, determining the memory in which the memory error occurs according to the error memory identifier and all pre-stored memory information, and correcting the memory or sending the recovery command to the memory according to the data check code.

In a second aspect, an embodiment of the present application provides a memory error processing apparatus, including:

the error signal acquisition module is used for acquiring an error signal of the memory error, wherein the error signal comprises the type information of the memory error and the error memory identifier of the memory error;

and the memory error processing module is suitable for determining a memory with the memory error according to the error memory identifier and all pre-stored memory information when the memory error can be corrected according to the data check code of the memory error or a recovery command corresponding to the memory error is determined in a pre-stored recovery command sequence according to the type information of the error signal, and correcting the memory according to the data check code or sending the recovery command to the memory.

In a third aspect, an embodiment of the present application provides a memory controller, which includes the memory error handling apparatus as described in the second aspect.

In a fourth aspect, an embodiment of the present application provides a processor including the memory error handling apparatus according to the second aspect.

The memory error processing method provided by the embodiment of the application is suitable for a memory controller, and when the memory error can be corrected according to the data check code of the memory error or a recovery command corresponding to the memory error is determined in a pre-stored recovery command sequence according to the type information of the error signal, the memory with the memory error is determined according to the error memory identifier and all pre-stored memory information, and then the memory is corrected according to the data check code or the recovery command is sent to the memory, so that the error in the memory is corrected or recovered to a normal working state from an error state.

It can be seen that, in the memory error processing method provided in the embodiment of the present application, by pre-storing all the memory information in the memory controller, the memory in which the corresponding memory error occurs can be determined according to the error memory identifier of the memory error, so that when it is determined that the memory error can be corrected according to the data check code of the memory error according to the type information of the error signal, the corresponding memory can be directly corrected according to the data check code; in addition, a recovery command corresponding to the type of the memory error can be stored in the memory controller in advance, so that when the recovery command corresponding to the memory error is determined in the pre-stored recovery command sequence, the recovery of the memory from the error state to the normal working state can be realized by using the corresponding recovery command, further, the memory error which cannot be directly corrected according to the error signal can be recovered without restarting the SoC. It can be seen that the whole processing process mainly relates to a memory controller and a memory, the transmission path of signals is short, the processing logic is simple, the recovery time of the memory from an error state entering due to memory errors to a normal working state is short, and the memory error processing efficiency can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a system diagram of a memory error handling method;

FIG. 2 is a block diagram of a processor to which the memory error processing method according to the embodiment of the present disclosure is applied;

FIG. 3 is a block diagram of another processor to which the memory error handling method of the present application is applied;

FIG. 4 is a flowchart illustrating a memory error handling method according to an embodiment of the present disclosure;

FIG. 5 is another flowchart of a memory error handling method according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of a memory error handling method according to an embodiment of the present disclosure;

FIG. 7 is a flowchart of a memory error handling method according to an embodiment of the present application;

FIG. 8 is a block diagram of a memory error processing apparatus according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The following describes a conventional error handling method.

Referring to fig. 1, fig. 1 is a system block diagram of a memory error handling method.

As shown in fig. 1, the system of the memory error handling method mainly includes a memory 110, a Central Processing Unit (CPU)120, and a Baseboard Management Controller (BMC)130, wherein the CPU120 is connected to the memory 110 and the BMC130, respectively.

The CPU120 detects whether the memory 110 generates a correctable error, and the CPU120 registers store various parameters corresponding to the memory slot positions, that is, the in-place information of the memory is configured according to the physical address, the logical address and the linear address of the memory, and the DIMM slot position of the corresponding memory, or the logical address and the physical address on the DIMM slot position can be found according to the in-place information. The BMC130 stores an alarm program which is configured according to the in-place information and is realized according to the BMC130 code, after the BMC130 reads the in-place information of the memory in the register of the CPU120, the alarm program analyzes the in-place information so as to locate the memory with correctable errors, and indicates a slot number corresponding to the memory according to the in-place information.

Furthermore, after the alarm program analyzes the in-place information so as to position the memory with correctable errors, the alarm program can inform a user of the errors in a notification mode, and can update and replace the error data in the error memory according to the correct data obtained by directly correcting the errors through the error correction function of the data check code.

It should be noted that, since the prior art cannot correct the uncorrectable error, only the system on chip (SoC) can be restarted, so the correctable error that can directly correct the error data into correct data through the data check code (error signal) is called recoverable error, and the uncorrectable error is called unrecoverable error. However, when the embodiment of the present application processes an uncorrectable error, an uncorrectable error in the prior art can be recovered without restarting the SoC, so the recoverable error described herein includes a correctable error and an uncorrectable error that can be recovered by the embodiment of the present application, that is, all errors that can be recovered by the memory error processing method provided by the embodiment of the present application.

However, the existing method for processing the correctable memory error through the participation of the BMC increases the complexity of the practical application, specifically, the error signal is collected by the CPU120 first, and the CPU120 notifies the BMC130 to process the error through the register, which increases the processing delay of the error, and the CPU120 cannot access the error memory before the memory error processing is finished, thereby affecting the system performance.

In addition to handling correctable errors by the BMC130, correctable errors may also be handled directly by the operating system, but if so, traffic data bandwidth may be occupied, continuity of computer services may be disrupted, and operating system involvement may further increase memory error handling time.

Moreover, for the data that cannot be corrected directly by the data check code (error signal), the prior art can only restart the SoC, and the SoC restart time is longer than the processing time for correcting the error, so that the processing efficiency of the memory error which is not high originally is further reduced.

In order to solve the foregoing problems, embodiments of the present application provide a memory error handling method. For convenience of description, a processor to which the memory error handling method provided in the embodiments of the present application is applied will be first introduced. Referring to fig. 2 and fig. 3, fig. 2 is a block diagram of a processor to which the memory error processing method according to the embodiment of the present disclosure is applied, and fig. 3 is a block diagram of another processor to which the memory error processing method according to the embodiment of the present disclosure is applied.

As shown in fig. 2, the processor used in the memory error processing method provided in the embodiment of the present application includes device modules such as a CPU core 210, a memory controller 220, and a microprocessor 230, and each device module transmits a control signal through a control network and transmits data information through a data network, thereby completely implementing the function of the processor.

The starting point or ending point of a memory access in the processor is the CPU core 210, while the module that interacts directly with the memory is the memory controller 220. The memory controller 220 sends memory commands, such as memory access commands and memory refresh commands, to the memory through an internal protocol conversion module, and receives memory read data or control information such as memory error signals from the memory. The memory controller is therefore the first device module in the processor to obtain a memory error signal after a memory error occurs. Specifically, the memory controller 220 either checks the data with the check code when reading memory data from the memory, thereby obtaining an error signal of the memory error, or receives from the memory a report of an error that the memory itself detected, thereby obtaining an error signal of the memory error.

Microprocessor 230 is used to initialize the registers of memory controller 220, to initialize the recovery command sequence for memory controller 220, and to reset the memory subsystem that includes memory controller 220.

As shown in fig. 3, the system of the memory error handling method includes a microprocessor 230, a system management unit 240 and a memory controller 220, wherein the memory controller 220 includes the modules shown in fig. 3 except the microprocessor 230 and the system management unit 240.

As previously described, microprocessor 230 may initialize the registers of memory controller 220, initialize the recovery command sequence for memory controller 220, and reset the memory subsystem; specifically, the registers of memory controller 220 and the recovery command sequence of memory controller 220 may be initialized at system startup; and restarting or resetting the memory subsystem when the memory error is not correctable according to the data check code of the memory error and no recovery command corresponding to the memory error exists in a pre-stored recovery command sequence.

The system management unit 240 is configured to parse the access operation of the microprocessor 230, and is further responsible for sending a signal generated by the memory controller 220 to the microprocessor 230 through the interface when the memory error is not correctable according to the data check code of the memory error and there is no recovery command corresponding to the memory error in the pre-stored recovery command sequence 340, so that the microprocessor 230 restarts or resets the memory subsystem.

Memory controller 220 specifically includes: an ECC module 2201, a read-write data queue 2202, a CRC module 2203, an error signal acquisition module 310, a memory error handling module 320, a retransmit command module 330, a recovery command sequence 340, an error reporting module 2204, a refresh control module 2205, an arbitration module 2206, a parity module 2207, and a DFI protocol conversion module 2208. Wherein:

the ECC module 2201 is responsible for calculating an ECC check code of write data, performing ECC check on read data, and reporting an error signal of a read data ECC error to the error signal acquisition module 310 when the read data ECC check error occurs;

a read-write data queue 2202 for temporarily storing write data and ECC check codes sent by the ECC module, and for temporarily storing read data and read ECC check codes;

a CRC module 2203, configured to implement CRC check code calculation of write data and CRC check of read data, and report an error signal of the read data CRC error to the error signal acquisition module 310 when the read data CRC error is detected;

the DFI protocol conversion module 2208 performs DFI protocol conversion on the arbitrated command, sends the converted command to a DFI interface, and reports an error signal of a memory error to the error signal acquisition module 310;

an error signal obtaining module 310, adapted to obtain an error signal of a memory error, where the error signal includes type information of the memory error and an error memory identifier of the memory error;

the recovery command sequence 340 is adapted to store recovery commands corresponding to the type information of the respective error signals, which are initialized by the microprocessor 230 at system start-up, as described above.

A memory error processing module 320, adapted to determine, according to the type information of the error signal acquired by the error signal acquiring module 310, whether the memory error can be corrected according to the data check code of the memory error, or whether a recovery command corresponding to the memory error exists in the pre-stored recovery command sequence 340; and, when the memory error can be corrected or a corresponding recovery command exists, to determine the memory in which the memory error occurred according to the error memory identifier and all pre-stored memory information, and to correct the memory according to the data check code or send the recovery command to the memory;

the retransmit command module 330 includes two sub-queues, a non-refresh command queue and a refresh command queue.

Wherein:

The non-refresh command queue stores the commands to be sent to the memory according to the error type of the memory error received by the memory controller. When the queue receives a feedback signal indicating that a command has been executed successfully, it removes that command from the queue.

The refresh command queue stores refresh commands issued by the refresh control module shown in the figure. So as not to violate the refresh cycle time of the memory, commands in the refresh command queue have a higher priority than commands in the non-refresh command queue.
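The priority rule between the two sub-queues can be sketched as follows. This is only an illustrative model under assumptions: the class name, the method names, and the choice to keep a non-refresh command in its queue until it is acknowledged are hypothetical and are not taken from the embodiment.

```python
from collections import deque


class RetransmitCommandModule:
    """Hypothetical model of the two sub-queues (all names are illustrative)."""

    def __init__(self):
        self.refresh_queue = deque()
        self.non_refresh_queue = deque()

    def push(self, command, is_refresh=False):
        # Store the command in the sub-queue that matches its kind.
        target = self.refresh_queue if is_refresh else self.non_refresh_queue
        target.append(command)

    def acknowledge(self, command):
        # Feedback that a non-refresh command executed successfully: drop it.
        self.non_refresh_queue.remove(command)

    def next_command(self):
        # Refresh commands are served first so the refresh cycle time
        # of the memory is not violated.
        if self.refresh_queue:
            return self.refresh_queue.popleft()
        if self.non_refresh_queue:
            # Assumption: the command stays queued until acknowledged.
            return self.non_refresh_queue[0]
        return None
```

In this sketch a pending refresh is always returned before any other stored command, which is the only property the description above requires.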

The error reporting module 2204 is adapted to generate a signal and send it to the system management unit 240 when the memory error cannot be corrected according to the data check code of the memory error and no recovery command corresponding to the memory error exists in the pre-stored recovery command sequence.

A refresh control module 2205 for initiating refresh commands periodically to maintain data in the memory (DRAM).

The arbitration module 2206 is adapted to arbitrate the command sent to the memory.

A parity module 2207 adapted to parity the command or address.

With reference to fig. 3, a detailed description is given below of a memory error handling method provided in an embodiment of the present application after understanding basic modules of a memory controller provided in an embodiment of the present application, specifically referring to fig. 4, where fig. 4 is a flowchart of the memory error handling method provided in the embodiment of the present application.

As shown in fig. 4, the memory error handling method provided in the embodiment of the present application is applicable to a memory controller, and may specifically include the following steps:

in step S310, an error signal of a memory error is obtained.

When a memory error is handled, an error signal of the memory error needs to be acquired first.

It is to be understood that acquiring an error signal of a memory error, as described herein, means that the error signal is acquired when a memory error occurs; when no memory error occurs, no error signal is acquired.

Specifically, the error signal includes type information of the memory error and an error memory identifier of the memory error, where the type information is used to subsequently determine a processing manner of the corresponding memory error, and the error memory identifier is used to determine which memory specifically has the error, and determine an object to be subsequently modified or restored.

Based on the source of the memory error, as mentioned above, the error signal of the memory error obtained by the memory controller may include a read data ECC error signal, a read data CRC error signal, a write data CRC error signal, a command or address parity error, wherein, for the read data ECC error signal and the read data CRC error signal, a hardware module (ECC module 2201 and CRC module 2203) performing the corresponding check code checking function is provided in the memory controller.

For each type of error signal, an error signal connection line for connecting a hardware module generated by the corresponding error signal can be preset and configured, and when an error occurs, an effective signal is generated on the error signal connection line, so that the corresponding error signal is acquired.

Such as: for the error reporting signal of the read data ECC, an ECC module is arranged in the memory controller, an error signal connecting line connected with the ECC module and used for acquiring the error reporting signal of the read data ECC is arranged, and when the error reporting signal of the read data ECC is generated, an effective signal, such as a high level, is generated on the error signal connecting line, so that the error signal is acquired.

Therefore, specifically, the step of acquiring the error signal of the memory error may be:

and acquiring effective signals of the error signal connecting lines which are configured and connected in advance.

Depending on the type information of the error signals, the error signal connection lines may include a connection line for the read data ECC error signal, a connection line for the read data CRC error signal, a connection line for the write data CRC error signal, and a connection line for the command or address parity error.

It should be noted that different types of errors may share the same error signal connection line. Specifically, in a DDR4/DDR5 memory subsystem, the write data CRC error and the command or address parity error are reported on the same error signal line. In this case, distinguishing information must be configured in advance to tell the two error signals apart. For example, the error signal may be a low-level pulse, and the memory may use different low-level pulse widths to represent the two errors: a shorter low-level pulse width represents a write data CRC error, and a longer low-level pulse width represents a command or address parity error. The memory controller can count the low-level time of the signal to determine which error has occurred.
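The pulse-width distinction can be illustrated with a minimal sketch. The threshold value, the per-clock sampling model, and the returned labels are assumptions made for illustration only; the actual boundary between a "short" and a "long" pulse depends on the memory device and is not stated here.

```python
# Hypothetical threshold in controller clock cycles (not from the source).
SHORT_PULSE_MAX_CYCLES = 10


def classify_shared_error_line(samples):
    """Classify the error on a shared error signal line by low-level width.

    `samples` is a sequence of 0/1 levels, assumed to be sampled once per
    clock cycle on the shared line (1 = idle high, 0 = asserted low).
    """
    longest_low = 0
    current_low = 0
    for level in samples:
        # Count consecutive low samples; reset when the line goes high.
        current_low = current_low + 1 if level == 0 else 0
        longest_low = max(longest_low, current_low)

    if longest_low == 0:
        return None  # no valid signal, so no error was reported
    if longest_low <= SHORT_PULSE_MAX_CYCLES:
        return "WRITE_CRC_ERROR"  # shorter pulse
    return "CMD_ADDR_PARITY_ERROR"  # longer pulse
```

For example, under these assumptions `classify_shared_error_line([1, 0, 0, 0, 0, 1])` is classified as a write data CRC error, while a run of 40 low samples is classified as a command or address parity error.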

The signal type of the acquired error signal corresponds to whichever error signal connection line generates a valid signal.

In some embodiments, the valid signal may be set in a register by software at initialization of the memory controller; in other embodiments, it may be burned directly into the memory controller hardware at manufacture. The software approach facilitates subsequent maintenance and updating, while the hardware approach helps ensure reliability.

Of course, when the step of obtaining the error signal of the memory error in the memory error processing method according to the embodiment of the present application is executed, each hardware module and the error signal connection line are already set to be completed, and the configuration of the valid signal is also already completed.

In step S320, it is determined whether the memory error can be corrected according to the data check code of the memory error according to the type information of the error signal, if so, step S340 is executed, and if not, step S330 is executed.

When an error signal of a memory error is acquired, further determining the type of the memory error according to the type information of the error signal, if the type of the memory error cannot be corrected, executing step S330, further determining whether a recovery command corresponding to the memory error exists in a pre-stored recovery command sequence according to the type information of the error signal, if the type of the memory error can be corrected, executing step S340, and determining the memory where the memory error occurs according to the error memory identifier and all pre-stored memory information.

Specifically, the memory errors that can be corrected according to the data check code of the memory error may include ECC errors that can be corrected.

In step S330, it is determined whether there is a recovery command corresponding to the memory error in a pre-stored recovery command sequence according to the type information of the error signal, if so, step S340 is executed, otherwise, step S380 is executed.

And if the memory error is determined to be an error which can not be corrected according to the data check code of the memory error through judgment, further determining whether a recovery command corresponding to the memory error exists in a pre-stored recovery command sequence.

In particular, the recovery command sequence may be pre-stored in the memory controller, and in one embodiment may be written upon initialization of the memory controller by the microprocessor.

It is easy to understand that, since the recovery command sequence corresponds to the error type of the memory error, a search needs to be performed in the recovery command sequence stored in advance based on the error type. For ease of understanding, the following are examples:

When a read data CRC error signal is acquired, the corresponding recovery command found in the recovery command sequence is a precharge command, which closes all activated rows of the memory and thereby restores the memory to a normal working state.

When the write data CRC error signal is acquired, a corresponding pre-charge command, a command for reading a write CRC error state bit in a DRAM mode register and a command for clearing the write CRC error state bit in the DRAM mode register are searched in a recovery command sequence, so that the memory is recovered to a normal working state through the execution of each command.
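The lookup by error type can be sketched as a simple table. The two entries mirror the examples above; the dictionary layout, the error type labels, and the command mnemonics are hypothetical names chosen for illustration, not the encoding used by the memory controller.

```python
# Hypothetical pre-stored recovery command sequence, keyed by error type.
RECOVERY_COMMAND_SEQUENCE = {
    "READ_CRC_ERROR": [
        "PRECHARGE_ALL",  # close all activated rows
    ],
    "WRITE_CRC_ERROR": [
        "PRECHARGE_ALL",
        "READ_WRITE_CRC_STATUS_BIT",   # read the status bit in the DRAM mode register
        "CLEAR_WRITE_CRC_STATUS_BIT",  # clear the status bit in the DRAM mode register
    ],
}


def find_recovery_commands(error_type):
    """Return the recovery commands for `error_type`, or None if none exist."""
    return RECOVERY_COMMAND_SEQUENCE.get(error_type)
```

A result of `None` corresponds to the case where no recovery command exists for the error type, so the error cannot be handled by this path.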

Of course, in some embodiments, the recovery command sequence may be written by software at initialization of the memory controller; in other embodiments, the recovery command sequence may be burned directly into the memory controller when the processor is produced. The software approach facilitates subsequent maintenance and updating, and the hardware approach helps ensure reliability.

It is easily understood that in other embodiments, the execution sequence of step S320 and step S330 may be adjusted as needed, and this embodiment only exemplifies one case.

In step S340, the memory where the memory error occurs is determined according to the error memory identifier and all pre-stored memory information.

When it is determined that the memory error can be corrected according to the data check code of the memory error or it is determined that a recovery command corresponding to the memory error exists in a pre-stored recovery command sequence according to the type information of the error signal, the memory in which the memory error occurs is further determined according to the error memory identifier in the error signal and all pre-stored memory information corresponding to the memory controller.

It should be noted that the total memory information described herein refers to the identifiers of all the memories managed by the memory controller and the memory address information corresponding to each memory identifier, and the corresponding memory may be determined based on the memory address information, and then the recovery command is sent to the corresponding memory in the following.

In an embodiment, the memory address information may be a slot number corresponding to the aforementioned memory.

Because all the memory information is stored in the memory controller in advance, after the wrong memory identifier is obtained, the memory controller searches according to the wrong memory identifier, and further can determine the memory.

In step S350, the memory is corrected according to the data check code or the recovery command is sent to the memory where the memory error occurs.

And after the memory with the memory error is determined, correcting the memory or sending the recovery command to the memory with the memory error according to the data check code.

It is easy to understand that, when the memory error is one that can be corrected according to its data check code, the memory is corrected according to the data check code, that is, the data stored in the memory is corrected; and when a recovery command corresponding to the memory error exists in the pre-stored recovery command sequence, the recovery command is sent to the memory in which the memory error occurred, so that the memory is restored to a normal working state.

If the memory can be restored to the normal working state by the recovery command, it can continue to receive commands from the memory controller without the chip being restarted, and the recovery takes only a short time.
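Steps S320 to S350 can be summarized in a short sketch. It is a simplified software model of logic that the embodiment places in the memory controller; the table contents, the identifiers such as "dimm0", and the returned action labels are all hypothetical.

```python
# Hypothetical pre-stored data (illustrative values only).
CORRECTABLE_TYPES = {"CORRECTABLE_ECC_ERROR"}
RECOVERY_COMMAND_SEQUENCE = {"READ_CRC_ERROR": ["PRECHARGE_ALL"]}
ALL_MEMORY_INFO = {"dimm0": 0, "dimm1": 1}  # memory identifier -> slot number


def handle_memory_error(error_type, error_memory_id):
    """Return (action, slot, commands) for one acquired error signal."""
    # S320: can the error be corrected with the data check code?
    if error_type in CORRECTABLE_TYPES:
        action, commands = "CORRECT_WITH_CHECK_CODE", []
    else:
        # S330: is there a pre-stored recovery command for this type?
        commands = RECOVERY_COMMAND_SEQUENCE.get(error_type)
        if commands is None:
            # S380: neither correctable nor recoverable by command.
            return ("REPORT_UNRECOVERABLE", None, [])
        action = "SEND_RECOVERY_COMMANDS"

    # S340: locate the memory from the identifier and the stored information.
    slot = ALL_MEMORY_INFO[error_memory_id]

    # S350: correct the memory or send the recovery commands to it.
    return (action, slot, commands)
```

Under these assumptions, `handle_memory_error("READ_CRC_ERROR", "dimm0")` selects the precharge command for slot 0, whereas an error type with no table entry falls through to the unrecoverable path without any memory lookup.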

In another embodiment, in order to reduce the damage caused by the memory error and improve the efficiency of processing the command after the memory error occurs, please refer to fig. 4, where the method for processing the memory error provided in the embodiment of the present application may further include:

in step S360, it is determined whether the memory recovers to a normal operating state, if yes, step S371 is executed, and if no, step S380 is executed.

Sending the recovery command to the memory in which the error occurred is itself an interaction with the memory. The recovery command may therefore be executed smoothly, so that the memory returns to the normal working state, or a further memory error may occur; that is, a memory that has entered a state in which it accepts only recovery commands may suffer another memory error while receiving the recovery command. Therefore, after the recovery command is sent, it is further determined whether the recovery command has been completely sent and executed; only then can the memory be regarded as restored to the normal working state. When the memory has not returned to the normal state, that is, when a new memory error occurs while the recovery command is being sent and the memory cannot execute it, further processing is required. In a specific embodiment, the memory error may then be treated as an unrecoverable error, and step S380 is executed.

In step S371, the stored retransmission command is retransmitted to the memory.

When the memory returns to the normal working state, in order to improve the command processing efficiency after the memory error occurs, the stored resending command can be sent to the memory again.

It is easily understood that a retransmission command is a memory command, stored in the memory controller, that has been sent to the memory but has not yet been executed. In some embodiments, the retransmission commands may be stored in a retransmission command queue comprising a queue and a queue pointer control circuit. A command that wins arbitration is sent to the memory and is also stored in the retransmission command queue, and the queue write pointer is incremented; when the command is successfully executed, the read pointer is incremented; when a memory error occurs, the commands between the read pointer and the write pointer are retransmitted.

Arbitration here is performed by an arbitration module in the memory controller, which selects among the memory commands waiting to be sent according to a certain rule; only one memory command is sent per arbitration.
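The read-pointer and write-pointer behaviour described above can be sketched in software as a small circular buffer. This is an illustrative model only: the class name `RetransmitQueue`, its methods, and the fixed depth are assumptions, and the embodiment describes a hardware queue with a pointer control circuit rather than this code.

```python
class RetransmitQueue:
    """Circular buffer of commands sent to the memory but not yet executed."""

    def __init__(self, depth: int) -> None:
        self._slots = [None] * depth
        self._depth = depth
        self._write = 0  # advanced when a command is sent and stored
        self._read = 0   # advanced when a command is confirmed executed

    def on_sent(self, command: str) -> None:
        """Store a command that won arbitration and was sent to the memory."""
        if self._write - self._read >= self._depth:
            raise OverflowError("retransmission queue is full")
        self._slots[self._write % self._depth] = command
        self._write += 1

    def on_executed(self) -> None:
        """Retire the oldest outstanding command after successful execution."""
        if self._read == self._write:
            raise IndexError("no outstanding command")
        self._read += 1

    def pending(self) -> list:
        """Commands between the read and write pointers, oldest first."""
        return [self._slots[i % self._depth]
                for i in range(self._read, self._write)]
```

When a memory error is reported, `pending()` gives the commands that would be retransmitted once the memory is back in the normal working state.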

It should be noted that the memory controller does not store every memory command that has been sent to the memory but not yet executed. Instead, it selectively stores specific memory commands, namely those corresponding to the error signals enabled when the memory controller is initialized. Specifically, when only the write data CRC check is enabled, only the read and write commands among the sent-but-unexecuted memory commands are stored, because other memory commands (such as the DRAM register operation command MRS) are not protected by the write data CRC check. It is easy to understand that, when a memory error can neither be corrected according to its data check code nor has a corresponding recovery command in the pre-stored recovery command sequence, the memory controller determines that the error is unrecoverable; no memory recovery or subsequent retransmission takes place, so the memory command that caused the memory error does not need to be stored.

Therefore, by storing the memory commands that have not been completely executed and retransmitting them after the memory returns to the normal working state, the effect of the memory being unable to receive and execute new memory commands while in the error state can be eliminated, the loss of those memory commands is prevented, and the damage caused by the memory error is minimized.
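The selective storage rule can be illustrated with a small filter. The mapping `PROTECTED_KINDS`, the check name `write_crc`, and the function `should_store` are hypothetical; only the write data CRC case (read and write commands stored, MRS not stored) is taken from the text.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from an enabled check to the command kinds it covers.
PROTECTED_KINDS: Dict[str, Set[str]] = {"write_crc": {"READ", "WRITE"}}


def should_store(command_kind: str, enabled_checks: Iterable[str]) -> bool:
    """Store a sent command for retransmission only if some check enabled
    at initialization covers that kind of command."""
    return any(command_kind in PROTECTED_KINDS.get(check, set())
               for check in enabled_checks)
```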

In one embodiment, the resending command includes a refresh command and a non-refresh command, and the refresh command refers to a command for refreshing the memory.

When the retransmission commands are transmitted, they are resent to the memory according to a refresh command priority principle, which means that refresh commands are sent in preference to non-refresh commands.

Therefore, the data stored in the memory can be ensured not to be lost due to untimely refreshing.

Further, in a specific embodiment, the step of resending the refresh command and the non-refresh command to the memory according to the refresh command priority principle may include:

resending each refresh command to the memory according to a time sequence priority principle;

and retransmitting each non-refresh command to the memory according to the time sequence priority principle, wherein the time sequence priority principle means that a memory command sent to the memory earlier is retransmitted in preference to a memory command sent to the memory later.

That is, the resending command is resent to the memory according to a timing priority principle on the premise of not violating a refresh command priority principle, wherein the timing priority principle refers to that the memory command sent to the memory first is resent in preference to the memory command sent to the memory later.

In this way, it is further ensured that the order in which the retransmission commands are resent still conforms to the arbitration rules, thereby preventing timing errors.
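The two ordering principles combine into a single sort key: refresh commands first, and original sending order within each group. The sketch below is illustrative; the `Command` type, its fields, and the function name `retransmission_order` are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Command:
    name: str
    sent_order: int          # position in the original sending sequence
    is_refresh: bool = False


def retransmission_order(pending: List[Command]) -> List[Command]:
    """Refresh commands first; within each group, earlier-sent commands first."""
    # False sorts before True, so `not is_refresh` puts refresh commands first.
    return sorted(pending, key=lambda c: (not c.is_refresh, c.sent_order))
```

With pending commands WR0, REF1, RD2, REF3 (in sending order), this sketch resends REF1, REF3, WR0, RD2.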

In one embodiment, in order to facilitate the control unit to know the execution status of the memory, in step S360, when it is determined that the memory has recovered to the normal operating state, the following steps may be further performed:

in step S372, the memory error generated is fed back as a corrected error.

The corrected error is fed back to the microprocessor controlling the memory controller and further to the operating system.

Therefore, after a recoverable error is corrected, the error processing method provided by the embodiment of the application feeds the corrected error back to the operating system, so that the control unit (microprocessor) and the like can learn the execution status of the memory, which facilitates subsequent processing.

In another specific embodiment, when the conclusion of step S330 is no, that is, when it is determined according to the type information of the error signal that no recovery command corresponding to the memory error exists in the pre-stored recovery command sequence, or when the conclusion of step S360 is no, that is, the memory has not returned to the normal working state, step S380 is executed: a request for restarting or resetting the memory subsystem is sent.

The memory subsystem includes the memory, the memory controller, and the physical port (PHY) between them. It is easily understood that when no recovery command corresponding to the memory error exists in the recovery command sequence, or when a recovery command has been sent but the memory has not recovered, the memory error cannot be recovered by a recovery command, and the error state of the error memory can be ended only by restarting or resetting the memory subsystem.

It should be noted that, in the prior art, the whole SoC needs to be restarted, whereas the memory error processing method provided in the embodiment of the present application handles the memory error within the memory subsystem alone. The memory error state can thus be ended by restarting only the memory subsystem, which avoids affecting the operation of other modules in the SoC.

It should be noted that the memory error handling method provided in the embodiment of the present application is not limited by the type of the memory subsystem, and may be applied to various memory subsystems, including DDR4 DRAM, UDIMM, RDIMM, LRDIMM, NVDIMM-N, and DDR5 DRAM, UDIMM, RDIMM, LRDIMM, and NVDIMM-N.

Therefore, by sending a request for restarting or resetting the memory subsystem, a memory error that cannot be recovered through a recovery command can still be handled.

In an embodiment, please refer to fig. 5, wherein fig. 5 is another flowchart of a memory error handling method according to an embodiment of the present disclosure.

It should be noted that most of the content of fig. 5 is similar to that of fig. 4, and the similar content is not described again. As shown in fig. 5, the specific steps of the memory error processing method provided in the embodiment of the present application may include:

in step S410, an error signal of a memory error is obtained.

For details of step S410, please refer to the description of step S310 shown in fig. 4, which is not repeated herein.

In step S420, sending the memory command to the memory is stopped.

When the memory controller acquires the error signal, a memory error has occurred: the memory in which it occurred enters an error state and cannot receive any command other than a recovery command from the memory controller. Continuing to send commands to the memory at this point would only increase the number of retransmission commands, so the memory controller may stop sending memory commands to the memory.

Thus, the number of memory commands lost by being sent to the error memory after a memory error occurs, or the number of retransmission commands, can be reduced, thereby reducing the impact of the memory error.

In step S430, it is determined whether the memory error can be corrected according to the data check code of the memory error according to the type information of the error signal, if so, step S450 is executed, and if not, step S440 is executed.

In step S440, according to the type information of the error signal, it is determined whether there is a recovery command corresponding to the memory error in a pre-stored recovery command sequence, if yes, step S450 is executed, otherwise, step S490 is executed.

In step S450, the memory where the memory error occurs is determined according to the error memory identifier and all pre-stored memory information.

In step S460, the memory is corrected or the recovery command is sent to the memory with the memory error according to the data check code.

In step S470, it is determined whether the memory is restored to a normal operating state, if yes, step S480 is executed, and if not, step S490 is executed.

In step S480, the stored resend command is resent to the memory.

In step S490, a request to restart or reset the memory subsystem is sent.

For details of steps S430 to S490, please refer to the description of steps S320 to S380 shown in fig. 4, which is not repeated herein.
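The flow of steps S420 to S490 can be summarized in a short sketch that returns the sequence of actions taken. It is a simplified software model under stated assumptions: the function `handle_error`, its parameters, the action strings, and the `attempt` callback (which stands for performing the correction or sending the recovery command and reporting whether the memory returned to the normal working state) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set


def handle_error(error_type: str,
                 correctable_types: Set[str],
                 recovery_commands: Dict[str, str],
                 attempt: Callable[[Optional[str]], bool],
                 pending: List[str]) -> List[str]:
    """Return the ordered list of actions for one error signal."""
    log = ["stop_sending"]                       # S420
    if error_type in correctable_types:          # S430: correctable by check code
        command = None
    elif error_type in recovery_commands:        # S440: recovery command exists
        command = recovery_commands[error_type]
    else:
        log.append("reset_memory_subsystem")     # S490
        return log
    log.append("locate_memory")                  # S450
    log.append("correct" if command is None else f"send:{command}")  # S460
    if attempt(command):                         # S470: back to normal state?
        log.extend(f"resend:{c}" for c in pending)  # S480
    else:
        log.append("reset_memory_subsystem")     # S490
    return log
```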

In an embodiment, please refer to fig. 6, where fig. 6 is a flowchart illustrating a memory error handling method according to an embodiment of the present disclosure.

As shown in fig. 6, the steps of the memory error handling method may include:

in step S51, an error signal of a memory error is obtained.

In step S52, it is determined whether the memory error can be corrected according to the data check code of the memory error according to the type information of the error signal, if yes, step S53 is performed, and if no, step S54 is performed.

In step S53, the memory where the error occurs is determined according to the error memory identifier and all pre-stored memory information.

In step S54, it is determined whether there is a recovery command corresponding to the memory error in a pre-stored recovery command sequence according to the type information of the error signal, if yes, step S55 is executed, and if no, step S510 is executed.

For details of steps S51-S54, please refer to the descriptions of steps S310-S340 shown in FIG. 4, which are not repeated herein.

In step S55, the cumulative number of errors in acquiring the error signal is recorded.

Each time a memory error is judged to be a recoverable error, the accumulated error count is updated; specifically, it may be incremented by 1.

Thus, the number of times of error occurrence can be clearly known.

In some embodiments, after the step S55 of recording the accumulated number of errors of acquiring the error signal, the method further includes:

in step S56, it is determined whether the accumulated error count is less than the threshold, and if so, step S53 is performed, and if not, step S510 is performed.

After recording the number of accumulated errors each time, it is determined whether the accumulated errors reach a threshold, and the threshold is preset.

This effectively prevents the step of obtaining the memory error signal from being repeated too many times, and even prevents deadlock. A deadlock arises, for example, when a memory error caused by recovery command A requires recovery command B, which in turn causes a memory error that requires recovery command A, so that the memory controller repeats the memory error handling operation indefinitely.

In step S57, the memory is corrected or the recovery command is sent to the memory where the memory error occurred according to the data check code or the error signal.

Please refer to step S350 shown in fig. 3 for details of step S57, which are not repeated herein.

In some embodiments, after step S57, the method further includes:

in step S58, it is determined whether the memory has recovered to a normal operating state, if yes, step S591 is performed, and if no, step S51 is performed again.

If the memory has not recovered to the normal operating state, step S51 is executed again, that is, the error signal of a memory error is obtained again. The reacquired error signal corresponds to a memory error caused by the sending of the recovery command. Specifically, when an error signal of a memory error is reacquired during the sending of the recovery command, whether a recovery command corresponding to that memory error can be obtained is determined again according to the type information of the error signal and the pre-stored recovery command sequence.

Therefore, when the memory error caused by the recovery command is also a recoverable error, the memory error processing method provided by the embodiment of the application can continue processing until the memory is restored to the normal working state or the originally occurring memory error is re-determined to be an unrecoverable error.

In step S591, the stored resend command is sent to the memory again.

Please refer to the description of step S371 shown in fig. 4 for details of step S591, which are not repeated herein.

In one embodiment, in order to facilitate the control unit to know the execution status of the memory, in step S58, when the memory is determined to have recovered to the normal operation state, the following steps may be further performed:

in step S592, the accumulated error number is set to an initial value.

The initial value is 0.

Thus, after the count is reset to 0, the accumulated error count is recorded as 1 the next time a memory error occurs. The accumulated error count therefore represents the number of times the memory controller has attempted to recover the memory since a memory command other than a recovery command caused a memory error that was judged recoverable, which ensures that the threshold on the accumulated error count acts as the maximum number of recovery attempts.

In another embodiment, when the conclusion of step S54 is no, that is, when it is determined that there is no recovery command corresponding to the memory error in the pre-stored recovery command sequence according to the type information of the error signal, or when the conclusion of step S56 is no, that is, the number of accumulated errors reaches the threshold value, in step S510, a request for restarting or resetting the memory subsystem is sent.

For details of step S510, please refer to the description of step S380 shown in fig. 4, which is not repeated herein.

It is easy to understand that an error that leads to a request for restarting or resetting the memory subsystem is an unrecoverable error. Therefore, when a memory error caused by a memory command other than a recovery command is re-determined to be unrecoverable because the accumulated error count has reached the threshold, the request for restarting or resetting the memory subsystem must likewise be sent.

Specifically, re-determining the error caused by a memory command other than a recovery command as an unrecoverable error means the following. When the accumulated error count was 1, the memory error that initially put the memory into the error state was confirmed to have a corresponding recovery command in the pre-stored recovery command sequence and was therefore classified as a recoverable error. However, the recovery commands subsequently sent to handle that error themselves caused further memory errors, until the accumulated error count reached its threshold. The memory error caused by the recovery command at that point is treated as an unrecoverable error, and a request for restarting or resetting the memory subsystem is sent according to the unrecoverable-error flow. In effect, the memory error recorded when the count was 1 is re-determined to be an unrecoverable error.

Therefore, recovery of the error memory is stopped when the accumulated error count reaches the threshold, and the memory subsystem is then restarted or reset, which effectively prevents an excessive number of recovery attempts and even deadlock.
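The threshold-bounded retry loop of fig. 6 can be sketched as follows. This is an illustrative model, not the embodiment itself: the function `recover_with_limit` and the `send_recovery` callback (which is assumed to return `None` when the memory is back to normal, or the type of the new error caused by the recovery command) are hypothetical, and the data-check-code branch of step S52 is omitted for brevity.

```python
from typing import Callable, Dict, Optional, Tuple


def recover_with_limit(first_error: str,
                       recovery_commands: Dict[str, str],
                       send_recovery: Callable[[str], Optional[str]],
                       threshold: int) -> Tuple[str, int]:
    """Return (outcome, accumulated error count) for one error episode."""
    accumulated_errors = 0
    error = first_error                          # S51
    while True:
        command = recovery_commands.get(error)   # S54
        if command is None:
            return "reset", accumulated_errors   # S510
        accumulated_errors += 1                  # S55
        if accumulated_errors >= threshold:      # S56: no longer below threshold
            return "reset", accumulated_errors   # S510
        new_error = send_recovery(command)       # S53 and S57
        if new_error is None:                    # S58: normal state restored
            accumulated_errors = 0               # S592: back to initial value
            return "recovered", accumulated_errors  # S591 (resend) would follow
        error = new_error                        # S58 "no": back to S51
```

In the deadlock example from the text, where command A causes an error needing command B and vice versa, the loop in this sketch ends with a reset once the count reaches the threshold instead of cycling forever.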

In an embodiment, please refer to fig. 7, wherein fig. 7 is a further flowchart of a memory error handling method according to an embodiment of the present disclosure.

As shown in fig. 7, the steps of the memory error handling method may include:

in step S61, an error signal of a memory error is obtained.

In step S62, it is determined whether the memory error can be corrected according to the data check code of the memory error according to the type information of the error signal, if so, step S64 is performed, and if not, step S63 is performed.

In step S63, it is determined whether there is a recovery command corresponding to the memory error in a pre-stored recovery command sequence according to the type information of the error signal, if so, step S64 is executed, otherwise, step S611 is executed.

In step S64, the memory where the memory error occurs is determined according to the error memory identifier and all the pre-stored memory information.

In step S65, the memory is corrected according to the data check code or the recovery command is sent to the memory in which the memory error occurred.

In step S66, it is determined whether the memory has recovered to a normal operating state, if yes, step S67 is executed, and if no, step S611 is executed.

For details of steps S61-S66, please refer to the description of steps S310-S360 shown in FIG. 4, which is not repeated herein.

In step S67, the accumulated number of times of recovery of the memory to the normal operating state is recorded.

When the recovery command has been completely sent and executed and the memory has returned to the normal working state, the successful recovery is added to the accumulated recovery count.

In step S68, it is determined whether the accumulated recovery count is less than a preset threshold; if so, step S69 is performed, and if not, step S610 is performed.

After the memory is successfully restored and the accumulated recovery count is recorded, the count is compared with the preset threshold to decide the subsequent step; specifically, the recording can be performed by software.

The temperature of the memory may rise after prolonged operation. A change in memory temperature may change the impedance of the input/output interface and cause signal reflection, which degrades signal quality and, in severe cases, causes data errors. It is therefore necessary to keep count of the memory errors that occur and, when the accumulated recovery count reaches the threshold, to take measures to eliminate the temperature-induced impedance drift.

In step S69, the stored resend command is resent to the memory.

For details of step S69, please refer to the description of step S371 shown in fig. 4, which is not repeated herein.

In step S610, a memory impedance calibration operation is performed.

When the accumulated recovery count of the memory reaches the threshold, measures need to be taken to eliminate the temperature-induced impedance drift; specifically, a memory impedance calibration operation may be performed. While impedance calibration is being performed on the error memory, that memory cannot receive memory commands, so whether the accumulated recovery count has reached the threshold must be determined first, before the stored retransmission commands are resent to the memory.
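The decision in steps S67 and S68 amounts to a counter compared against a preset threshold. The sketch below is illustrative only: the class `RecoveryCounter` and the returned step names are hypothetical, and whether the count is cleared after calibration is not stated in the text, so the sketch leaves it unchanged.

```python
class RecoveryCounter:
    """Counts successful recoveries and selects the step that follows."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.accumulated_recoveries = 0

    def on_recovered(self) -> str:
        """Call when the memory has returned to the normal working state."""
        self.accumulated_recoveries += 1                 # S67
        if self.accumulated_recoveries < self.threshold:  # S68
            return "resend_stored_commands"              # S69
        # S610: calibrate first; the text suggests the stored commands
        # are resent only after this check has been made.
        return "impedance_calibration"
```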

In step S611, a request to restart or reset the memory subsystem is sent.

For details of step S611, please refer to step S380 described in fig. 4, which is not described herein again.

In order to solve the foregoing problems, an embodiment of the present application further provides a memory error processing apparatus, which may be regarded as a functional module that is required to implement the memory error processing method provided in the embodiment of the present application. The contents of the apparatus described herein are referred to in correspondence with the contents of the method described above.

Referring to fig. 8, fig. 8 is a block diagram of a memory error processing apparatus according to an embodiment of the present disclosure.

The device can be applied to the memory error processing method provided by the embodiment of the application. As shown in fig. 8, the memory error processing apparatus provided in the embodiment of the present application may include:

the memory error processing module 320 is adapted to determine, when it is determined that the memory error can be corrected according to the data check code of the memory error or it is determined that a recovery command corresponding to the memory error exists in a pre-stored recovery command sequence according to the type information of the error signal, a memory in which the memory error occurs according to the erroneous memory identifier and all pre-stored memory information, and correct the memory according to the data check code or send the recovery command to the memory.

It can be seen that, in the memory error processing apparatus provided in the embodiment of the present application, all memory information is pre-stored in the memory controller, so the memory in which a memory error occurred can be determined from the error memory identifier of that error. Thus, when it is determined from the type information of the error signal that the memory error can be corrected according to its data check code, the corresponding memory can be corrected directly according to the data check code. In addition, recovery commands corresponding to memory error types can be stored in the memory controller in advance, so that when a recovery command corresponding to the memory error is found in the pre-stored recovery command sequence, that command can be used to restore the memory from the error state to the normal working state; a memory error that cannot be corrected directly according to the error signal can thereby be recovered without restarting the SoC. The whole process mainly involves only the memory controller and the memory, the signal transmission path is short, and the processing logic is simple, so the time needed for the memory to return from the error state to the normal working state is short and memory error processing efficiency is improved.

In some embodiments, further comprising:

the resending command module 330 is adapted to resend the resending commands that have been stored to the memory when the memory returns to a normal operating state, where the resending commands include a memory command that is sent to the memory but is not executed completely.

In some embodiments, the retransmit command includes a refresh command and a non-refresh command;

a resend command module 330, adapted to resend the resend command to the memory, comprising:

and retransmitting the refresh command and the non-refresh command to the memory according to a refresh command priority principle, wherein the refresh command priority principle refers to the memory command transmission with the refresh command prior to the non-refresh command.

In some embodiments, each of the refresh commands is resent to the memory according to a time-sequence priority principle;

In some embodiments, further comprising:

the impedance calibration module 770 is adapted to record the accumulated recovery times of the memory recovering to the normal operating state when the memory recovers to the normal operating state, and execute the memory impedance calibration operation when the accumulated recovery times reaches a preset threshold.

In some embodiments, further comprising:

a memory subsystem restart or reset module 740 adapted to send a request to restart or reset the memory subsystem when it is determined that there is no recovery command corresponding to the memory error in the pre-stored recovery command sequence according to the type information of the error signal.

In some embodiments, the memory subsystem restart or reset module 740, adapted to send a request to restart or reset the memory subsystem, comprises:

sending a request to a control unit to restart or reset the memory subsystem, the control unit adapted to initialize registers of the memory subsystem.

In some embodiments, the memory error handling module 320 is further adapted to:

and when the memory is recovered to the normal working state, feeding back the memory error which is generated as a corrected error.

In some embodiments, further comprising:

the memory command sending module 750 is adapted to stop sending the memory command to the memory after the error signal obtaining module obtains the error signal of the memory error.

In some embodiments, the memory error processing module 320 is further adapted to determine, when an error signal of a memory error is acquired again, whether to acquire a recovery command corresponding to the memory error again according to the type information of the error signal and a pre-stored recovery command sequence.

In some embodiments, further comprising:

the accumulated error number recording module 760 is adapted to record the accumulated number of errors of acquiring the error signal after the error signal acquisition module acquires the error signal of the memory error;

the memory error processing module 320 is further adapted to send the recovery command to the memory in which the memory error occurs when the accumulated error times do not reach the time threshold and the recovery command corresponding to the memory error is determined to exist in the pre-stored recovery command sequence according to the type information of the error signal.

In some embodiments, the memory subsystem restart or reset module 740 is further adapted to send a request to restart or reset the memory subsystem when the accumulated number of errors reaches a threshold number of times.

In some embodiments, the accumulated error number recording module 760 is further adapted to set the error number as an initial number after the step of resending the command to the memory when the memory recovers to a normal operating state, where the initial number is 0.

An embodiment of the present application further provides a memory controller, which may include the memory error processing apparatus provided in the embodiment of the present application.

Therefore, in the memory controller provided in the embodiment of the present application, all memory information is pre-stored in the memory controller, so the memory in which a memory error occurred can be determined from the error memory identifier of that error. Thus, when it is determined from the type information of the error signal that the memory error can be corrected according to its data check code, the corresponding memory can be corrected directly according to the data check code. In addition, recovery commands corresponding to memory error types can be stored in the memory controller in advance, so that when a recovery command corresponding to the memory error is found in the pre-stored recovery command sequence, that command can be used to restore the memory from the error state to the normal working state; a memory error that cannot be corrected directly according to the error signal can thereby be recovered without restarting the SoC. The whole process mainly involves only the memory controller and the memory, the signal transmission path is short, and the processing logic is simple, so the time needed for the memory to return from the error state to the normal working state is short and memory error processing efficiency is improved.

An embodiment of the present application further provides a processor, which may include the memory error processing apparatus provided in the embodiment of the present application.

It can be seen that, in the processor provided in the embodiment of the present application, all memory information is pre-stored in the memory controller, so the memory in which a memory error occurred can be determined from the error memory identifier of that error. Thus, when it is determined from the type information of the error signal that the memory error can be corrected according to its data check code, the corresponding memory can be corrected directly according to the data check code. In addition, recovery commands corresponding to memory error types can be stored in the memory controller in advance, so that when a recovery command corresponding to the memory error is found in the pre-stored recovery command sequence, that command can be used to restore the memory from the error state to the normal working state; a memory error that cannot be corrected directly according to the error signal can thereby be recovered without restarting the SoC. The whole process mainly involves only the memory controller and the memory, the signal transmission path is short, and the processing logic is simple, so the time needed for the memory to return from the error state to the normal working state is short and memory error processing efficiency is improved.

While various embodiments have been described above, the alternatives described in those embodiments can be readily combined and cross-referenced, provided they do not conflict, to extend the range of possible embodiments, all of which are to be regarded as embodiments disclosed by the present application.

Although the embodiments of the present application are disclosed above, the present application is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present disclosure, and it is intended that the scope of the present disclosure be defined by the appended claims.

Claims

1. A memory error handling method, applicable to a memory controller, comprising the following steps:

2. The method of claim 1, wherein after the step of sending the recovery command to the memory in which the memory error occurred, the method further comprises:

when the memory recovers to a normal working state, resending the stored resend command to the memory, wherein the resend command comprises memory commands that were sent to the memory but not completely executed.

3. The memory error handling method of claim 2, wherein the resend command includes a refresh command and a non-refresh command;

the step of resending the resending command to the memory comprises:

resending the refresh command and the non-refresh command to the memory according to a refresh command priority principle, wherein the refresh command priority principle means that the refresh command is sent with priority over the non-refresh command.

4. The memory error handling method of claim 3, wherein the step of resending the refresh command and the non-refresh command to the memory on a refresh command priority basis comprises:

5. The method of claim 2, wherein after the step of sending the recovery command to the memory in which the memory error occurred, the method further comprises:

when the memory recovers to a normal working state, recording the accumulated number of times the memory has recovered to the normal working state, and when the accumulated recovery count reaches a preset threshold, executing a memory impedance calibration operation.

6. The memory error handling method of any of claims 1-5, further comprising: when it is determined, according to the type information of the error signal, that the memory error cannot be corrected according to the data check code of the memory error and that no recovery command corresponding to the memory error exists in the pre-stored recovery command sequence, sending a request to restart or reset the memory subsystem.

7. The memory error handling method of claim 6, wherein the step of sending a request to reboot or reset the memory subsystem comprises:

8. The memory error handling method of any of claims 1-5, wherein after the step of sending the recovery command to the memory in which the memory error occurred, further comprising:

9. The memory error handling method of any of claims 1-5, wherein after the step of obtaining the error signal of the memory error, further comprising:

stopping the sending of memory commands to the memory.

10. The memory error handling method of any of claims 1-5, wherein after the step of sending the recovery command to the memory in which the memory error occurred, further comprising:

when the error signal of the memory error is acquired again, determining, according to the type information of the error signal and the pre-stored recovery command sequence, whether to acquire the recovery command corresponding to the memory error again.

11. The memory error handling method of claim 10, wherein after the step of obtaining the error signal of the memory error, the method further comprises:

recording the accumulated error count of acquisitions of the error signal;

when it is determined that a recovery command corresponding to the memory error exists in a pre-stored recovery command sequence according to the type information of the error signal, the step of sending the recovery command to the memory in which the memory error occurs includes:

when the accumulated error count has not reached a count threshold and it is determined, according to the type information of the error signal, that a recovery command corresponding to the memory error exists in the pre-stored recovery command sequence, sending the recovery command to the memory in which the memory error occurs.

12. The memory error handling method of claim 11, further comprising:

when the accumulated error count reaches the count threshold, sending a request to restart or reset the memory subsystem.

13. The memory error handling method of claim 12, wherein after the step of resending the resend command to the memory when the memory recovers to a normal working state, the method further comprises:

setting the accumulated error count to an initial value, wherein the initial value is 0.

14. A memory error handling apparatus, adapted for a memory controller, comprising:

15. The memory error handling device of claim 14, further comprising:

a resend command module, adapted to resend the stored resend command to the memory when the memory recovers to a normal working state, wherein the resend command comprises memory commands that were sent to the memory but not completely executed.

16. The memory error handling device of claim 15, wherein the resend command includes a refresh command and a non-refresh command;

the resending command module is adapted to resend the resending command to the memory, and includes:

17. The memory error handling device of claim 16, wherein the step of resending the refresh command and the non-refresh command to the memory on a refresh command priority basis comprises:

18. The memory error handling device of claim 15, further comprising:

an impedance calibration module, adapted to record, when the memory recovers to a normal working state, the accumulated number of times the memory has recovered to the normal working state, and to execute a memory impedance calibration operation when the accumulated recovery count reaches a preset threshold.

19. The memory error handling device of any of claims 14-18, further comprising:

a memory subsystem restart or reset module, adapted to send a request to restart or reset the memory subsystem when it is determined, according to the type information of the error signal, that no recovery command corresponding to the memory error exists in the pre-stored recovery command sequence.

20. The memory error handling device of claim 19, wherein the memory subsystem restart or reset module, adapted to send the request to restart or reset the memory subsystem, comprises:

21. The memory error handling device of any of claims 14-18, wherein the memory error handling module is further adapted to:

22. The memory error handling device of any of claims 14-18, further comprising:

a memory command sending module, adapted to stop sending memory commands to the memory after the error signal acquisition module acquires the error signal of the memory error.

23. The memory error handling device of any of claims 14-18, wherein the memory error handling module is further adapted to determine, when the error signal of the memory error is acquired again, whether to acquire the recovery command corresponding to the memory error again according to the type information of the error signal and the pre-stored recovery command sequence.

24. The memory error handling device of claim 23, further comprising:

an accumulated error count recording module, adapted to record the accumulated error count of acquisitions of the error signal after the error signal acquisition module acquires the error signal of the memory error;

wherein the memory error handling module is further adapted to send the recovery command to the memory in which the memory error occurs when the accumulated error count has not reached a count threshold and it is determined, according to the type information of the error signal, that a recovery command corresponding to the memory error exists in the pre-stored recovery command sequence.

25. The memory error handling device of claim 24, wherein the memory subsystem restart or reset module is further adapted to send a request to restart or reset the memory subsystem when the accumulated error count reaches the count threshold.

26. The memory error handling device of claim 25, wherein the accumulated error count recording module is further adapted to set the accumulated error count to an initial value after the resend command is resent to the memory when the memory recovers to a normal working state, wherein the initial value is 0.

27. A memory controller comprising a memory error handling apparatus as claimed in any one of claims 14 to 26.

28. A processor comprising a memory error handling apparatus as claimed in any one of claims 14 to 26.
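
For illustration only, the bookkeeping recited in claims 2, 3, 5 and 11 to 13 (resending unfinished commands with refresh commands first, counting errors against a threshold, and counting recoveries to trigger impedance calibration) might be sketched in Python as follows. The class name `RecoveryTracker`, the `"REF"` prefix used to mark refresh commands, the threshold values, and the choice to reset the recovery counter after calibration are all assumptions of this sketch and are not taken from the application.

```python
from collections import deque


class RecoveryTracker:
    """Hypothetical bookkeeping for error counts, recoveries and resends."""

    def __init__(self, error_threshold, calibration_threshold):
        self.error_threshold = error_threshold
        self.calibration_threshold = calibration_threshold
        self.error_count = 0      # accumulated error count
        self.recovery_count = 0   # accumulated recovery count
        self.pending = deque()    # commands sent but not completely executed

    def on_error(self):
        """Count one error signal; True means recovery may still be attempted,
        False means the threshold is reached and a reset should be requested."""
        self.error_count += 1
        return self.error_count < self.error_threshold

    def on_recovered(self):
        """Return (commands to resend in order, whether to calibrate)."""
        # Refresh commands go first; relative order within each group is kept.
        refresh = [c for c in self.pending if c.startswith("REF")]
        others = [c for c in self.pending if not c.startswith("REF")]
        self.pending.clear()
        self.error_count = 0  # reset to the initial value after resending
        self.recovery_count += 1
        calibrate = self.recovery_count >= self.calibration_threshold
        if calibrate:
            self.recovery_count = 0  # assumption: restart counting afterwards
        return refresh + others, calibrate
```

A caller would append each outstanding command to `pending`, call `on_error()` when an error signal arrives, and call `on_recovered()` once the memory reports a normal working state.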