WO2024040502A1

WO2024040502A1 - Apparatus, device, method, and computer program for persisting memory recovery actions

Info

Publication number: WO2024040502A1
Application number: PCT/CN2022/114728
Authority: WO
Inventors: Tao Xu; Shijie Liu; Lei Zhu; Sarathy Jayakumar; Yufu Li
Original assignee: Intel Corporation
Priority date: 2022-08-25
Filing date: 2022-08-25
Publication date: 2024-02-29

Abstract

Examples relate to an apparatus, device, method, and computer program for persisting memory recovery actions, and to a computer system comprising such an apparatus or device. An apparatus or device for persisting memory recovery actions is configured to determine one or more memory recovery actions taken by a memory controller with respect to memory circuitry, and to store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry being co-located with the memory circuitry.

Description

Apparatus, Device, Method, and Computer Program for Persisting Memory Recovery Actions

Field

Examples relate to an apparatus, device, method, and computer program for persisting memory recovery actions, and to a computer system comprising such an apparatus or device.

Background

Reliability, availability, and serviceability (RAS) relates to features that become increasingly important, in particular in server systems, such as server systems used by cloud service providers (CSPs). For a long time, the use of a server indicated the use of one operating system and one application. Because of the relatively inexpensive cost of x86 servers in relation to larger proprietary systems, Information Technology (IT) departments were happy to deploy a server for every application, partitioning applications and creating self-contained failure zones. While simple and cost effective, this containment strategy led to server sprawl. Virtualization solved the sprawl, but also changed this failure dynamic, creating an environment where a CPU failure could impact multiple applications or environments. This change drove the need for more RAS in processors to help maintain better application uptime and help lower total cost of ownership (TCO) for businesses. Each new generation of CPU brought new levels of RAS as business criticality increased. For today’s always-on business world that demands real-time analysis and insight, boosting server uptime pays dividends back to organizations by enabling them to be more agile and flexible. However, this environment also pushes the envelope, as systems need constant availability. Application design has moved from smaller, self-contained applications to a more services-oriented environment that drives a greater CPU RAS need; outages in a horizontal service can now have a more dramatic impact on multiple applications, because the failure zones have grown and their impact is multiplied.

Brief description of the Figures

Some examples of apparatuses and/or methods will be described in the following by way of example only, and with reference to the accompanying figures, in which

Fig. 1a shows a schematic diagram of an example of an apparatus or device for persisting memory recovery actions, and of a computer system comprising such an apparatus or device;

Fig. 1b shows a schematic diagram of an example of a method for persisting memory recovery actions;

Figs. 1c and 1d show schematic diagrams of examples of a computer system comprising an apparatus or device for persisting memory recovery actions;

Fig. 2a shows a schematic diagram of an example of a Dual Inline Memory Module;

Fig. 2b shows a schematic diagram of an example of a High Bandwidth Memory Module;

Fig. 3 shows a schematic diagram of a hardware rank corrected error counter and threshold;

Fig. 4 shows a schematic diagram of a Post Package Repair workflow;

Fig. 5 shows a schematic diagram of a Partial Cache Line Sparing flow;

Fig. 6 shows a schematic diagram of a virtual lockstep used in Double Device Data Correction;

Figs. 7a and 7b show schematic diagrams of an Advanced Double Device Data Correction flow;

Fig. 8 shows a schematic diagram of how Reliability, Availability and Serviceability actions are lost after S3, reset and power off;

Fig. 9 shows a schematic diagram illustrating how Dual Inline Memory Module error information and Reliability, Availability and Serviceability actions are lost after Dual In-line Memory Module migration;

Fig. 10 shows a schematic diagram of the proposed approach to implement Reliability, Availability and Serviceability recovery action replay logic;

Fig. 11 shows a more detailed schematic diagram of the proposed Reliability, Availability and Serviceability recovery action replay logic implementation; and

Fig. 12 shows a table of an example of a Reliability, Availability and Serviceability action data structure.

Detailed Description

Some examples are now described in more detail with reference to the enclosed figures. However, other possible examples are not limited to the features of these embodiments described in detail. Other examples may include modifications of the features as well as equivalents and alternatives to the features. Furthermore, the terminology used herein to describe certain examples should not be restrictive of further possible examples.

Throughout the description of the figures same or similar reference numerals refer to same or similar elements and/or features, which may be identical or implemented in a modified form while providing the same or a similar function. The thickness of lines, layers and/or areas in the figures may also be exaggerated for clarification.

When two elements A and B are combined using an “or”, this is to be understood as disclosing all possible combinations, i.e., only A, only B as well as A and B, unless expressly defined otherwise in the individual case. As an alternative wording for the same combinations, “at least one of A and B” or “A and/or B” may be used. This applies equivalently to combinations of more than two elements.

If a singular form, such as “a”, “an” and “the” is used and the use of only a single element is not defined as mandatory either explicitly or implicitly, further examples may also use several elements to implement the same function. If a function is described below as implemented using multiple elements, further examples may implement the same function using a single element or a single processing entity. It is further understood that the terms “include”, “including”, “comprise” and/or “comprising”, when used, describe the presence of the specified features, integers, steps, operations, processes, elements, components and/or a group thereof, but do not exclude the presence or addition of one or more other features, integers, steps, operations, processes, elements, components and/or a group thereof.

In the following description, specific details are set forth, but examples of the technologies described herein may be practiced without these specific details. Well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring an understanding of this description. “An example,” “various examples,” “some examples,” and the like may include features, structures, or characteristics, but not every example necessarily includes the particular features, structures, or characteristics.

Some examples may have some, all, or none of the features described for other examples. “First,” “second,” “third,” and the like describe a common element and indicate different instances of like elements being referred to. Such adjectives do not imply that the elements so described must be in a given sequence, either temporally or spatially, in ranking, or in any other manner. “Connected” may indicate elements are in direct physical or electrical contact with each other and “coupled” may indicate elements co-operate or interact with each other, but they may or may not be in direct physical or electrical contact.

As used herein, the terms “operating”, “executing”, or “running” as they pertain to software or firmware in relation to a system, device, platform, or resource are used interchangeably and can refer to software or firmware stored in one or more computer-readable storage media accessible by the system, device, platform, or resource, even though the instructions contained in the software or firmware are not actively being executed by the system, device, platform, or resource.

The description may use the phrases “in an example,” “in examples,” “in some examples,” and/or “in various examples,” each of which may refer to one or more of the same or different examples. Furthermore, the terms “comprising,” “including,” “having,” and the like, as used with respect to examples of the present disclosure, are synonymous.

Fig. 1a shows a schematic diagram of an example of an apparatus 10 or device 10 for persisting memory recovery actions. The apparatus 10 comprises circuitry that is configured to provide the functionality of the apparatus 10. For example, the apparatus 10 of Figs. 1a, 1c and 1d comprises interface circuitry 12, processing circuitry 14 and (optional) storage circuitry 16. For example, the processing circuitry 14 may be coupled with the interface circuitry 12 and with the storage circuitry 16. For example, the processing circuitry 14 may be configured to provide the functionality of the apparatus, in conjunction with the interface circuitry 12 (for exchanging information, e.g., with other components inside or outside a computer system 100 comprising the apparatus or device 10, such as a memory controller 102) and the storage circuitry 16 (for storing information, such as machine-readable instructions). Likewise, the device 10 may comprise means that is/are configured to provide the functionality of the device 10. The components of the device 10 are defined as component means, which may correspond to, or be implemented by, the respective structural components of the apparatus 10. For example, the device 10 of Figs. 1a and 1b comprises means for processing 14, which may correspond to or be implemented by the processing circuitry 14, means for communicating 12, which may correspond to or be implemented by the interface circuitry 12, and (optional) means for storing information 16, which may correspond to or be implemented by the storage circuitry 16. In general, the functionality of the processing circuitry 14 or means for processing 14 may be implemented by the processing circuitry 14 or means for processing 14 executing machine-readable instructions. Accordingly, any feature ascribed to the processing circuitry 14 or means for processing 14 may be defined by one or more instructions of a plurality of machine-readable instructions. The apparatus 10 or device 10 may comprise the machine-readable instructions, e.g., within the storage circuitry 16 or means for storing information 16. In the proposed concept, the functionality may be performed as part of a system firmware (e.g., Basic Input/Output System or Unified Extensible Firmware Interface) of the computer system.

The processing circuitry 14 or means for processing 14 is configured to determine one or more memory recovery actions taken by the memory controller 102 with respect to memory circuitry 22. The processing circuitry 14 or means for processing 14 is configured to store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry 24 being co-located with the memory circuitry.

Fig. 1a further shows the computer system 100 comprising the apparatus 10 or device 10, with the computer system further comprising the memory controller 102. For example, as shown in connection with Figs. 8 and 9, the memory controller 102 may be an integrated memory controller (iMC), i.e., a memory controller that is part of a Central Processing Unit (CPU) of the computer system. For example, the memory controller may be part of the processing circuitry 14 or means for processing 14 of the apparatus 10 or device 10. Alternatively, the memory controller 102 may be separate from the CPU and/or the apparatus 10.

In some examples, the memory circuitry 22 may be considered to be part of the computer system 100. If the memory circuitry 22 is part of a Dual Inline Memory Module (DIMM) 20 (as shown in Fig. 2a), the memory circuitry may be considered to be part of the computer system 100 or, since it is removable, to be separate from the computer system 100. In other words, the computer system 100 may comprise the memory circuitry 22. In some examples, as shown in Fig. 2b, the memory circuitry 22 may be part of a High Bandwidth Memory (HBM) module, which is directly coupled with the CPU via a silicon interposer of a package of the CPU. For example, the memory circuitry 22 may be Dynamic Random Access Memory (DRAM) or persistent memory (PMEM).

Fig. 1b shows a schematic diagram of an example of a corresponding method for persisting memory recovery actions. The method comprises determining 110 the one or more memory recovery actions taken by the memory controller 102 with respect to the memory circuitry 22. The method comprises storing 140 the information on the one or more memory recovery actions being taken by the memory controller to the storage circuitry being co-located with the memory circuitry.

In the following, the features of the apparatus 10, device 10, method, computer program and computer system 100 are introduced with respect to the apparatus 10 and computer system 100. Features introduced in connection with the apparatus 10 and/or computer system 100 may likewise be included in the corresponding device 10, method and computer program.

Various examples of the present disclosure are based on the finding that a number of different techniques exist that allow the use of memory circuitry despite persistent errors being present in the memory circuitry. For example, as shown in connection with Figs. 3 to 7, techniques such as Post Package Repair (PPR), Partial Cache Line Sparing (PCLS) or Adaptive Double Device Data Correction (ADDDC) can be used to allow a use of memory circuitry with persistent errors. These techniques are generally based on using redundant memory circuitry that is activated once an error persists (PPR and ADDDC), or on including, in the memory controller, some measure of memory circuitry that can be used in lieu of the erroneous memory circuitry (PCLS). These techniques are usually denoted RAS techniques (or RAS actions), with RAS referring to Reliability, Availability and Serviceability. In the context of the present disclosure, these RAS actions are denoted memory recovery actions. Accordingly, the one or more memory recovery actions comprise one or more of a Post Package Repair (PPR) action, a Partial Cache Line Sparing (PCLS) action, and an Adaptive Double Device Data Correction (ADDDC) action. In effect, the one or more memory recovery actions are actions that allow the use of memory circuitry exhibiting one or more persistent errors.

In general, such RAS actions are taken when the memory errors are persistent errors (in contrast to intermittent errors), i.e., if they persist over time. For these memory recovery actions to work, their use, and the parameters of their use, generally have to be known to the memory controller of the computer system. In particular, these techniques are performed by the respective memory controllers, e.g., in conjunction with redundant memory circuitry present in the memory circuitry or using memory circuitry that is part of the memory controller. When such a memory recovery action is required, e.g., when a correctable error count of some region of memory reaches a threshold, the memory controller sets up the respective memory recovery action and determines the respective parameters thereof (such as the memory recovery action taken, the address(es) affected by the memory recovery action, and the redundant memory circuitry being used), initiates the memory recovery action (e.g., by copying (sparing) the content of the affected memory circuitry to redundant memory circuitry), and keeps a record of the memory recovery action taken and the parameters being used in memory circuitry of the memory controller (as shown in Figs. 8 and 9, where this information is held in the RAS actions registers). However, this information is only stored in memory circuitry of the memory controller; once the computer system is restarted or reset, or some sleep states are used, this information is lost. In general, since the content of the memory is either not needed anymore (after a restart or reset) or restored (in some sleep states), this may be regarded as a mere annoyance, as the persistent error is likely to be detected again after the restart or reset (or restore from sleep), triggering the use of the same (or different) memory recovery actions again. However, in some cases, additional errors may occur, turning correctable errors into uncorrectable errors (while the memory recovery actions are not being applied). Moreover, some techniques, such as MPWR (memory-persistent warm reset), count on the content of the memory being intact after the warm reset. In this case, if the RAS actions being taken are lost, memory loss may occur (as the memory controller is unaware which redundant memory circuitry is being used).

In the proposed concept, such failure scenarios can be avoided by persistently storing the information on the one or more memory recovery actions being taken by the memory controller using storage circuitry, i.e., such that the information on the one or more memory recovery actions being taken by the memory controller can be loaded again after a restart, reset etc. To avoid a loss of this information, e.g., in case the memory circuitry affected is transferred to another computer system (in case of the memory circuitry being part of a DIMM), the information on the one or more memory recovery actions is stored in storage circuitry that is co-located with the respective memory circuitry. For example, the data structure shown in Fig. 12 (or similar data structures) may be used to store the respective information. For example, if the memory circuitry is part of a DIMM, the information on the one or more memory recovery actions being taken by the memory controller can be stored using user-programmable blocks of storage circuitry offered by a Serial Presence Detect (SPD) controller of the DIMM. If the memory circuitry is part of HBM, the information on the one or more memory recovery actions being taken by the memory controller can be stored by or via an HBM controller being used to control the HBM. After the restart, reset, wake-up or DIMM transfer, the memory controller can be provided with the information on the one or more memory recovery actions that is stored in the storage circuitry, and the memory controller can replay (i.e., re-initiate) the one or more memory recovery actions using the stored parameters. Thus, after restart, reset, wake-up or DIMM transfer, the memory circuitry can be operated with the same memory recovery actions and parameters as before the restart, reset, wake-up or DIMM transfer, reducing the likelihood of fatal errors and enabling the use of MPWR with memory recovery actions.

The process starts with determining the one or more memory recovery actions taken by the memory controller 102 with respect to the memory circuitry 22. Generally, the memory controller holds a list or registers (denoted RAS actions registers in Figs. 8 and 9) of the memory recovery action(s) currently in use for the memory circuitry. The processing circuitry 14 may be configured to request information on, or read out, the one or more memory recovery actions taken by the memory controller with respect to the memory circuitry from the memory controller 102.

In addition to the memory recovery actions taken, the parameters being used may also be determined (and later stored using the storage circuitry). In other words, the processing circuitry may be configured to determine parameters of the one or more memory recovery actions, and to store the information on the memory recovery actions with the parameters of the one or more memory recovery actions. Accordingly, as further shown in Fig. 1b, the method may comprise determining 120 parameters of the one or more memory recovery actions and storing 140 the information on the memory recovery actions with the parameters of the one or more memory recovery actions. Such parameters may include one or more of the (physical) memory address(es) on which the respective memory recovery action is applied, the redundant memory circuitry being used by the respective memory recovery action and, in case of PCLS, the content of the memory being stored in the memory circuitry of the memory controller. To make sure all of the memory recovery actions are up to date, both the one or more memory recovery actions taken and the parameters may be determined (or updated) right before the power off, reset, warm reset or sleep. In other words, the determination (or an update) of the one or more memory recovery actions taken and of the parameters may be performed before the (or each) power off event, removable memory module migration event (e.g., DIMM migration event), reset event, memory persistent warm reset event or suspend-to-random-access memory event.
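The kind of record described above (action type plus parameters) can be illustrated with a minimal sketch. The field layout below is an assumption made for illustration only; it is not the data structure of Fig. 12, and the names `ActionType`, `RecoveryAction`, `spare_index` and `pcls_nibble` are hypothetical.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class ActionType(IntEnum):
    """Memory recovery actions named in the description."""
    PPR = 1
    PCLS = 2
    ADDDC = 3


# Hypothetical fixed-size layout (little-endian, 12 bytes):
# action type (1 byte), physical address (8 bytes),
# redundant resource index (2 bytes), PCLS nibble content (1 byte).
_RECORD = struct.Struct("<BQHB")


@dataclass(frozen=True)
class RecoveryAction:
    action: ActionType
    address: int          # physical address the action applies to
    spare_index: int      # which redundant resource is in use
    pcls_nibble: int = 0  # spared data, only meaningful for PCLS

    def pack(self) -> bytes:
        """Serialize the record for a small persistent storage block."""
        return _RECORD.pack(self.action, self.address,
                            self.spare_index, self.pcls_nibble & 0xF)

    @classmethod
    def unpack(cls, raw: bytes) -> "RecoveryAction":
        """Rebuild a record from its serialized form."""
        action, address, spare, nibble = _RECORD.unpack(raw)
        return cls(ActionType(action), address, spare, nibble)
```

A fixed-size record is chosen here because user-programmable storage on a memory module is typically small; a real layout would depend on the platform and on the storage actually available.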

To avoid introducing errors when replaying the one or more memory recovery actions after the respective power off, reset or suspend-to-random-access memory event, the stored information may be protected using bit error detection information or bit error recovery information, such as a CRC (Cyclic Redundancy Check) code or other suitable checksums or codes that can be used to detect or preferably correct bit errors within the stored information. In other words, the processing circuitry may be configured to calculate bit error detection information or bit error recovery information for the information on the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information. Accordingly, as further shown in Fig. 1b, the method may comprise calculating 130 bit error detection information or bit error recovery information for the information on the one or more memory recovery actions and storing 140 the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information. As outlined above, known techniques such as CRC, XOR (exclusive OR) or LDPC (Low-Density Parity-Check) codes may be used. For example, the processing circuitry may be configured to calculate a cyclic redundancy check code or an LDPC code for each of the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the cyclic redundancy check code or LDPC code.
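As a minimal sketch of the error detection variant, the following appends a CRC-32 to a serialized record and verifies it on load. The use of CRC-32 from Python's `zlib` module and the helper names `protect` and `check` are assumptions for illustration; the description leaves the concrete code open, and a CRC only detects errors, it does not correct them.

```python
import struct
import zlib


def protect(payload: bytes) -> bytes:
    """Append a 4-byte little-endian CRC-32 of the payload."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def check(blob: bytes) -> bytes:
    """Return the payload if its CRC-32 matches, else raise ValueError."""
    if len(blob) < 4:
        raise ValueError("blob too short to contain a CRC")
    payload = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: stored record is corrupted")
    return payload
```

A record that fails the check would simply not be replayed, so a corrupted entry cannot cause a wrong recovery action to be configured.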

Once the information to be stored is determined (or updated), it can be stored in the storage circuitry 24 being co-located with the memory circuitry. In this context, co-located may mean that the storage circuitry and the memory circuitry are bundled together, such that the storage circuitry travels together with the memory circuitry. In the following, two scenarios are presented: in one scenario, DIMMs are used; in the other scenario, HBM is used. However, other scenarios are possible as well.

Figs. 1c and 1d show schematic diagrams of examples of a computer system comprising an apparatus or device for persisting memory recovery actions. In Fig. 1c, the memory circuitry is part of a DIMM 20, with the DIMM further comprising an SPD hub 26 and the storage circuitry 24. In Fig. 1d, the memory circuitry is part of an HBM, with the HBM module further comprising an HBM controller 104 and the storage circuitry. In both cases, the storage circuitry is part of the same memory module as the memory circuitry. However, in the case of HBM, the memory is generally not separable from the CPU, in contrast to the memory circuitry included in a DIMM. Accordingly, the memory circuitry may be dynamic random-access memory of a removable memory module 20 or of a non-removable memory module. The processing circuitry may be configured to store the information on the memory recovery actions being taken by the memory controller to storage circuitry hosted by the removable or non-removable memory module. Figs. 2a (DIMM) and 2b (HBM) show the respective memory modules in isolation. Fig. 2a shows a schematic diagram of an example of a Dual Inline Memory Module, while Fig. 2b shows a schematic diagram of an example of a High Bandwidth Memory Module.

In case a removable memory module 20 is used, such as a DIMM, storage circuitry that is included on the removable memory module 20 may be used. In the case of a DIMM, such storage circuitry may be accessed via the SPD hub. Accordingly, the processing circuitry may be configured to store the information on the memory recovery actions being taken by the memory controller using an SPD hub 26 that is also co-located with the memory circuitry. For example, the processing circuitry may be configured to store the information on the one or more memory recovery actions that affect the memory circuitry of the respective DIMM using the storage circuitry of the respective DIMM.

Similarly, in case a non-removable memory module, such as HBM, is used, storage circuitry that is included on the non-removable memory module may be used. For example, the computer system, and in particular the non-removable memory module, may further comprise an HBM controller 104 that is co-located with the memory circuitry 22. In this case, the processing circuitry may be configured to store the information on the memory recovery actions being taken by the memory controller using the HBM controller.

After the respective restart, reset or DIMM transfer event, the memory recovery actions can be read back from the storage circuitry and replayed by the memory controller. The processing circuitry may be configured to load the information on the one or more memory recovery actions from the storage circuitry (e.g., after at least one of a power off event, a removable memory module migration event, a reset event, a memory persistent warm reset event and a suspend-to-random-access memory event), and to configure the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry. Accordingly, as further shown in Fig. 1b, the method may comprise loading 150 the information on the one or more memory recovery actions from the storage circuitry and configuring 170 the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry. The processing circuitry may be configured to check the loaded one or more memory recovery actions using the stored bit error detection information or bit error recovery information, and to correct (i.e., recover) the loaded one or more memory recovery actions using the bit error recovery information if needed. For example, the processing circuitry may be configured to instruct the memory controller to replay the one or more memory recovery actions loaded from the storage circuitry (using the parameters loaded from the storage circuitry).
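The store, reset, load, check and replay sequence can be sketched with a toy model. Everything here is an illustrative assumption: the `MemoryController` class stands in for the volatile RAS actions registers, a plain dictionary stands in for the storage circuitry on the memory module, and JSON with a CRC-32 stands in for the stored format.

```python
import json
import zlib


class MemoryController:
    """Toy stand-in for a controller with volatile RAS action registers."""

    def __init__(self):
        self.ras_actions = []

    def apply(self, action):
        self.ras_actions.append(action)

    def reset(self):
        # Volatile registers lose their content on reset or power off.
        self.ras_actions = []


def persist(controller, module_storage):
    """Store the current actions plus a CRC in the module's storage."""
    payload = json.dumps(controller.ras_actions, sort_keys=True).encode()
    module_storage["ras"] = payload
    module_storage["crc"] = zlib.crc32(payload)


def replay(controller, module_storage):
    """Load, verify and re-apply stored actions; return True on success."""
    payload = module_storage.get("ras")
    if payload is None or zlib.crc32(payload) != module_storage.get("crc"):
        return False  # nothing stored, or the stored data is corrupted
    for action in json.loads(payload):
        controller.apply(action)
    return True
```

Because the dictionary models storage that travels with the module, the same `replay` call would also work on a different controller instance, which corresponds to the DIMM migration case.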

In some cases, it may make sense to re-evaluate the memory recovery actions to be replayed. For example, in some cases, some memory circuitry may be removed before the computer system is restarted, or, in retrospect, a greater efficiency can be obtained by using a memory recovery action that affects an entire row or device of memory circuitry, instead of multiple memory recovery actions affecting portions of the respective row or device. Therefore, the processing circuitry may be configured to evaluate the loaded one or more memory recovery actions, and to alter a memory recovery action taken based on the evaluation. Accordingly, as further shown in Fig. 1b, the method comprises evaluating 160 the loaded one or more memory recovery actions and altering 165 the memory recovery action taken based on the evaluation. In this case, instead of the loaded one or more memory recovery actions, the altered memory recovery action may be applied by the memory controller. For example, the one or more memory recovery actions may be evaluated with respect to efficiency and/or with respect to a change of memory circuitry being available.
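One possible evaluation policy is sketched below. The policy itself is an assumption, not something the description prescribes: actions that refer to a module no longer present are dropped, and several partial actions on the same row are consolidated into a single row-level action once a hypothetical `row_threshold` is reached. The dictionary keys `module`, `row` and `type` and the label `ROW_REPAIR` are likewise illustrative.

```python
from collections import defaultdict


def evaluate(actions, present_modules, row_threshold=3):
    """Return the list of actions to replay after evaluation.

    actions: list of dicts with keys "module", "row" and "type".
    present_modules: set of module identifiers currently installed.
    """
    # Change in available memory circuitry: drop actions for removed modules.
    kept = [a for a in actions if a["module"] in present_modules]

    # Efficiency: group the remaining actions by the row they affect.
    by_row = defaultdict(list)
    for action in kept:
        by_row[(action["module"], action["row"])].append(action)

    result = []
    for (module, row), group in by_row.items():
        if len(group) >= row_threshold:
            # Many partial actions on one row: use one row-level action.
            result.append({"module": module, "row": row, "type": "ROW_REPAIR"})
        else:
            result.extend(group)
    return result
```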

The interface circuitry 12 or means for communicating 12 may correspond to one or more inputs and/or outputs for receiving and/or transmitting information, which may be in digital (bit) values according to a specified code, within a module, between modules or between modules of different entities. For example, the interface circuitry 12 or means for communicating 12 may comprise circuitry configured to receive and/or transmit information.

For example, the processing circuitry 14 or means for processing 14 may be implemented using one or more processing units, one or more processing devices, any means for processing, such as a processor, a computer or a programmable hardware component being operable with accordingly adapted software. In other words, the described function of the processing circuitry 14 or means for processing may as well be implemented in software, which is then executed on one or more programmable hardware components. Such hardware components may comprise a general-purpose processor, a Digital Signal Processor (DSP), a micro-controller, etc.

For example, the storage circuitry 16 or means for storing information 16 may comprise at least one element of the group of a computer readable storage medium, such as a magnetic or optical storage medium, e.g., a hard disk drive, a flash memory, Floppy-Disk, Random Access Memory (RAM) , Programmable Read Only Memory (PROM) , Erasable Programmable Read Only Memory (EPROM) , an Electronically Erasable Programmable Read Only Memory (EEPROM) , or a network storage.

For example, the computer system 100 may be a workstation computer system (e.g., a work-station computer system being used for scientific computation) or a server computer system, i.e., a computer system being used to serve functionality, such as the computer program, to one or client computers.

More details and aspects of the apparatus 10, device 10, method, computer program, computer system 100 and memory module 20 are mentioned in connection with the proposed concept or one or more examples described above or below (e.g., Fig. 3 to 12) . The apparatus 10, device 10, method, computer program, computer system 100 and memory module 20 may comprise one or more additional optional features corresponding to one or more aspects of the proposed concept, or one or more examples described above or below.

Various examples of the present disclosure relate to a concept, e.g., (software-based) method and apparatus, to implement RAS (Reliability, Availability, Serviceability) recovery actions replay, e.g., (Double Data Rate 5, DDR5) RAS recovery actions replay, for server and cloud systems.

The memory controller, e.g., the Integrated Memory Controller (iMC) integrated within the Central Processing Unit (CPU), of a computer system generally implements a CE (Corrected Error) counter and threshold per rank. When an error counter equals the error threshold, the memory controller sets that rank’s “corrected error threshold overflow” status bit (as it is likely that a persistent error has occurred) and signals an interrupt to the computer system’s system firmware (e.g., Basic Input/Output System, BIOS). The system firmware then takes a suitable RAS (Reliability, Availability, and Serviceability) action to recover the system. An example of such a procedure is shown in Fig. 3. Fig. 3 shows a schematic diagram of a hardware rank corrected error counter and threshold (for channel 0 310). Each time a drip occurs (i.e., a CE is detected), the correctable error counter Corr_Err_cnt is incremented and compared 320; 330 with the correctable error threshold Corr_Err_threshld (for the specific rank, rank 0 in this example, with 8 ranks per channel). The procedure is performed for every rank (and channel). The result is saved in a per-rank “corrected error threshold overflow” status bit. If one of the status bits is set to 1 (as combined by an OR-gate 340, for example), an error interrupt is generated, and an appropriate RAS action 350, such as PCLS (Partial Cache Line Sparing), PPR (Post Package Repair), ADDDC (Adaptive Double Device Data Correction) or bank sparing, is performed.
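The per-rank counter, threshold and OR-combined interrupt can be modelled in a few lines. This is a behavioural sketch only; the class name `Channel`, the default threshold value and the method names are assumptions, while the eight ranks per channel and the "counter equals threshold" condition follow the description.

```python
class Channel:
    """Behavioural model of per-rank corrected error counters for one channel."""

    def __init__(self, ranks=8, threshold=4):
        self.threshold = threshold          # Corr_Err_threshld (per rank)
        self.count = [0] * ranks            # Corr_Err_cnt (per rank)
        self.overflow = [False] * ranks     # per-rank overflow status bits

    def corrected_error(self, rank):
        """Record one corrected error; return True if an interrupt is raised."""
        self.count[rank] += 1
        if self.count[rank] == self.threshold:
            self.overflow[rank] = True
        return self.interrupt_pending()

    def interrupt_pending(self):
        # OR over all per-rank status bits, as with the OR-gate of Fig. 3.
        return any(self.overflow)
```

In this model the system firmware would react to a `True` return value by inspecting `overflow` to find the affected rank and selecting a recovery action for it.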

PPR is a DRAM (Dynamic Random Access Memory) feature that is based on mapping out bad rows and replacing them with redundant rows. It can be used to improve the yield rate of DRAM manufacturing. However, the system firmware can also leverage this feature for boot time recovery. PPR was introduced with DDR4. Fig. 4 shows a schematic diagram of the PPR workflow. When accessing the memory, the address being used to access the memory is provided both to the word line decoder 410 and to a comparator 440. The word line decoder 410 is connected to the rows of memory 420 via programmable devices 430. In case a faulty row is detected, the programmable device of that row (shown as solid black dots in Fig. 4) is set to block access to the respective row, and the row is replaced by a redundant row (indicated by the arrows in Fig. 4). The redundant rows are accessed via the comparator 440, which uses a faulty row address memory 450 and a corresponding smaller decoder/multiplexer 460 to access the corresponding redundant rows. The content of the respective rows is then read out by the bit line decoder and sense amplifiers 470.
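
As a rough illustration of the row replacement described for Fig. 4, the following sketch models the faulty row address memory as a lookup table that redirects accesses to redundant rows. The class, its methods and the row numbers are hypothetical and do not correspond to any DRAM command interface.

```python
class RowRepairMap:
    """Toy model of PPR-style row remapping inside one bank."""

    def __init__(self, spare_rows):
        self._free_spares = list(spare_rows)  # redundant rows not yet used
        self._remap = {}                      # faulty row -> redundant row

    def repair(self, faulty_row: int) -> bool:
        """Map out `faulty_row`; return False if no redundant row is left."""
        if faulty_row in self._remap:
            return True  # already repaired
        if not self._free_spares:
            return False
        self._remap[faulty_row] = self._free_spares.pop(0)
        return True

    def resolve(self, row: int) -> int:
        """Return the physical row that an access to `row` actually reaches."""
        # Plays the role of the comparator and faulty row address memory
        return self._remap.get(row, row)


bank = RowRepairMap(spare_rows=[1024, 1025])
bank.repair(17)
```

After the call, an access to row 17 resolves to the first redundant row, while all other rows resolve to themselves.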

PCLS (illustrated in Fig. 5) is a feature that has been introduced recently on some server platforms. PCLS is a sparing technique that detects a single-bit persistent fault within a cache line and then replaces the entire nibble (4 bits) with spare capacity within the CPU (e.g., within the integrated memory controller, iMC). Spare capacity may be added within the CPU (iMC). For example, the spare capacity may be parity protected. In current implementations, up to 16 single DRAM nibbles can be replaced per memory channel. Fig. 5 shows a schematic diagram of a PCLS feature flow. Fig. 5 shows a CPU 510 with CHAs (Caching Home Agents) and cores 520 and two integrated memory controllers, iMC 0 530 and iMC 1 535. The computer system comprising the CPU further comprises DIMMs (Dual In-line Memory Modules) 540; 545 connected to the two integrated memory controllers. In DIMM1 connected to iMC0 530, a persistent bit error has occurred in one 550 of the devices of the DIMM. When the address (0xffvvv in this case) of the affected bit is read (1) and access to the DIMM returns an error (2), a PCLS Spare Data Buffer of iMC0 530 (comprising multiple entries, with each entry comprising an index, an address and the data being stored) is used to return (3) the data stored in the defective bit.
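
The nibble replacement performed with the PCLS Spare Data Buffer can be sketched in software as follows. This is a simplified model under stated assumptions: the entry layout (address, nibble position, spare nibble), the class and method names, and the example address are hypothetical; only the limit of 16 spared nibbles per channel comes from the description above.

```python
MAX_SPARES_PER_CHANNEL = 16  # limit quoted in the text for current implementations


class PclsSpareBuffer:
    """Toy spare data buffer: one spare nibble per tracked address."""

    def __init__(self):
        # address -> (nibble position within the line, known-good nibble)
        self._entries = {}

    def add_spare(self, address: int, nibble_pos: int, good_nibble: int) -> bool:
        """Track a faulty nibble; return False when the buffer is full."""
        full = len(self._entries) >= MAX_SPARES_PER_CHANNEL
        if address not in self._entries and full:
            return False
        self._entries[address] = (nibble_pos, good_nibble & 0xF)
        return True

    def read(self, address: int, raw_line: int) -> int:
        """Return the line read from DRAM with the spare nibble patched in."""
        entry = self._entries.get(address)
        if entry is None:
            return raw_line
        nibble_pos, good_nibble = entry
        shift = 4 * nibble_pos
        return (raw_line & ~(0xF << shift)) | (good_nibble << shift)


spares = PclsSpareBuffer()
spares.add_spare(address=0xFF000, nibble_pos=2, good_nibble=0x6)
patched = spares.read(0xFF000, raw_line=0x12345278)  # DRAM returned a flipped bit
```

Here the DRAM returns a line whose third nibble has one flipped bit, and the read path substitutes the spare nibble so that the caller sees the intended value.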

PCLS is mainly designed for HBM (High Bandwidth Memory). In HBM, the HBM die is packaged with the CPU in the same socket. If a hard failure occurs on the HBM, without PCLS, the system firmware would have to disable the HBM channel or the whole HBM, depending on the failure location. PCLS can provide recovery for these errors.

In server systems, memory device failures, if not recovered, can result in service events or even server crashes. As modern servers implement larger memory arrays, the likelihood of a memory device failure increases. Current studies indicate that around 50% of hard failures are single-bit failures. Any increase in weak bits due to scaling will also result in single-bit failures. Autonomous driving systems may require CPUs with improved reliability and stability, as memory device failures, if not corrected, could lead to serious accidents.

ADDDC is an improved implementation of Double Device Data Correction (DDDC) in x4 DRAM. ADDDC is based on the insight that, in DDDC, only 2 ECC (Error Checking and Correction) devices may be required to implement a chip-kill ECC. Four ECC devices are available in a lockstep configuration, of which two are idle until the first failure. ADDDC deals with the failed bank/rank sequentially. Virtual lockstep (VL) is implemented as intra-channel lockstep. Fig. 6 shows a schematic diagram of the virtual lockstep used in DDDC. Fig. 6 shows rank x 610 and rank y 620 before sparing and rank x 630 and rank y 640 after sparing. On the left, the cache line layout before DRAM sparing is shown, with ECC spread over 18x4 devices on 1 DIMM. A failed bank is present in rank x. After sparing, a split cache line is used with reads of 32 B (DDDC: 36x4 devices over 2 regions in virtual lockstep). After sparing, the full line may be read from the high 32 B of rank y 640 and the low 32 B of rank x 630, for example.

Now, assuming that device 0 in bank 0/rank A fails, the system firmware is invoked when the CE threshold is reached. The system firmware may trigger the virtual lockstep and invoke the spare copying routine. After these actions, the two banks are bound by an active virtual lockstep. Further memory writes to the affected banks now have their codewords split across the two banks. This may be accomplished through two truncated (burst chop) transactions to each of the banks in an adaptive virtual lockstep set. The benefit of this arrangement is that the two cache lines involved in an adaptive virtual lockstep gang share the burden of dealing with the dead device and are able to retrieve the correct information in spite of the missing chip. Additional single-bit errors can still be corrected through the application of ECC. Figs. 7a and 7b show schematic diagrams of an ADDDC flow. As shown in Fig. 7a, a first strike occurs at rank A, device 0, bank 0 (710 in Figs. 7a and 7b). In the example shown in Figs. 7a and 7b, spare device 17 in bank 0/rank A (720 in Fig. 7b) is used to replace the bad device 0 in bank 0/rank A after the sparing copy. The failed bank is put in virtual lockstep (VLS), then device 0, bank 0 is mapped out by copying its content to bank 0 in the spare DRAM device (17) and mapping out device 0 afterwards. So, device 0 in bank 0/rank A is mapped out after the sparing copy is done.

In general, the system firmware takes the appropriate RAS recovery action (such as PPR, PCLS or ADDDC) to recover persistent errors (e.g., DDR5 persistent errors) when the CE (Corrected Error) threshold is reached at runtime. For example, the system firmware may implement the RAS action by programming the iMC RAS action registers and sending a pcode (portable code) command to the pcode (portable code machine) according to the error recorded at runtime. However, these register settings may be lost after S3 (Suspend-to-RAM), MPWR (Memory Persistent Warm Reset), system reset, and power off. In this case, there is no RAS recovery action to provide system reliability after reset or power off. The persistent memory CE may still exist in the system after reset or power off. If a transient CE then happens on another device when the system accesses the device that has the persistent CE, the combined error may become a UCE (uncorrectable error), and the system may crash due to the UCE.

Fig. 8 shows a schematic diagram of how RAS actions are lost after S3, reset and power off. Fig. 8 shows a CPU 810 with CHAs/cores 820 and two iMCs 830; 835 that provide access to DIMMs 840; 845. As shown on the top of Fig. 8, Dev0 of DIMM0 is defective. The RAS action registers of iMC0 store information on the RAS actions being taken with respect to this device (e.g., PPR or ADDDC). The persistent CE in Dev0 is thus recovered by the RAS action. After S3, MPWR, reset or power off, these settings are gone (as shown on the bottom of Fig. 8). Thus, after S3, MPWR, reset or power off, the persistent CE still exists in Dev0, but the RAS action is lost, and no RAS recovery action is taken to improve system reliability. If a transient CE of Dev2 now occurs in the same cache line as the persistent CE in Dev0, a UCE occurs when the system accesses the cache line, and the system crashes.

As a result, seamless MPWR might not be compatible with several RAS features. Seamless MPWR may be blocked after MPWR if the RAS action “ADDDC” or mirroring happens at runtime. For example, the OS (Operating System) might not be able to get the correct data from memory because the high half cache line has been swapped between failed rank/bank and buddy rank/bank after ADDDC, but the ADDDC setting in the iMC is lost after MPWR.

In Fig. 9, a different scenario is shown, where a DIMM is migrated from one system to another. If the DIMM is migrated from one system (system A, shown on top in Fig. 9) to another system (system B, shown on the bottom in Fig. 9) after the system firmware has taken a RAS recovery action to recover a DDR5 persistent error when the CE (Corrected Error) threshold was reached at runtime, the error information and the recovery information are lost, and system B is not aware of the DIMM error information and error recovery action. As a result, there is no RAS recovery action to improve system reliability after DIMM migration. The persistent memory CE still exists in the system (system B) after DIMM migration. If a transient CE happens on another device when the system accesses the device that has the persistent CE, the combined error may become a UCE (uncorrectable error), and the system may crash due to the UCE.

Fig. 9 shows a schematic diagram illustrating how DIMM error information and RAS actions are lost after DIMM migration. As shown on the top of Fig. 9, Dev0 of DIMM0 is defective. The RAS action registers of iMC0 store information on the RAS actions being taken with respect to this device (e.g., PPR or ADDDC), so the persistent CE in Dev0 is recovered by the RAS action. As shown in the bottom half, after DIMM migration, the persistent CE still exists in Dev0, but the RAS action is lost, so no RAS recovery action is performed to improve system reliability.

The proposed concept may provide a software functionality to implement a (DDR5) RAS recovery actions replay after MPWR, system reset or DIMM migration. Using the proposed functionality, the system software may be aware of DIMM error information and RAS memory recovery actions after MPWR, system reset or DIMM migration by means of the proposed memory RAS action replay. The proposed concept may improve system reliability, availability and stability and reduce the system crash rate by using RAS recovery action replay before uncorrectable errors occur. Additionally, the proposed concept may enable a coexistence of RAS features and seamless MPWR.

The system firmware (e.g., BIOS or UEFI (Unified Extensible Firmware Interface), e.g., the apparatus 10 or device 10 introduced in connection with Figs. 1a to 1d) may implement the proposed concept by creating a RAS action list table (e.g., the information on the one or more recovery actions) with a predefined RAS action info structure. Additionally, it may calculate a Cyclic Redundancy Check (CRC) code at runtime (the CRC is used to protect the RAS action data), then save both the RAS action data and the CRC value to the (DDR5) SPD (Serial Presence Detect) user-programmable blocks 10 to 15 (new blocks may be used for system management during runtime), so that the RAS action info can be retrieved after reset, power off or DIMM migration. The system firmware may read back the RAS action information and CRC value from the SPD and perform a CRC check after reset, power off or DIMM migration, and then execute a RAS action replay according to the execution steps of each RAS action (e.g., triggering the SMI (System Management Interface), programming the corresponding registers, sending commands to the pCode to start the sparing copy and notifying the OS of the RAS replay action via APEI (ACPI Platform Error Interfaces, with ACPI being the Advanced Configuration and Power Interface)).

Fig. 10 shows a schematic diagram of the proposed approach to implement RAS recovery action replay logic (e.g., for DDR5). At runtime 1010, when the system firmware receives a CE threshold overflow interrupt, the system firmware (e.g., the apparatus 10 or device 10 introduced in connection with Figs. 1a to 1d) takes the RAS action 1020 according to the CE count at runtime. The system firmware further creates a RAS action list table and calculates the corresponding CRC, then saves both to the SPD user blocks of the (DDR5) DIMM 1030. After S3, MPWR, reset, power off or DIMM migration 1040, at next boot time 1050, the system firmware reads back the RAS action information from the DIMM SPD 1030, performs a CRC check, and executes the RAS action replay according to the SPD RAS action information. The proposed concept may thus provide a software memory RAS recovery action replay implementation after S3, system reset and power off and after DIMM migration.
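
The save and read-back steps of Fig. 10 can be sketched as follows, with the SPD modeled as a plain byte array. This is an illustration under several assumptions that are not part of the description: a 1024-byte SPD with 64-byte blocks, a hypothetical record and header layout, and CRC-32 (via Python's `zlib.crc32`) as the CRC variant, since the description does not specify the CRC width or polynomial.

```python
import struct
import zlib

SPD_BLOCK_SIZE = 64          # assumed block size in bytes
USER_BLOCKS = range(10, 16)  # user-programmable blocks 10 to 15
USER_OFFSET = USER_BLOCKS.start * SPD_BLOCK_SIZE
USER_SIZE = len(USER_BLOCKS) * SPD_BLOCK_SIZE

# Hypothetical record: index, action type, rank, bank, device, row, column
RECORD = struct.Struct("<BBBBBIH")
HEADER = struct.Struct("<BI")  # record count, CRC-32 over all records


def save_actions(spd: bytearray, actions) -> None:
    """Serialize the RAS action list and its CRC into the user blocks."""
    payload = b"".join(RECORD.pack(*action) for action in actions)
    blob = HEADER.pack(len(actions), zlib.crc32(payload)) + payload
    if len(blob) > USER_SIZE:
        raise ValueError("RAS action list does not fit in the user blocks")
    spd[USER_OFFSET:USER_OFFSET + len(blob)] = blob


def load_actions(spd: bytearray):
    """Read the list back; return None if the CRC check fails."""
    count, stored_crc = HEADER.unpack_from(spd, USER_OFFSET)
    start = USER_OFFSET + HEADER.size
    payload = bytes(spd[start:start + count * RECORD.size])
    if len(payload) != count * RECORD.size or zlib.crc32(payload) != stored_crc:
        return None
    return [RECORD.unpack_from(payload, i * RECORD.size) for i in range(count)]


spd_image = bytearray(1024)  # stand-in for the SPD device on the DIMM
save_actions(spd_image, [(0, 1, 0, 2, 5, 0x1A2B, 0)])
restored = load_actions(spd_image)
```

Because the data travels with the byte array that stands in for the DIMM's SPD, the same `load_actions` call would work on a different host, which mirrors the DIMM migration case.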

The proposed concept may improve system reliability, availability and stability using a software functionality, thereby reducing the system crash rate. The proposed concept may further improve DDR5 memory error handling coverage. The proposed concept is extensible and can be extended to HBM and the next generation of DDR (DDR6+), and to seamless OOB (Out-Of-Band) RAS if the server/CSP uses the BMC (Baseboard Management Controller) to handle memory CEs. The proposed concept may enable RAS features to co-exist with seamless MPWR.

In some examples of the present disclosure, as shown in Fig. 11, the system firmware (e.g., the apparatus 10 or device 10 introduced in connection with Figs. 1a to 1d) may implement the proposed concept by creating a RAS action list table and calculating the CRC (the CRC is used to protect the RAS action data) at runtime, then saving both the RAS action data and the CRC value to the (DDR5) SPD user-programmable blocks 10 to 15 following a predesigned format, so that the RAS action info can be retrieved after reset, power off or DIMM migration. The system firmware may read back the RAS action info and CRC value from the SPD and perform a CRC check after reset, power off or DIMM migration, then execute the RAS action replay (triggering the SMI, programming the corresponding registers, sending a command to the pCode to start the sparing copy and notifying the OS of the RAS replay action via APEI (ACPI Platform Error Interfaces)).

Fig. 11 shows a more detailed schematic diagram of the proposed (DDR5) RAS recovery action replay logic implementation. The implementation starts at runtime (of the system firmware/BIOS) 1110, when the CPU triggers 1120 the system firmware via the system management interface after a CE threshold overflow. The system firmware takes 1130 the RAS action according to the CE error record at runtime and notifies the OS via the APEI. The system firmware creates 1140 the RAS action list table and calculates the CRC, then saves both the RAS action data and the CRC value to the DDR5 SPD user blocks. After S3, MPWR, reset, power off or DIMM migration, at the next system firmware boot time 1160, silicon initialization 1170 is performed, followed by RAS initialization 1180. The RAS initialization includes enabling 1182 the memory ECC mode and setting the CE threshold, reading 1184 the RAS action data and CRC value from the SPD and performing the CRC check, and, if the CRC is correct, calling 1186 the RAS action handler at POST time to execute the RAS action replay. If the CRC is not correct, the replay may be skipped.
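
The boot-time portion of Fig. 11 (CRC check, then either replay or skip) may be sketched as a small dispatch routine. The two-byte record encoding, the action type codes and the handler callbacks are hypothetical placeholders for the register programming and pCode commands that real firmware would perform, and CRC-32 is again an assumed CRC variant.

```python
import zlib

PPR, ADDDC = 1, 2  # hypothetical action type codes


def replay_at_boot(payload: bytes, stored_crc: int, handlers: dict) -> int:
    """Replay persisted RAS actions; return how many were replayed."""
    # Skip the replay entirely if the stored data fails the CRC check
    if len(payload) % 2 or zlib.crc32(payload) != stored_crc:
        return 0
    replayed = 0
    # Hypothetical encoding: two bytes per record (action type, rank)
    for offset in range(0, len(payload), 2):
        action_type, rank = payload[offset], payload[offset + 1]
        handler = handlers.get(action_type)
        if handler is not None:
            handler(rank)  # stands in for the per-action replay steps
            replayed += 1
    return replayed


log = []
handlers = {
    PPR: lambda rank: log.append(("PPR", rank)),
    ADDDC: lambda rank: log.append(("ADDDC", rank)),
}
payload = bytes([PPR, 0, ADDDC, 3])
count = replay_at_boot(payload, zlib.crc32(payload), handlers)
```

Keeping one handler per action type reflects the idea that each RAS action has its own execution steps, while the CRC gate ensures that corrupted data never reaches a handler.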

Fig. 12 shows a table of an example of a RAS action data structure. As shown in Fig. 12, such a data structure may comprise an index field, an “error record used by RAS action” field (specifying which RAS action is used for which rank, bank, device, row, and column, with an x indicating that the error information is not used by the respective RAS action), and a RAS action type field (e.g., Runtime PPR, ADDDC etc.).
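
One possible in-memory representation of such a data structure is sketched below. The field names and types are assumptions, `None` is used to play the role of the "x" entries, and the example values (including which fields a Runtime PPR record uses) are hypothetical and are not taken from Fig. 12.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RasActionType(Enum):
    RUNTIME_PPR = "Runtime PPR"
    ADDDC = "ADDDC"


@dataclass(frozen=True)
class RasActionRecord:
    """One row of a RAS action list table (illustrative layout)."""

    index: int
    action_type: RasActionType
    # Error record fields; None means "not used by this RAS action"
    rank: Optional[int] = None
    bank: Optional[int] = None
    device: Optional[int] = None
    row: Optional[int] = None
    column: Optional[int] = None

    def used_fields(self):
        """List the error record fields this action actually relies on."""
        names = ("rank", "bank", "device", "row", "column")
        return [name for name in names if getattr(self, name) is not None]


record = RasActionRecord(0, RasActionType.RUNTIME_PPR, rank=0, bank=2, row=0x1A2B)
fields_in_use = record.used_fields()
```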

In general, saving and restoring of context is already a core part of OS sleep states. However, such context does not include RAS functionality. Moreover, such context does not persist over reset, restart, or DIMM changes. The proposed concept provides a software-based concept for implementing DDR5 RAS recovery actions replay after MPWR (memory persistent warm reset, a key Seamless feature), system reset, power off and DIMM migration. Besides save/restore, the system firmware may create a RAS action list table with a predefined new RAS action info structure and calculate a CRC (the CRC is used to protect the RAS action data) at runtime, then save both the RAS action data and the CRC value, e.g., to the DDR5 SPD user-programmable blocks 10 to 15, so that the RAS action info can be retrieved after MPWR, reset, power off or DIMM migration. The system firmware may also read back the RAS action info and CRC value from the SPD and perform a CRC check after reset, power off or DIMM migration, then execute the RAS action replay according to the execution steps of each RAS action.

More details and aspects of the concept of RAS recovery actions replay are mentioned in connection with the proposed concept or one or more examples described above or below (e.g., Figs. 1a to 2b). The concept of RAS recovery actions replay may comprise one or more additional optional features corresponding to one or more aspects of the proposed concept, or one or more examples described above or below.

In the following, some examples of the proposed concept are presented:

An example (e.g., example 1) relates to an apparatus (10) for persisting memory recovery actions, the apparatus comprising interface circuitry (12) , machine-readable instructions, and processing circuitry (14) to execute the machine-readable instructions to determine one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22) . The machine-readable instructions comprise instructions to store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

Another example (e.g., example 2) relates to a previously described example (e.g., example 1) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to load the information on the one or more memory recovery actions from the storage circuitry, and to configure the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry.

Another example (e.g., example 3) relates to a previously described example (e.g., example 2) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to load the information on the one or more memory recovery actions from the storage circuitry after at least one of a power off event, a removable memory module migration event, a reset event, a memory persistent warm reset event and a suspend-to-random-access memory event.

Another example (e.g., example 4) relates to a previously described example (e.g., one of the examples 2 to 3) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to evaluate the loaded one or more memory recovery actions, and to alter a memory recovery action taken based on the evaluation.

Another example (e.g., example 5) relates to a previously described example (e.g., one of the examples 1 to 4) or to any of the examples described herein, further comprising that the memory circuitry is dynamic random-access memory of a removable memory module (20) , the machine-readable instructions comprising instructions to store the information on the memory recovery actions being taken by the memory controller to storage circuitry hosted by the removable memory module.

Another example (e.g., example 6) relates to a previously described example (e.g., example 5) or to any of the examples described herein, further comprising that the removable memory module is a Dual In-line Memory Module (DIMM) .

Another example (e.g., example 7) relates to a previously described example (e.g., one of the examples 1 to 6) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to store the information on the memory recovery actions being taken by the memory controller using a Serial Presence Detect (SPD) hub (26) co-located with the memory circuitry.

Another example (e.g., example 8) relates to a previously described example (e.g., one of the examples 1 to 7) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to store the information on the memory recovery actions being taken by the memory controller using a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry.

Another example (e.g., example 9) relates to a previously described example (e.g., one of the examples 1 to 8) or to any of the examples described herein, further comprising that the one or more memory recovery actions comprise one or more of a Post Package Repair (PPR) action, a Partial Cache Line Sparing (PCLS) action, and an Adaptive Double Device Data Correction (ADDDC) action.

Another example (e.g., example 10) relates to a previously described example (e.g., one of the examples 1 to 9) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to calculate bit error detection information or bit error recovery information for the information on the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information.

Another example (e.g., example 11) relates to a previously described example (e.g., example 10) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to calculate a cyclic redundancy check code for each of the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the cyclic redundancy check code.

Another example (e.g., example 12) relates to a previously described example (e.g., one of the examples 1 to 11) or to any of the examples described herein, further comprising that the machine-readable instructions comprise instructions to determine parameters of the one or more memory recovery actions, and to store the information on the memory recovery actions with the parameters of the one or more memory recovery actions.

An example (e.g., example 13) relates to a computer system (100) comprising the apparatus (10) according to one of the examples 1 to 12 (or according to any other example) and the memory controller (102) .

Another example (e.g., example 14) relates to a previously described example (e.g., example 13) or to any of the examples described herein, further comprising that the computer system further comprises the memory circuitry (22) .

Another example (e.g., example 15) relates to a previously described example (e.g., one of the examples 13 to 14) or to any of the examples described herein, further comprising that the computer system further comprises a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry (22) , wherein the machine-readable instructions comprise instructions to store the information on the memory recovery actions being taken by the memory controller using the HBM controller.

An example (e.g., example 16) relates to an apparatus (10) for persisting memory recovery actions, the apparatus comprising processing circuitry (14) configured to determine one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22) . The processing circuitry is configured to store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

Another example (e.g., example 17) relates to a previously described example (e.g., example 16) or to any of the examples described herein, further comprising that the processing circuitry is configured to load the information on the one or more memory recovery actions from the storage circuitry, and to configure the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry.

Another example (e.g., example 18) relates to a previously described example (e.g., example 17) or to any of the examples described herein, further comprising that the processing circuitry is configured to load the information on the one or more memory recovery actions from the storage circuitry after at least one of a power off event, a removable memory module migration event, a reset event, a memory persistent warm reset event and a suspend-to-random-access memory event.

Another example (e.g., example 19) relates to a previously described example (e.g., one of the examples 17 to 18) or to any of the examples described herein, further comprising that the processing circuitry is configured to evaluate the loaded one or more memory recovery actions, and to alter a memory recovery action taken based on the evaluation.

Another example (e.g., example 20) relates to a previously described example (e.g., one of the examples 16 to 19) or to any of the examples described herein, further comprising that the memory circuitry is dynamic random-access memory of a removable memory module (20) , the processing circuitry being configured to store the information on the memory recovery actions being taken by the memory controller to storage circuitry hosted by the removable memory module.

Another example (e.g., example 21) relates to a previously described example (e.g., example 20) or to any of the examples described herein, further comprising that the removable memory module is a Dual In-line Memory Module (DIMM) .

Another example (e.g., example 22) relates to a previously described example (e.g., one of the examples 16 to 21) or to any of the examples described herein, further comprising that the processing circuitry is configured to store the information on the memory recovery actions being taken by the memory controller using a Serial Presence Detect (SPD) hub (26) co-located with the memory circuitry.

Another example (e.g., example 23) relates to a previously described example (e.g., one of the examples 16 to 22) or to any of the examples described herein, further comprising that the processing circuitry is configured to store the information on the memory recovery actions being taken by the memory controller using a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry.

Another example (e.g., example 24) relates to a previously described example (e.g., one of the examples 16 to 23) or to any of the examples described herein, further comprising that the one or more memory recovery actions comprise one or more of a Post Package Repair (PPR) action, a Partial Cache Line Sparing (PCLS) action, and an Adaptive Double Device Data Correction (ADDDC) action.

Another example (e.g., example 25) relates to a previously described example (e.g., one of the examples 16 to 24) or to any of the examples described herein, further comprising that the processing circuitry is configured to calculate bit error detection information or bit error recovery information for the information on the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information.

Another example (e.g., example 26) relates to a previously described example (e.g., example 25) or to any of the examples described herein, further comprising that the processing circuitry is configured to calculate a cyclic redundancy check code for each of the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the cyclic redundancy check code.

Another example (e.g., example 27) relates to a previously described example (e.g., one of the examples 16 to 26) or to any of the examples described herein, further comprising that the processing circuitry is configured to determine parameters of the one or more memory recovery actions, and to store the information on the memory recovery actions with the parameters of the one or more memory recovery actions.

An example (e.g., example 28) relates to a computer system (100) comprising the apparatus (10) according to one of the examples 16 to 27 (or according to any other example) and the memory controller (102) .

Another example (e.g., example 29) relates to a previously described example (e.g., example 28) or to any of the examples described herein, further comprising that the computer system further comprises the memory circuitry (22) .

Another example (e.g., example 30) relates to a previously described example (e.g., one of the examples 28 to 29) or to any of the examples described herein, further comprising that the computer system further comprises a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry (22) , wherein the processing circuitry is configured to store the information on the memory recovery actions being taken by the memory controller using the HBM controller.

An example (e.g., example 31) relates to a device (10) for persisting memory recovery actions, the device comprising means for processing (14) configured to determine one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22). The means for processing is configured to store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

Another example (e.g., example 32) relates to a previously described example (e.g., example 31) or to any of the examples described herein, further comprising that the means for processing is configured to load the information on the one or more memory recovery actions from the storage circuitry, and to configure the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry.

Another example (e.g., example 33) relates to a previously described example (e.g., example 32) or to any of the examples described herein, further comprising that the means for processing is configured to load the information on the one or more memory recovery actions from the storage circuitry after at least one of a power off event, a removable memory module migration event, a reset event, a memory persistent warm reset event and a suspend-to-random-access memory event.

Another example (e.g., example 34) relates to a previously described example (e.g., one of the examples 32 to 33) or to any of the examples described herein, further comprising that the means for processing is configured to evaluate the loaded one or more memory recovery actions, and to alter a memory recovery action taken based on the evaluation.

Another example (e.g., example 35) relates to a previously described example (e.g., one of the examples 31 to 34) or to any of the examples described herein, further comprising that the memory circuitry is dynamic random-access memory of a removable memory module (20) , the means for processing being configured to store the information on the memory recovery actions being taken by the memory controller to storage circuitry hosted by the removable memory module.

Another example (e.g., example 36) relates to a previously described example (e.g., example 35) or to any of the examples described herein, further comprising that the removable memory module is a Dual In-line Memory Module (DIMM) .

Another example (e.g., example 37) relates to a previously described example (e.g., one of the examples 31 to 36) or to any of the examples described herein, further comprising that the means for processing is configured to store the information on the memory recovery actions being taken by the memory controller using a Serial Presence Detect (SPD) hub (26) co-located with the memory circuitry.

Another example (e.g., example 38) relates to a previously described example (e.g., one of the examples 31 to 37) or to any of the examples described herein, further comprising that the means for processing is configured to store the information on the memory recovery actions being taken by the memory controller using a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry.

Another example (e.g., example 39) relates to a previously described example (e.g., one of the examples 31 to 38) or to any of the examples described herein, further comprising that the one or more memory recovery actions comprise one or more of a Post Package Repair (PPR) action, a Partial Cache Line Sparing (PCLS) action, and an Adaptive Double Device Data Correction (ADDDC) action.

Another example (e.g., example 40) relates to a previously described example (e.g., one of the examples 31 to 39) or to any of the examples described herein, further comprising that the means for processing is configured to calculate bit error detection information or bit error recovery information for the information on the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information.

Another example (e.g., example 41) relates to a previously described example (e.g., example 40) or to any of the examples described herein, further comprising that the means for processing is configured to calculate a cyclic redundancy check code for each of the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the cyclic redundancy check code.

Another example (e.g., example 42) relates to a previously described example (e.g., one of the examples 31 to 41) or to any of the examples described herein, further comprising that the means for processing is configured to determine parameters of the one or more memory recovery actions, and to store the information on the memory recovery actions with the parameters of the one or more memory recovery actions.

An example (e.g., example 43) relates to a computer system (100) comprising the device (10) according to one of the examples 31 to 42 (or according to any other example) and the memory controller (102).

Another example (e.g., example 44) relates to a previously described example (e.g., example 43) or to any of the examples described herein, further comprising that the computer system further comprises the memory circuitry (22).

Another example (e.g., example 45) relates to a previously described example (e.g., one of the examples 43 to 44) or to any of the examples described herein, further comprising that the computer system further comprises a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry (22), wherein the means for processing is configured to store the information on the memory recovery actions being taken by the memory controller using the HBM controller.

An example (e.g., example 46) relates to a method for persisting memory recovery actions, the method comprising determining (110) one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22). The method comprises storing (140) information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

Another example (e.g., example 47) relates to a previously described example (e.g., example 46) or to any of the examples described herein, further comprising that the method comprises loading (150) the information on the one or more memory recovery actions from the storage circuitry and configuring (170) the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry.

Another example (e.g., example 48) relates to a previously described example (e.g., example 47) or to any of the examples described herein, further comprising that the method comprises loading (150) the information on the one or more memory recovery actions from the storage circuitry after at least one of a power off event, a removable memory module migration event, a reset event, a memory persistent warm reset event and a suspend-to-random-access memory event.

Another example (e.g., example 49) relates to a previously described example (e.g., one of the examples 47 to 48) or to any of the examples described herein, further comprising that the method comprises evaluating (160) the loaded one or more memory recovery actions and altering (165) a memory recovery action taken based on the evaluation.
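
Purely as an illustration of the loading (150), evaluating (160), altering (165) and configuring (170) operations of examples 47 to 49, the flow may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from this disclosure: the `Action` record, the `FakeController` stand-in, the string labels for the action kinds, and the `promote_soft_ppr` policy (replacing a soft PPR by a hard PPR on reload) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Action:
    """Hypothetical record of one memory recovery action."""
    kind: str    # assumed labels: "sPPR", "hPPR", "PCLS", "ADDDC"
    target: int  # hypothetical opaque identifier of the affected location


class FakeController:
    """Stand-in for a memory controller; it only records what it is told to apply."""

    def __init__(self) -> None:
        self.applied: List[Action] = []

    def apply(self, action: Action) -> None:
        self.applied.append(action)


def promote_soft_ppr(action: Action) -> Action:
    """Example evaluation policy (an assumption): turn a soft PPR into a hard PPR."""
    if action.kind == "sPPR":
        return replace(action, kind="hPPR")
    return action


def restore(
    stored: Iterable[Action],
    controller: FakeController,
    evaluate: Callable[[Action], Action],
) -> List[Action]:
    """Replay previously persisted actions on the controller."""
    final: List[Action] = []
    for action in stored:          # loading (150): records read back from storage
        action = evaluate(action)  # evaluating (160) and, if needed, altering (165)
        controller.apply(action)   # configuring (170) the memory controller
        final.append(action)
    return final


if __name__ == "__main__":
    controller = FakeController()
    persisted = [Action("sPPR", 7), Action("ADDDC", 2)]
    print(restore(persisted, controller, promote_soft_ppr))
```

Passing the evaluation policy in as a callable keeps the replay loop independent of any particular criterion for altering an action, which the examples above leave open.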

Another example (e.g., example 50) relates to a previously described example (e.g., one of the examples 46 to 49) or to any of the examples described herein, further comprising that the memory circuitry is dynamic random-access memory of a removable memory module (20), the method comprising storing (140) the information on the memory recovery actions being taken by the memory controller to storage circuitry hosted by the removable memory module.

Another example (e.g., example 51) relates to a previously described example (e.g., example 50) or to any of the examples described herein, further comprising that the removable memory module is a Dual In-line Memory Module (DIMM).

Another example (e.g., example 52) relates to a previously described example (e.g., one of the examples 46 to 51) or to any of the examples described herein, further comprising that the method comprises storing the information on the memory recovery actions being taken by the memory controller using a Serial Presence Detect (SPD) hub (26) co-located with the memory circuitry.

Another example (e.g., example 53) relates to a previously described example (e.g., one of the examples 46 to 52) or to any of the examples described herein, further comprising that the method comprises storing (140) the information on the memory recovery actions being taken by the memory controller using a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry.

Another example (e.g., example 54) relates to a previously described example (e.g., one of the examples 46 to 53) or to any of the examples described herein, further comprising that the one or more memory recovery actions comprise one or more of a Post Package Repair (PPR) action, a Partial Cache Line Sparing (PCLS) action, and an Adaptive Double Device Data Correction (ADDDC) action.

Another example (e.g., example 55) relates to a previously described example (e.g., one of the examples 46 to 54) or to any of the examples described herein, further comprising that the method comprises calculating (130) bit error detection information or bit error recovery information for the information on the one or more memory recovery actions and storing (140) the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information.

Another example (e.g., example 56) relates to a previously described example (e.g., example 55) or to any of the examples described herein, further comprising that the method comprises calculating (130) a cyclic redundancy check code for each of the one or more memory recovery actions and storing (140) the information on the one or more memory recovery actions with the cyclic redundancy check code.
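
As a hedged illustration of calculating (130) a cyclic redundancy check code per memory recovery action and storing (140) it alongside the action information, a record could be serialized as sketched below. The record layout (one byte action type, one byte parameter length, the parameter bytes, then a little-endian CRC-32), the numeric action codes and the function names are assumptions made for this sketch; the disclosure does not prescribe a particular encoding or CRC polynomial. The sketch uses the CRC-32 provided by Python's standard `zlib` module.

```python
import struct
import zlib
from dataclasses import dataclass

# Hypothetical numeric codes; the disclosure names the actions but no encoding.
ACTION_PPR = 1
ACTION_PCLS = 2
ACTION_ADDDC = 3


@dataclass(frozen=True)
class RecoveryAction:
    action_type: int  # one of the hypothetical ACTION_* codes
    params: bytes     # opaque, action-specific parameters (layout is an assumption)


def encode_action(action: RecoveryAction) -> bytes:
    """Serialize one action and append a CRC-32 computed over header and parameters."""
    if len(action.params) > 255:
        raise ValueError("parameter block too long for a one-byte length field")
    body = struct.pack("<BB", action.action_type, len(action.params)) + action.params
    return body + struct.pack("<I", zlib.crc32(body))


def decode_action(blob: bytes) -> RecoveryAction:
    """Parse one record, rejecting it if the stored CRC-32 does not match."""
    if len(blob) < 6:  # 2 header bytes + 4 CRC bytes
        raise ValueError("record too short")
    body = blob[:-4]
    (stored_crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != stored_crc:
        raise ValueError("CRC mismatch: record is corrupted")
    action_type, length = struct.unpack("<BB", body[:2])
    params = body[2:]
    if len(params) != length:
        raise ValueError("parameter length field does not match record size")
    return RecoveryAction(action_type, params)


if __name__ == "__main__":
    record = encode_action(RecoveryAction(ACTION_PPR, bytes([0, 3, 0x12, 0x34])))
    print(decode_action(record))
```

Computing one code per action, rather than one code over the whole stored block, means that a single corrupted record can be discarded on load while the remaining records stay usable.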

Another example (e.g., example 57) relates to a previously described example (e.g., one of the examples 46 to 56) or to any of the examples described herein, further comprising that the method comprises determining (120) parameters of the one or more memory recovery actions and storing (140) the information on the memory recovery actions with the parameters of the one or more memory recovery actions.

Another example (e.g., example 58) relates to a previously described example (e.g., one of the examples 46 to 57) or to any of the examples described herein, further comprising that the method is performed by a computer system (100) comprising the memory controller (102).

Another example (e.g., example 59) relates to a previously described example (e.g., example 58) or to any of the examples described herein, further comprising that the method is performed by a computer system (100) comprising the memory controller (102) and the memory circuitry (22).

Another example (e.g., example 60) relates to a previously described example (e.g., one of the examples 58 to 59) or to any of the examples described herein, further comprising that the method comprises storing (140) the information on the memory recovery actions being taken by the memory controller using a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry (22).

An example (e.g., example 61) relates to a non-transitory machine-readable storage medium including program code, when executed, to cause a machine to perform the method of one of the examples 46 to 60 (or according to any other example).

An example (e.g., example 62) relates to a computer program having a program code for performing the method of one of the examples 46 to 60 (or according to any other example) when the computer program is executed on a computer, a processor, or a programmable hardware component.

An example (e.g., example 63) relates to a machine-readable storage including machine-readable instructions, when executed, to implement a method or realize an apparatus as claimed in any pending claim or shown in any example.

The aspects and features described in relation to a particular one of the previous examples may also be combined with one or more of the further examples to replace an identical or similar feature of that further example or to additionally introduce the features into the further example.

Examples may further be or relate to a (computer) program including a program code to execute one or more of the above methods when the program is executed on a computer, processor, or other programmable hardware component. Thus, steps, operations, or processes of different ones of the methods described above may also be executed by programmed computers, processors, or other programmable hardware components. Examples may also cover program storage devices, such as digital data storage media, which are machine-, processor- or computer-readable and encode and/or contain machine-executable, processor-executable or computer-executable programs and instructions. Program storage devices may include or be digital storage devices, magnetic storage media such as magnetic disks and magnetic tapes, hard disk drives, or optically readable digital data storage media, for example. Other examples may also include computers, processors, control units, (field) programmable logic arrays ((F)PLAs), (field) programmable gate arrays ((F)PGAs), graphics processor units (GPU), application-specific integrated circuits (ASICs), integrated circuits (ICs) or system-on-a-chip (SoCs) systems programmed to execute the steps of the methods described above.

It is further understood that the disclosure of several steps, processes, operations, or functions disclosed in the description or claims shall not be construed to imply that these operations are necessarily dependent on the order described, unless explicitly stated in the individual case or necessary for technical reasons. Therefore, the previous description does not limit the execution of several steps or functions to a certain order. Furthermore, in further examples, a single step, function, process, or operation may include and/or be broken up into several sub-steps, -functions, -processes or -operations.

If some aspects have been described in relation to a device or system, these aspects should also be understood as a description of the corresponding method. For example, a block, device or functional aspect of the device or system may correspond to a feature, such as a method step, of the corresponding method. Accordingly, aspects described in relation to a method shall also be understood as a description of a corresponding block, a corresponding element, a property or a functional feature of a corresponding device or a corresponding system.

As used herein, the term “module” refers to logic that may be implemented in a hardware component or device, software or firmware running on a processing unit, or a combination thereof, to perform one or more operations consistent with the present disclosure. Software and firmware may be embodied as instructions and/or data stored on non-transitory computer-readable storage media. As used herein, the term “circuitry” can comprise, singly or in any combination, non-programmable (hardwired) circuitry, programmable circuitry such as processing units, state machine circuitry, and/or firmware that stores instructions executable by programmable circuitry. Modules described herein may, collectively or individually, be embodied as circuitry that forms a part of a computing system. Thus, any of the modules can be implemented as circuitry. A computing system referred to as being programmed to perform a method can be programmed to perform the method via software, hardware, firmware, or combinations thereof.

Any of the disclosed methods (or a portion thereof) can be implemented as computer-executable instructions or a computer program product. Such instructions can cause a computing system or one or more processing units capable of executing computer-executable instructions to perform any of the disclosed methods. As used herein, the term “computer” refers to any computing system or device described or mentioned herein. Thus, the term “computer-executable instruction” refers to instructions that can be executed by any computing system or device described or mentioned herein.

The computer-executable instructions can be part of, for example, an operating system of the computing system, an application stored locally to the computing system, or a remote application accessible to the computing system (e.g., via a web browser). Any of the methods described herein can be performed by computer-executable instructions performed by a single computing system or by one or more networked computing systems operating in a network environment. Computer-executable instructions and updates to the computer-executable instructions can be downloaded to a computing system from a remote server.

Further, it is to be understood that implementation of the disclosed technologies is not limited to any specific computer language or program. For instance, the disclosed technologies can be implemented by software written in C++, C#, Java, Perl, Python, JavaScript, Adobe Flash, assembly language, or any other programming language. Likewise, the disclosed technologies are not limited to any particular computer system or type of hardware.

Furthermore, any of the software-based examples (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, ultrasonic, and infrared communications), electronic communications, or other such communication means.

The disclosed methods, apparatuses, and systems are not to be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed examples, alone and in various combinations and subcombinations with one another. The disclosed methods, apparatuses, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed examples require that any one or more specific advantages be present, or problems be solved.

Theories of operation, scientific principles, or other theoretical descriptions presented herein in reference to the apparatuses or methods of this disclosure have been provided for the purposes of better understanding and are not intended to be limiting in scope. The apparatuses and methods in the appended claims are not limited to those apparatuses and methods that function in the manner described by such theories of operation.

The following claims are hereby incorporated in the detailed description, wherein each claim may stand on its own as a separate example. It should also be noted that although in the claims a dependent claim refers to a particular combination with one or more other claims, other examples may also include a combination of the dependent claim with the subject matter of any other dependent or independent claim. Such combinations are hereby explicitly proposed, unless it is stated in the individual case that a particular combination is not intended. Furthermore, features of a claim should also be included for any other independent claim, even if that claim is not directly defined as dependent on that other independent claim.

Claims

1. An apparatus (10) for persisting memory recovery actions, the apparatus comprising interface circuitry (12), machine-readable instructions, and processing circuitry (14) to execute the machine-readable instructions to:

determine one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22); and

store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

2. The apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to load the information on the one or more memory recovery actions from the storage circuitry, and to configure the memory controller to use the loaded one or more memory recovery actions with respect to the memory circuitry.

3. The apparatus according to claim 2, wherein the machine-readable instructions comprise instructions to load the information on the one or more memory recovery actions from the storage circuitry after at least one of a power off event, a removable memory module migration event, a reset event, a memory persistent warm reset event and a suspend-to-random-access memory event.

4. The apparatus according to claim 2, wherein the machine-readable instructions comprise instructions to evaluate the loaded one or more memory recovery actions, and to alter a memory recovery action taken based on the evaluation.

5. The apparatus according to claim 1, wherein the memory circuitry is dynamic random-access memory of a removable memory module (20), the machine-readable instructions comprising instructions to store the information on the memory recovery actions being taken by the memory controller to storage circuitry hosted by the removable memory module.

6. The apparatus according to claim 5, wherein the removable memory module is a Dual In-line Memory Module (DIMM).

7. The apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to store the information on the memory recovery actions being taken by the memory controller using a Serial Presence Detect (SPD) hub (26) co-located with the memory circuitry.

8. The apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to store the information on the memory recovery actions being taken by the memory controller using a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry.

9. The apparatus according to claim 1, wherein the one or more memory recovery actions comprise one or more of a Post Package Repair (PPR) action, a Partial Cache Line Sparing (PCLS) action, and an Adaptive Double Device Data Correction (ADDDC) action.

10. The apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to calculate bit error detection information or bit error recovery information for the information on the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the bit error detection information or bit error recovery information.

11. The apparatus according to claim 10, wherein the machine-readable instructions comprise instructions to calculate a cyclic redundancy check code for each of the one or more memory recovery actions, and to store the information on the one or more memory recovery actions with the cyclic redundancy check code.

12. The apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to determine parameters of the one or more memory recovery actions, and to store the information on the memory recovery actions with the parameters of the one or more memory recovery actions.

13. A computer system (100) comprising the apparatus (10) according to one of the claims 1 to 12 and the memory controller (102).

14. The computer system according to claim 13, wherein the computer system further comprises the memory circuitry (22).

15. The computer system according to claim 13, wherein the computer system further comprises a High Bandwidth Memory (HBM) controller (104) co-located with the memory circuitry (22), wherein the machine-readable instructions comprise instructions to store the information on the memory recovery actions being taken by the memory controller using the HBM controller.

16. An apparatus (10) for persisting memory recovery actions, the apparatus comprising processing circuitry (14) configured to:

determine one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22); and

store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

17. A device (10) for persisting memory recovery actions, the device comprising means for processing (14) configured to:

determine one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22); and

store information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

18. A method for persisting memory recovery actions, the method comprising:

determining (110) one or more memory recovery actions taken by a memory controller (102) with respect to memory circuitry (22); and

storing (140) information on the one or more memory recovery actions being taken by the memory controller to storage circuitry (24) being co-located with the memory circuitry.

19. A non-transitory machine-readable storage medium including program code, when executed, to cause a machine to perform the method of claim 18.