CN117093390A - System for processing faulty page, method for processing faulty page, and host device

Info

Publication number: CN117093390A
Application number: CN202310491026.XA
Authority: CN
Prior art keywords: memory, page, host, fault, scalable
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 金钟民, 奇亮奭
Assignee: Samsung Electronics Co Ltd (original and current)
Priority claimed from: U.S. application No. 17/845,679 (published as US 12019503 B2)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0766: Error or fault reporting or storing


Abstract

A system for processing a failed page, a method of processing a failed page, and a host device are provided. The system includes: a host processor; a host memory connected to the host processor through a first memory interface; and a scalable memory pool connected to the host processor through a second memory interface different from the first memory interface. The host memory includes instructions that, when executed by the host processor, cause the host processor to: detect an error in a target page of a first memory device of the scalable memory pool; generate an interrupt in response to detecting the error; store fault page information corresponding to the target page of the first memory device in a fault page log; and change a state of the target page of the first memory device from a first state to a second state according to the fault page log.

Description

System for processing faulty page, method for processing faulty page, and host device
The present application claims priority to and the benefit of U.S. Provisional Application No. 63/343,410, entitled "UE (uncorrectable error) handling for CXL memory," filed on May 18, 2022, the entire content of which is incorporated herein by reference.
Technical Field
One or more embodiments of the present disclosure relate to scalable memory, and more particularly, to error handling for scalable memory.
Background
A Machine Check Exception (MCE) is a type of hardware error that occurs when the Central Processing Unit (CPU) of a system detects an error in memory, an I/O device, the system bus, the processor itself, or the like. A correctable memory error is typically a single-bit error that the system can correct and that generally does not cause system failure or data corruption. An uncorrectable memory error, on the other hand, is typically a multi-bit error indicating a critical or fatal event in the memory itself, often caused by a failure in a memory module (e.g., a memory chip) that software/firmware cannot correct.
The above information disclosed in this Background section is only for enhancement of understanding of the background of the present disclosure, and therefore it may contain information that does not constitute prior art.
Disclosure of Invention
One or more embodiments of the present disclosure are directed to error handling for scalable memory, and more particularly, to generating fault page information that can be persistently stored and used to automatically offline a failed page of a scalable memory device.
One or more embodiments of the present disclosure are directed to sharing fault page information among multiple host devices to automatically take a failed page of a scalable memory device offline.
In accordance with one or more embodiments of the present disclosure, a system for processing a failed page includes: a host processor; a host memory connected to the host processor through a first memory interface; and a scalable memory pool connected to the host processor through a second memory interface different from the first memory interface. The host memory includes instructions that, when executed by the host processor, cause the host processor to: detect an error in a target page of a first memory device of the scalable memory pool; generate an interrupt in response to detecting the error; store fault page information corresponding to the target page of the first memory device in a fault page log; and change a state of the target page of the first memory device from a first state to a second state according to the fault page log.
In one embodiment, the second memory interface may include a peripheral component interconnect express (PCIe) interface and a Compute Express Link (CXL) interconnect.
In one embodiment, the first memory interface may include a dual in-line memory module (DIMM) interface.
In one embodiment, the scalable memory pool may include at least two different types of Compute Express Link (CXL) memory devices.
In one embodiment, the instructions may further cause the host processor to: perform a restart; read the fault page log to identify one or more failed pages of the scalable memory pool; and set the second state of the one or more failed pages based on the fault page log.
In one embodiment, the instructions may further cause the host processor to: receive a log request for the fault page log from a guest host processor, the guest host processor being configured to access the scalable memory pool; and send the fault page log to the guest host processor in response to the log request. The guest host processor may be configured to set the second state of one or more pages of the scalable memory pool based on the fault page log.
In one embodiment, the instructions may further cause the host processor to: receive an update from a first guest host processor that detected an error in a second memory device of the scalable memory pool; identify a failed page of the second memory device based on the update; update the fault page log; and set the second state of the failed page of the second memory device based on the updated fault page log.
In one embodiment, the instructions may further cause the host processor to: broadcast the updated fault page log to a second guest host processor configured to access the scalable memory pool.
In one embodiment, the error may be a multi-bit error in the target page of the first memory device of the scalable memory pool, and the fault page information may include physical device information of the target page of the first memory device.
In accordance with one or more embodiments of the present disclosure, a method of processing a failed page includes: detecting, by a kernel of a first host device, an error in a target page of a first memory device of a scalable memory pool; generating, by the kernel, an interrupt in response to detecting the error; storing, by a device driver corresponding to the first memory device, fault page information corresponding to the target page of the first memory device in a fault page log; and changing, by a fault page log (FPL) daemon, a state of the target page of the first memory device from a first state to a second state according to the fault page log.
In one embodiment, the first memory device of the scalable memory pool may be connected to the first host device via a peripheral component interconnect express (PCIe) interface and a Compute Express Link (CXL) interconnect.
In one embodiment, the scalable memory pool may include at least two different types of Compute Express Link (CXL) memory devices.
In one embodiment, the method may further comprise: performing a reboot by the kernel; reading, by the FPL daemon, the fault page log to identify one or more fault pages of the scalable memory pool; and setting, by the FPL daemon, a second state of the one or more failed pages according to the failed page log.
In one embodiment, the method may further comprise: receiving, by the FPL daemon of the first host device, a log request for the fault page log from an FPL daemon of a second host device, the second host device being configured to access the scalable memory pool; and transmitting, by the FPL daemon of the first host device, the fault page log to the second host device in response to the log request. The FPL daemon of the second host device may be configured to set the second state of one or more failed pages of the scalable memory pool based on the fault page log.
In one embodiment, the method may further comprise: in response to the second host device detecting an error in the second memory device in the scalable memory pool, receiving, by the FPL daemon of the first host device, an update from the FPL daemon of the second host device; identifying, by the FPL daemon of the first host device, a failed page of the second memory device according to the update; updating, by the FPL daemon of the first host device, the fault page log; and setting, by the FPL daemon of the first host device, a second state of the failed page of the second memory device based on the updated failed page log.
In one embodiment, the method may further comprise: broadcasting, by the FPL daemon of the first host device, the updated fault page log to a third host device configured to access the scalable memory pool.
In one embodiment, the error may be a multi-bit error in the target page of the first memory device of the scalable memory pool, and the fault page information may include physical device information of the target page of the first memory device.
According to one or more embodiments of the present disclosure, a host device includes: a root complex connected to a scalable memory pool through a memory interface and configured to parse packets received from a memory device of the scalable memory pool; a kernel configured to detect, from the parsed packets, an error bit corresponding to a failed page of a first memory device, and to generate an interrupt in response to detecting the error bit; a driver of the first memory device configured to store, in response to the interrupt, fault page information corresponding to the failed page of the first memory device in a fault page log; and a fault page log (FPL) daemon configured to change a state of the failed page from a first state to a second state based on the fault page log.
In one embodiment, the scalable memory pool may include at least two different types of Compute Express Link (CXL) memory devices.
In one embodiment, in response to a restart, the FPL daemon may be configured to: read the fault page log to identify one or more failed pages of the scalable memory pool, and set the second state of the one or more failed pages according to the fault page log.
Drawings
The above and other aspects and features of the present disclosure will be more clearly understood from the following detailed description of illustrative, non-limiting embodiments with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram of a scalable memory system in accordance with one or more embodiments of the present disclosure.
FIG. 2 is a schematic block diagram of a host device of a scalable memory system in accordance with one or more embodiments of the present disclosure.
FIG. 3 is a flow diagram of a method of generating a fault page log for a scalable memory device in accordance with one or more embodiments of the present disclosure.
FIG. 4 is a flow diagram of a method of taking a failed page of a scalable memory device offline after a system restart in accordance with one or more embodiments of the present disclosure.
FIG. 5 is a schematic block diagram of a scalable memory system in accordance with one or more embodiments of the present disclosure.
FIG. 6 is a flow diagram of a method of sharing fault page information of a scalable memory device in accordance with one or more embodiments of the present disclosure.
FIG. 7 is a schematic block diagram of a scalable memory system in accordance with one or more embodiments of the present disclosure.
FIG. 8 is a flow diagram of a method of updating fault page information of a scalable memory device in accordance with one or more embodiments of the present disclosure.
Detailed Description
Hereinafter, embodiments will be described in more detail with reference to the drawings, wherein like reference numerals denote like elements throughout. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided as examples so that this disclosure will be thorough and complete, and will fully convey the aspects and features of the disclosure to those skilled in the art. Thus, processes, elements and techniques not necessary for a person of ordinary skill in the art to fully understand aspects and features of the present disclosure may not be described. Unless otherwise indicated, like reference numerals refer to like elements throughout the drawings and written description, and thus, their redundant description may not be repeated.
In general, because scalable memory is typically not used for pages critical to the operation of the system (such as kernel pages, execution pages, non-relocatable pages, etc.), uncorrectable errors (e.g., multi-bit errors) in scalable memory may have less impact on system stability. Thus, when an uncorrectable error occurs in a page of scalable memory, the likelihood of a system crash may be low, and the page may simply be taken offline. In other words, the state of the page may be changed from a first state (e.g., online or available) to a second state (e.g., offline or unavailable), such that any applications and/or processes accessing the page may be forced to shut down (e.g., may be terminated).
Typically, a system main memory (e.g., host memory) may be used both for pages critical to the operation of the system (such as kernel pages, application execution pages, non-relocatable pages, etc.) and for pages for processing data (such as text pages, file pages, anonymous pages, movable pages, etc.). Thus, to ensure system stability and prevent system crashes, when an uncorrectable error occurs in a page of system main memory, the system may be shut down so that a user (e.g., an administrator) may replace the memory device (e.g., a dynamic random access memory (DRAM) device or chip) in which the uncorrectable error occurred. Accordingly, an error log, such as a machine check exception (MCE) log, may simply contain error messages and some basic information sufficient for the user to replace the system main memory device in which the uncorrectable error occurred.
On the other hand, scalable memory is generally not used for critical pages, but only for processing data. Thus, unlike the case of system main memory, when an uncorrectable error occurs in a page of scalable memory, the page may be taken offline (e.g., the page may become unavailable) and any applications and/or processes accessing that page may be forced to shut down (e.g., may be terminated), while the other pages of the scalable memory may continue to be used. Accordingly, unlike the system main memory, which may be replaced when an uncorrectable error occurs in it, the usable lifetime of a scalable memory device may be extended or maximized by taking failed pages offline, and costs may thereby be reduced.
However, in general, the system processor (e.g., host processor) may not keep track of offline pages of scalable memory, and thus, when the system is restarted, the failed page may be memory-mapped again. Further, because error logs typically do not include physical device information (e.g., device serial number, device type, and device physical address) for failed pages of the scalable memory, offline pages may not be shared among different host processors. For example, because different systems may have different system maps, the error log of one host processor may be meaningless to another host processor. Thus, even if a failed page has been taken offline by one host processor, a different host processor may still memory-map the failed page of the scalable memory.
According to one or more embodiments of the present disclosure, fault page information for a failed page in a scalable memory device may be generated and persistently stored in a fault page log (FPL), such that, even after a system reboot, the FPL may be used to automatically offline any failed page of the scalable memory device before any memory mapping to the failed page can occur. Thus, the user experience may be improved and costs may be reduced.
In accordance with one or more embodiments of the present disclosure, the FPL may include at least physical device information (e.g., device serial number, device type, device physical address, etc.) of the failed page, which remains relatively consistent, unlike a logical address that may change with the system's memory map. Thus, even after a system restart, hardware change, or the like in which the host physical address may change, the FPL may be used to automatically take the failed page offline before any memory mapping to the failed page can occur.
For example, the host physical address may change when the scalable memory is inserted into a different slot of a host device, or when a scalable memory expansion card is inserted into a slot of a different host device. In such cases, the FPL, which includes at least the physical device information of the scalable memory, enables remapping from the device physical address to the new host physical address. In some embodiments, when no hardware change is made, the host physical address may simply be reused to take the failed page offline.
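To make the remapping concrete, the following is a minimal C sketch of how a daemon might recompute the host physical address (HPA) for a logged device physical address (DPA) once the device's host-managed device memory (HDM) window is known. The struct layout, the window parameters, and the single-window assumption are illustrative, not the patent's implementation.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One host-managed device memory (HDM) window for a CXL device. */
struct hdm_window {
    uint64_t hpa_base; /* where the window starts in host address space */
    uint64_t dpa_base; /* device-local offset the window maps from */
    uint64_t size;     /* window length in bytes */
};

/* Return the current HPA for a logged DPA, or 0 if the DPA is unmapped. */
static uint64_t dpa_to_hpa(const struct hdm_window *w, uint64_t dpa)
{
    if (dpa < w->dpa_base || dpa - w->dpa_base >= w->size)
        return 0;
    return w->hpa_base + (dpa - w->dpa_base);
}

int main(void)
{
    /* The same DPA resolves to different HPAs before and after a re-slot. */
    struct hdm_window before = { 0x2000000000ULL, 0, 1ULL << 34 };
    struct hdm_window after  = { 0x3800000000ULL, 0, 1ULL << 34 };
    uint64_t faulty_dpa = 0x12345000ULL;

    printf("old HPA: 0x%" PRIx64 "\n", dpa_to_hpa(&before, faulty_dpa));
    printf("new HPA: 0x%" PRIx64 "\n", dpa_to_hpa(&after, faulty_dpa));
    return 0;
}
```

Because the DPA (together with the device serial number) is what the FPL persists, the offline step survives any change to the window's host base address.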
In accordance with one or more embodiments of the present disclosure, because the FPL may include at least the physical device information of the failed page, the FPL may be shared among multiple host devices (e.g., host processors), such that each of the multiple host processors may take the failed page offline in its own system map, and thus, any memory mapping to the failed page may be avoided. Accordingly, system reliability and user experience may be improved.
The above and/or other aspects and features of the present disclosure will be described in more detail below with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram of a scalable memory system in accordance with one or more embodiments of the present disclosure.
Referring to fig. 1, a host device 102 may include a host operating system (OS)/kernel 104, a host processor 106, a host memory 108, and a storage device 110. The host OS/kernel 104 may include system software for providing interfaces between hardware and users, as well as between software applications and hardware. For example, the host OS/kernel 104 may be configured for resource allocation, memory management, CPU management, file management, execution of processes, and the like, of the host device 102. For example, in some embodiments, the host OS/kernel 104 may be a Linux OS/kernel, but the present disclosure is not limited thereto, and the host OS/kernel 104 may be any suitable OS/kernel known to those skilled in the art (such as, for example, a Windows OS, an Apple OS (e.g., macOS), Chrome OS, etc.).
The host processor 106 may be a processing circuit (such as, for example, a general purpose processor or a Central Processing Unit (CPU) core) of the host device 102. The host processor 106 may be connected to other components via an address bus, a control bus, a data bus, etc. Host processor 106 may execute instructions stored in host memory 108 to perform various operations described herein. For example, the host processor 106 may execute one or more system processes and background processes (described in more detail below) that may be copied from persistent storage (e.g., storage 110, read Only Memory (ROM), etc.) to the host memory 108 as needed or desired (e.g., at startup, execution time, interrupt routine, etc.).
Host memory 108 may be considered the high-performance main memory (e.g., primary memory) of the host device 102. For example, in some embodiments, the host memory 108 may include (or may be) volatile memory (such as, for example, dynamic random access memory (DRAM)) that is directly connected to a memory slot of a motherboard of the host device 102 via the first memory interface 112. In this case, the first memory interface 112 (e.g., the connector and its protocol) may include (or may conform to) a dual in-line memory module (DIMM) interface to facilitate communication between the host memory 108 and the host processor 106 (e.g., via the host OS/kernel 104), such that the host memory 108 may be DIMM memory connected to a DIMM socket of the host device 102. However, the present disclosure is not limited thereto, and the host memory 108 may include (or may be) any suitable high-performance main memory (e.g., primary memory) replacement for the host device 102 known to those skilled in the art. For example, in other embodiments, the host memory 108 may be a relatively high-performance non-volatile memory, such as NAND flash memory, phase-change memory (PCM), resistive RAM, spin-transfer torque RAM (STT-RAM), or any suitable memory based on PCM technology, memristor technology, and/or resistive random access memory (ReRAM), and may include, for example, chalcogenides, and the like.
The storage device 110 may be considered secondary memory (e.g., a secondary storage device) that may persistently store data accessible by the host device 102. In this context, the storage device 110 may include (or may be) relatively slower memory compared to the high-performance main memory of the host memory 108. For example, in some embodiments, the storage device 110 may be a solid-state drive (SSD). However, the present disclosure is not limited thereto, and in other embodiments, the storage device 110 may include (or may be) any suitable storage device, such as, for example, a magnetic storage device (e.g., a hard disk drive (HDD), etc.), an optical storage device (e.g., a Blu-ray disc drive, a compact disc (CD) drive, a digital versatile disc (DVD) drive, etc.), other kinds of flash memory devices (e.g., a USB flash drive, etc.), and the like. In various embodiments, the storage device 110 may conform to a large form factor standard (e.g., a 3.5-inch hard drive form factor), a small form factor standard (e.g., a 2.5-inch hard drive form factor), an M.2 form factor, an E1.S form factor, or the like. In other embodiments, the storage device 110 may conform to any suitable or desired derivative of these form factors.
The storage device 110 may be connected to the host processor 106 via a storage interface. The storage interface (e.g., the connector and its protocol) may facilitate communication between the host processor 106 and the storage device 110 (e.g., via the host OS/kernel 104). In some embodiments, the storage interface may facilitate the exchange of storage requests and responses between the host processor 106 and the storage device 110, as well as data transfers by the storage device 110 to and from the host memory 108 of the host device 102. For example, in various embodiments, the storage interface (e.g., the connector and its protocol) may include (or may conform to) Small Computer System Interface (SCSI), Non-Volatile Memory Express (NVMe), Peripheral Component Interconnect Express (PCIe), remote direct memory access (RDMA) over Ethernet, Serial Advanced Technology Attachment (SATA), Fibre Channel, Serial Attached SCSI (SAS), NVMe over fabrics (NVMe-oF), and the like. In other embodiments, the storage interface (e.g., the connector and its protocol) may include (or may conform to) various general-purpose interfaces (such as, for example, Ethernet, Universal Serial Bus (USB), etc.).
Still referring to fig. 1, the host device 102 is connected to a scalable memory pool 114 via a second memory interface 116 that is different from the first memory interface 112. The scalable memory pool 114 may include one or more scalable memory devices 118a, 118b, and 118c (collectively, 118), such as, for example, one or more Compute Express Link (CXL) memory devices 118a (CXL_Memory1), 118b (CXL_Memory2), and 118c (CXL_Memory3). In some embodiments, the scalable memory pool 114 may be a disaggregated CXL memory pool including a plurality of different types of CXL memory devices 118a, 118b, and 118c, which may generally include volatile memory (such as, for example, DDR3, DDR4, DDR5, low-power or low-profile variants thereof, persistent memory (PMEM), high-bandwidth memory (HBM), the DRAM of SSDs equipped with DRAM, etc.). However, the present disclosure is not limited thereto, and similar to the examples described above for the host memory 108, the scalable memory pool 114 may include (or may be) any suitable high-performance scalable memory for the host device 102 known to those skilled in the art. For example, the scalable memory pool 114 may include at least two different types of CXL memory devices.
In some embodiments, the second memory interface 116 may include a Peripheral Component Interconnect Express (PCIe) interface and a CXL interconnect. For example, the second memory interface 116 (e.g., the connector and its protocol) may include (or may conform to) a CXL interconnect built on PCIe to facilitate communication (e.g., via the host OS/kernel 104) between the host processor 106 and the memory devices 118a, 118b, and 118c of the scalable memory pool 114. In this case, each of the memory devices 118a, 118b, and 118c may be connected to a PCIe slot of the host device 102 as a PCIe device. In other embodiments, the second memory interface 116 (e.g., the connector and its protocol) may include (or may conform to) various general-purpose interfaces (such as, for example, Ethernet, Universal Serial Bus (USB), etc.). Although fig. 1 shows one host device 102 connected to the scalable memory pool 114, the present disclosure is not limited thereto, and a plurality of host devices 102 may be connected to the scalable memory pool 114 (see, e.g., fig. 5 and 7).
As described above, both the host memory 108 and the scalable memory pool 114 may be used as high-performance main memory (e.g., primary memory) of the host device 102, such that both are available for data processing (e.g., for temporarily storing data to be processed by the host processor 106, such as text pages, anonymous pages, file pages, movable pages, etc. (see, e.g., fig. 2)). However, while the host memory 108 may also be used for critical pages (such as OS kernel pages, application execution pages, non-relocatable pages, etc.), the scalable memory pool 114 may not be used for such critical pages. Thus, when an uncorrectable error occurs in a page of the host memory 108, the host memory 108 may be replaced to prevent or substantially prevent a system crash. Unlike the host memory 108, when an uncorrectable error occurs in a page of a scalable memory device 118 of the scalable memory pool 114, the page may simply be taken offline (e.g., the state of the page may be changed from a first state (e.g., an online or available state) to a second state (e.g., an offline or unavailable state)), such that any processes and/or applications accessing the page may be terminated (e.g., forcibly shut down). In other words, rather than being replaced as with the host memory 108, when an uncorrectable error occurs in a page of the scalable memory pool 114, the page may be hard-offlined, as understood by one of ordinary skill in the art, such that the failed page is no longer available for memory mapping. Thus, the usable lifetime of the scalable memory devices 118 of the scalable memory pool 114 may be increased or maximized, and costs may thereby be reduced. As used herein, an uncorrectable error refers to a multi-bit error (e.g., a 2-bit error) on the same cache line that cannot be corrected by system software/firmware, whereas a correctable error refers to a single-bit error that is typically corrected by the system.
It should be noted that correctable errors in pages of the scalable memory pool 114 may be handled in the same or substantially the same manner as correctable errors that occur in pages of the host memory 108. For example, when a correctable error occurs, the data of the page having the correctable error may be migrated to another page, and the page having the correctable error may be soft-offlined, as understood by one of ordinary skill in the art, such that any application or process accessing the failed page may be remapped to the migrated page. However, in some embodiments, the embodiments described in more detail below may also be extended to soft-offlined pages (e.g., by persistently storing the fault page information of the soft-offlined pages in the fault page log). In this case, the fault page log may be used (e.g., at start-up of the system 100, or after a restart) to take such pages offline as needed or desired. For convenience, the embodiments described in more detail below are described in the context of hard-offlining a page in response to an uncorrectable error (e.g., a 2-bit error on the same cache line), but the present disclosure is not limited thereto, and at least some of the embodiments described herein may also be applicable to offlining pages in response to correctable errors.
FIG. 2 is a schematic block diagram of a host device of a scalable memory system in accordance with one or more embodiments of the present disclosure.
Referring to fig. 2, the host device 102 is connected to the scalable memory device 118 of the scalable memory pool 114 via the second memory interface 116. For example, in some embodiments, the scalable memory device 118 may be connected (e.g., via the second memory interface 116) to a port (e.g., PCIe port) of the root complex 202 of the host device 102, such that the scalable memory device 118 may be considered a PCIe device. In this case, the root complex 202 may connect the processor 206 (e.g., the host processor 106 in fig. 1) to the scalable memory device 118 to generate a transaction request to the scalable memory device 118 on behalf of the processor 206. The root complex 202 may be implemented as an integrated circuit or the functionality of the root complex 202 may be implemented as part of the processor 206 (e.g., as instructions stored in the memory 208 and executed by the processor 206).
In short, when an uncorrectable error occurs in a target page of the scalable memory device 118, the scalable memory device may set an error bit (e.g., a poison bit) in a target packet (e.g., in a transaction layer packet) and provide the target packet to the root complex 202. The root complex 202 parses the target packet and sends the parsed target packet, including the error bit, to the processor 206 (e.g., the host processor 106). The processor 206 generates an interrupt based on the error bit and persistently stores fault page information, including at least physical device information (e.g., device serial number, device type, device physical address, etc.) of the target page, in a fault page log (FPL) 222. The processor 206 may take the target page offline according to the FPL 222 and may terminate any process or application accessing the target page. The FPL 222 may be persistently stored in a persistent storage 218 (e.g., the storage device 110 in fig. 1, etc.), such that the fault page information is retained in the FPL 222 even after a system reboot. Thus, in some embodiments, when the system is restarted, the FPL 222 may be read for each of the scalable memory devices 118 in the scalable memory pool 114, such that the failed pages identified in the FPL 222 may be automatically taken offline before any memory mapping to the failed pages occurs, and thus, error logs (e.g., MCE logs) may be reduced and the user experience may be improved.
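As a concrete illustration, a single FPL entry might look like the following C struct. The field names and sizes are assumptions; the patent only requires that the physical device information (serial number, device type, device physical address) be persisted.

```c
#include <stdint.h>
#include <time.h>

/* Page states mirroring the first/second states described above. */
enum fpl_page_state {
    FPL_PAGE_ONLINE  = 0, /* first state: available for memory mapping */
    FPL_PAGE_OFFLINE = 1  /* second state: withdrawn from use */
};

/* One persistent fault page log (FPL) record (illustrative layout). */
struct fpl_record {
    char     dev_serial[32]; /* device serial number: stable across reboots */
    char     dev_type[16];   /* device type, e.g., a CXL Type 3 device */
    uint64_t dev_phys_addr;  /* device physical address (DPA) of the page */
    uint64_t host_phys_addr; /* last known HPA; a hint only, may change */
    time_t   logged_at;      /* when the uncorrectable error was recorded */
    uint8_t  state;          /* an enum fpl_page_state value */
};
```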
In more detail, the host device 102 may include a root complex 202, processing circuitry 204, and a persistent storage 218 (e.g., storage 110, etc.). The root complex 202 may connect the processing circuitry 204 (e.g., via a local bus, etc.) to the scalable memory device 118 via the second memory interface 116. For example, as discussed above, the second memory interface 116 (e.g., connector and its protocol) may include (e.g., may conform to) a CXL interconnect established over a peripheral component interconnect express (PCIe), such that the scalable memory device 118 may be a PCIe device connected to a PCIe port of the root complex 202. Although fig. 2 shows root complex 202 as being separate from processing circuitry 204, the present disclosure is not so limited, and in some embodiments root complex 202 may be implemented as part of processing circuitry 204 (e.g., as an integrated circuit or as part of processor 206).
The processing circuitry 204 includes one or more processors 206 (e.g., which may include the host processor 106 in fig. 1) and memory 208 (e.g., host memory 108, ROM, etc.). The processing circuitry 204 may be connected to (or may include) the root complex 202 such that the processing circuitry 204 and its various components may send and receive data with the scalable memory device 118 via the root complex 202. The processor 206 may be implemented with a general purpose processor such as a central processing unit (e.g., CPU), application Specific Integrated Circuit (ASIC), one or more Field Programmable Gate Arrays (FPGAs), a Digital Signal Processor (DSP), a set of processing components, or other suitable electronic processing components capable of executing instructions (e.g., via firmware and/or software). The processing circuitry 204 and the processor 206 may be housed in a single geographic location or device, or may be distributed across various geographic locations or devices.
Memory 208 (e.g., one or more memory devices and/or memory units) may include tangible, non-transitory, volatile memory, or non-volatile memory such as RAM (e.g., DRAM), ROM, NVRAM, or flash memory. The memory 208 may be communicatively connected to the processor 206 via the processing circuitry 204 and include data and/or computer code for facilitating at least some of the various processes described herein (e.g., by the processing circuitry 204 and/or the processor 206). For example, memory 208 may include database components, object code components, script components, and/or any other type of information or data structures for supporting the various activities and information or data structures described in the present application. The memory 208 stores instructions or programming logic that, when executed by the processor 206, control various operations of the host device 102 described herein.
As shown in fig. 2, memory 208 may include an OS kernel 210, a Machine Check Exception (MCE) log daemon 212, a device driver 214, and a Fault Page Log (FPL) daemon 216 that may correspond to various different instructions that may be copied from persistent storage (e.g., storage 110, ROM, etc.) to memory 208 as needed or desired (e.g., at execution time, after an interrupt, after a restart, etc.). For example, the OS kernel 210 may include various system software to provide an interface between software applications and hardware (e.g., CPU, memory, storage, etc.), and the device driver 214 may include a device driver for each of the scalable memory devices 118 of the scalable memory pool 114. MCE log daemon 212 and FPL daemon 216 may include (or may be) various background processes that may be invoked, for example, in response to an interrupt or after a restart.
The OS kernel 210 may detect machine check exceptions (MCEs) from various hardware (such as the host processor 106, the host memory 108, the storage device 110, the scalable memory devices 118, etc.) and may provide some error information to a user (e.g., a system administrator) via an error log or a system console. In the event that the MCE corresponds to an uncorrectable error detected in a page of the host memory 108, if the page is critical to the system, the OS kernel 210 may shut down the system to prevent a system crash, and in this case, typically nothing can be recorded. In some embodiments, the MCE log daemon 212, which may be a third-party user application, may also be included to provide some additional information about the detected MCE (e.g., the host physical address (if supported), memory mapping information, etc.), and may store the additional information in, for example, an MCE log 220. However, the MCE log 220 is primarily for the host memory 108, and because the scalable memory devices 118 may simply be considered memory expansions attached to PCIe slots of the host device 102, the MCE log 220 may not record complete information about uncorrectable errors in failed pages of the scalable memory devices 118. In other words, the MCE log 220 may not include fault page information (e.g., physical device information) of the scalable memory. Thus, the MCE log 220 may simply store information identifying a memory device of the host memory 108 that needs replacement, which may include the physical address of a failed page of the host memory 108 (if supported); but because all of the scalable memory devices 118 attached via CXL/PCIe/network may be considered memory expansions, the MCE log 220 may not be sufficient to store fault page information for uncorrectable errors in the scalable memory devices 118 of the scalable memory pool 114.
According to one or more embodiments of the present disclosure, when an MCE corresponds to an uncorrectable error detected in a page of a scalable memory device 118, fault page information (e.g., physical device information, such as device serial number, device type, device physical address, etc.) may be persistently stored in the FPL 222 and may be used to automatically offline the failed page in the scalable memory device 118, even in the event of a hardware configuration change and/or a server change. For example, if the scalable memory device 118 is moved from slot 1 to slot 2, the host-managed device memory (HDM) range may change, and such a change may not be tracked by the scalable memory device 118. On the other hand, because the FPL 222 persistently stores the fault page information, even in the event of such a change, this information may be used to offline the failed page in the scalable memory device 118.
In more detail, when the OS kernel 210 detects an MCE corresponding to a target page of the scalable memory device 118 (e.g., based on the parsed error bit from the root complex 202), the OS kernel 210 may generate an interrupt for an application or process accessing the target page of the scalable memory device 118 and may invoke the device driver 214 of the scalable memory device 118 to handle the interrupt. The device driver 214 of the scalable memory device 118 may include an Advanced Error Reporting (AER) handler to process, in response to the interrupt, the MCE detected in the scalable memory device 118. For example, if the MCE corresponds to an uncorrectable error (e.g., a 2-bit error on the same cache line) in a target page of the scalable memory device 118, the AER handler of the scalable memory device 118 may generate fault page information, including at least physical device information (e.g., device serial number, device type, device physical address, etc.) of the target page of the scalable memory device 118, and may persistently store it in the FPL 222. Thus, after a reboot, or even in the event of a hardware configuration change, because the physical device information remains relatively constant, the fault page information stored in the FPL 222 may be used to identify the failed pages of the scalable memory device 118 that may need to be taken offline. For example, the AER handler of the scalable memory device 118 may launch the FPL daemon 216 to change the state of the failed page of the scalable memory device 118 from a first state (e.g., an online or available state) to a second state (e.g., an offline or unavailable state) to offline the target page according to the fault page information persistently stored in the FPL 222.
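The following userspace-style sketch (not actual kernel AER code) illustrates that interrupt path: append a record to the persistent FPL, then wake the daemon. The log path, the pidfile, the signal choice, and the reuse of struct fpl_record from the earlier sketch are all assumptions.

```c
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
/* struct fpl_record and enum fpl_page_state as sketched above */

#define FPL_PATH "/var/lib/fpl/fpl.log" /* assumed persistent log location */
#define FPLD_PID "/run/fpld.pid"        /* assumed FPL daemon pidfile */

/* Append one record; appending preserves earlier fault history. */
static int fpl_append(const struct fpl_record *rec)
{
    FILE *f = fopen(FPL_PATH, "ab");
    if (!f)
        return -1;
    size_t n = fwrite(rec, sizeof(*rec), 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Nudge the FPL daemon so it re-reads the log and offlines the page. */
static void fpld_wake(void)
{
    FILE *f = fopen(FPLD_PID, "r");
    int pid;
    if (f) {
        if (fscanf(f, "%d", &pid) == 1)
            kill((pid_t)pid, SIGUSR1);
        fclose(f);
    }
}

/* Called from the error path with fields decoded from the poisoned TLP. */
void aer_handle_uncorrectable(const char *serial, uint64_t dpa)
{
    struct fpl_record rec = { .dev_phys_addr = dpa,
                              .logged_at     = time(NULL),
                              .state         = FPL_PAGE_ONLINE };
    snprintf(rec.dev_serial, sizeof(rec.dev_serial), "%s", serial);
    snprintf(rec.dev_type, sizeof(rec.dev_type), "CXL Type 3");
    if (fpl_append(&rec) == 0)
        fpld_wake();
}
```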
In some embodiments, because the fault page information may be persistently stored in the FPL 222, the host device 102 may also provide an interface (e.g., an application programming interface (API)) to a user (e.g., a system administrator) to enable the user to insert or delete fault page information in the FPL 222 for debugging purposes and/or for reliability, availability, and serviceability (RAS) feature compliance testing. For example, because the fault page information is used to automatically take the page offline after a system restart, hardware change, etc., the failed page may not be accessible after a system restart unless its fault page information is removed from the FPL 222. Thus, in some embodiments, the API may allow a user to remove a failed page from the FPL 222, such that the failed page may be accessed for testing purposes even after a system restart or after replacement of the scalable memory device 118.
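A sketch of what the delete side of such an API might do, reusing FPL_PATH and struct fpl_record from the sketches above: rewrite the log without the matching record so the page is reachable again after the next restart. The matching rule (serial number plus DPA) is an assumption.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
/* FPL_PATH and struct fpl_record as sketched above */

/* Remove one record (e.g., for RAS compliance testing); 0 on success. */
int fpl_delete(const char *serial, uint64_t dpa)
{
    FILE *in  = fopen(FPL_PATH, "rb");
    FILE *out = fopen(FPL_PATH ".tmp", "wb");
    struct fpl_record rec;
    int removed = 0;

    if (!in || !out) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while (fread(&rec, sizeof(rec), 1, in) == 1) {
        if (!removed && rec.dev_phys_addr == dpa &&
            strcmp(rec.dev_serial, serial) == 0) {
            removed = 1; /* skip: this page becomes testable again */
            continue;
        }
        fwrite(&rec, sizeof(rec), 1, out);
    }
    fclose(in);
    fclose(out);
    if (rename(FPL_PATH ".tmp", FPL_PATH) != 0)
        return -1;
    return removed ? 0 : -1;
}
```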
FIG. 3 is a flow diagram of a method of generating a fault page log for a scalable memory device in accordance with one or more embodiments of the present disclosure.
For example, the method 300 may be performed by the processor 206 of the host device 102 shown in fig. 2. However, the present disclosure is not limited thereto, and the operations shown in the method 300 may be performed by any suitable one of, or any suitable combination of, the components and elements of one or more of the embodiments described above. Furthermore, the present disclosure is not limited to the order or number of operations of the method 300 shown in fig. 3, which may be changed to any desired order or number of operations as recognized by one of ordinary skill in the art. For example, in some embodiments, the order may be changed, or the method 300 may include fewer or additional operations.
Referring to fig. 2 and 3, the method 300 may begin, and at block 305, a transaction layer packet (TLP) may be received from a scalable memory device. For example, the scalable memory device 118 may set an error bit (e.g., a poison bit indicating an uncorrectable 2-bit error) in the header of a TLP for the target page and may send the TLP to the host device 102 (e.g., the root complex 202) accessing the target page.
At block 310, the error bit in the TLP may be detected, and at block 315, an interrupt may be generated in response to detecting the error bit. For example, in some embodiments, the OS kernel 210 may receive the TLP from the root complex 202 and may detect the error bit in the TLP. In response to detecting the error bit, the OS kernel 210 may generate an interrupt and may launch the AER handler registered in the device driver 214 of the scalable memory device 118.
At block 320, the fault page information may be persistently stored in the fault page log (FPL). For example, as part of the interrupt routine of the AER handler of the scalable memory device 118, the AER handler may store the fault page information (e.g., device serial number, device type, device physical address, etc.) of the failed page of the scalable memory device 118 in the FPL 222 and may launch the FPL daemon 216.
At block 325, the failed page may be taken offline according to the FPL, and the method 300 may end. For example, the FPL daemon 216 may read the FPL 222 and may take the failed page of the scalable memory device 118 offline according to the physical device information of the scalable memory device 118 stored in the FPL 222. In response to the failed page being taken offline, any process or application accessing the failed page may be terminated, and the method 300 may end.
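On a Linux host, one plausible implementation of the offline step in block 325 is the kernel's memory-failure debug interface: writing a host physical address to hard_offline_page unmaps the page and kills processes still using it, matching the first-state-to-second-state transition described above. This is a sketch assuming a kernel built with CONFIG_MEMORY_FAILURE, and assuming the HPA has already been recomputed from the logged DPA (see the remapping sketch earlier).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hard-offline the page containing the given host physical address. */
static int offline_page_hpa(uint64_t hpa)
{
    FILE *f = fopen("/sys/devices/system/memory/hard_offline_page", "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "0x%" PRIx64 "\n", hpa) > 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
```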
FIG. 4 is a flow diagram of a method of taking a failed page of a scalable memory device offline after a system restart in accordance with one or more embodiments of the present disclosure.
For example, the method 400 may be performed by the processor 206 of the host device 102 shown in fig. 2. However, the present disclosure is not limited thereto, and the operations shown in the method 400 may be performed by any suitable one of, or any suitable combination of, the components and elements of one or more of the embodiments described above. Furthermore, the present disclosure is not limited to the order or number of operations of the method 400 shown in fig. 4, which may be changed to any desired order or number of operations as recognized by one of ordinary skill in the art. For example, in some embodiments, the order may be changed, or the method 400 may include fewer or additional operations.
Referring to fig. 2 and 4, the method 400 may begin when the system is restarted, such that at block 405, any boot process (e.g., start-up process) may be completed, and at block 410, the FPL daemon may be started. For example, the reboot of the system may be performed by the OS kernel 210 of the host device 102. As described above, in some embodiments, the FPL daemon 216 may be launched after a system restart to automatically offline the failed pages identified in the FPL 222. For example, in some embodiments, the FPL daemon 216 may be a background process that is started after booting is completed (e.g., via registration in /etc/init.d/fpld), and at block 415, the FPL daemon 216 may read the FPL 222 (e.g., from the persistent storage 218) to determine one or more failed pages in one or more scalable memory devices. It should be noted that, because the scalable memory devices 118 may not include system-critical pages or memory types, the FPL 222 may not need to be updated during the memory initialization phase.
At block 420, the one or more failed pages may be taken offline according to the FPL, and the method 400 may end. For example, after a system restart, the FPL daemon 216 may automatically take offline each of the failed pages identified in the FPL 222 for each of the scalable memory devices 118 of the scalable memory pool 114, before any memory mapping to those pages occurs. Here, because the FPL 222 may include the physical device information of the failed pages, the failed pages may be identified even if the system memory mapping information (e.g., logical addresses) changes after the system restart. Accordingly, error logs may be reduced and the user experience may be improved.
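Putting blocks 405 through 420 together, a boot-time replay might look like the following sketch. It reuses FPL_PATH, struct fpl_record, and offline_page_hpa() from the sketches above; dpa_to_hpa_for_device() is a hypothetical helper that resolves a device serial number to its current HDM window before translating the DPA.

```c
#include <stdint.h>
#include <stdio.h>
/* FPL_PATH, struct fpl_record, offline_page_hpa() as sketched above */

/* Hypothetical: look up the device by serial, translate DPA to HPA. */
uint64_t dpa_to_hpa_for_device(const char *serial, uint64_t dpa);

/* Replay the FPL after boot, before any mapping to the pool occurs. */
int fpl_replay_at_boot(void)
{
    FILE *f = fopen(FPL_PATH, "rb");
    struct fpl_record rec;
    int failures = 0;

    if (!f)
        return 0; /* no log yet: nothing to offline */
    while (fread(&rec, sizeof(rec), 1, f) == 1) {
        uint64_t hpa = dpa_to_hpa_for_device(rec.dev_serial,
                                             rec.dev_phys_addr);
        if (hpa == 0 || offline_page_hpa(hpa) != 0)
            failures++; /* device absent or the offline write failed */
    }
    fclose(f);
    return failures;
}
```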
FIG. 5 is a schematic block diagram of a scalable memory system in accordance with one or more embodiments of the present disclosure.
Referring to fig. 5, the scalable memory system may include a first host device (i.e., a master host device) 102a and a second host device (i.e., a guest host device) 102b each connected to the scalable memory pool 114. For example, the first host device 102a may be connected to the scalable memory pool 114 via the second memory interface 116. The first host device 102a may be the same or substantially the same as the host device 102 described above, and thus, redundant description thereof may not be repeated. Similarly, the scalable memory pool 114 and the second memory interface 116 may be the same or substantially the same as those described above, and thus, redundant descriptions thereof may not be repeated.
The second host device 102b may have a configuration similar to that of the host device 102 described above. For example, the second host device 102b may include a host operating system (OS)/kernel 104, a host processor 106, a host memory 108 connected via a first memory interface 112, and a storage device 110 connected via a storage interface, and thus, a redundant description thereof may not be repeated. In some embodiments, the scalable memory pool 114 may be a network-attached scalable memory pool with respect to the second host device 102b. Accordingly, the second host device 102b may be connected to the scalable memory pool 114 through a suitable communication network (e.g., the Internet, a wide area network, a local area network, a cellular network, etc.) via a network interface (e.g., a network interface controller or network interface card (NIC)) 502.
As described in more detail below with reference to FIG. 6, in some embodiments, before accessing the scalable memory devices 118a, 118b, and 118c of the scalable memory pool 114, the FPL daemon 216 of the second host device 102b may communicate with the first host device 102a to copy the FPL 222 from the first host device 102a, and the failed pages of each of the scalable memory devices 118a, 118b, and 118c may be taken offline using the FPL 222 before accessing the scalable memory pool 114. Accordingly, the system memory map of the second host device 102b may exclude the failed page in the FPL 222, and thus, the error log may be reduced and the user experience may be improved.
FIG. 6 is a flow diagram of a method of sharing fault page information of a scalable memory device in accordance with one or more embodiments of the present disclosure.
For example, the method 600 may be performed by the processing circuitry 204 of the second host device 102b shown in fig. 5 (e.g., including the processor (i.e., guest host processor) 206 and the memory 208 storing instructions for execution by the processor 206). The processing circuitry 204 may be the same or substantially the same as that described above with reference to fig. 2, and thus, a redundant description thereof may not be repeated. However, the present disclosure is not limited thereto, and the operations shown in the method 600 may be performed by any suitable one of, or any suitable combination of, the components and elements of one or more of the embodiments described above. Furthermore, the present disclosure is not limited to the order or number of operations of the method 600 shown in fig. 6, which may be changed to any desired order or number of operations as recognized by one of ordinary skill in the art. For example, in some embodiments, the order may be changed, or the method 600 may include fewer or additional operations.
Referring to fig. 5 and 6, the method 600 may begin, and at block 605, the FPL daemon may be started. For example, in some embodiments, the FPL daemon of the second host device 102b may be launched before the scalable memory pool 114 is accessed. Here, because the system map (e.g., logical map) of the first host device 102a may be different from that of the second host device 102b, the second host device 102b may launch its FPL daemon 216 to replicate the FPL 222 of the first host device 102a. Accordingly, at block 610, the fault page information may be requested from the host (e.g., the first host device 102a), and at block 615, the fault page information may be received from the host. For example, the FPL daemon 216 of the second host device 102b may send, to the first host device 102a, a log request for the fault page information stored in the FPL 222 of the first host device 102a, and in response to the log request, the first host device 102a may send the fault page information (or the FPL 222) stored in its FPL 222 to the second host device 102b.
At block 620, the FPL may be updated based on the received fault page information, and at block 625, one or more failed pages may be taken offline according to the FPL. For example, the second host device 102b may update its FPL 222 based on the received fault page information and may take one or more failed pages of each of the scalable memory devices 118a, 118b, and 118c of the scalable memory pool 114 offline based on the updated FPL 222. Here, because the FPL 222 may include at least the physical device information of the failed pages of the scalable memory devices 118, the second host device 102b (e.g., its FPL daemon 216) may take the failed pages offline even if the system maps (e.g., logical maps) of the first host device 102a and the second host device 102b differ from each other.
Thus, at block 630, the system map may be updated by excluding the offline pages of the scalable memory pool, and the method 600 may end. For example, before accessing the scalable memory pool 114, the second host device 102b may update its system map (or may perform memory mapping) based on the offline pages, such that the failed pages of the scalable memory pool 114 are not accessed by applications or processes of the second host device 102b. Accordingly, error logs may be reduced and the user experience may be improved.
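Blocks 605 through 625 could be realized with an ordinary socket exchange; the sketch below is the guest side. The port number, the GET_FPL request verb, the raw-struct wire format (which assumes identical architectures on both ends), and the absence of authentication are all simplifications for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
/* struct fpl_record as sketched above */

/* Fetch up to max records from the primary host's FPL daemon. */
static int fpl_fetch_from_primary(const char *ip, struct fpl_record *buf,
                                  int max)
{
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(7710) }; /* assumed */
    int fd, n = 0;

    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    (void)write(fd, "GET_FPL\n", 8); /* assumed request verb */
    while (n < max &&
           read(fd, &buf[n], sizeof(buf[0])) == (ssize_t)sizeof(buf[0]))
        n++;
    close(fd);
    return n; /* number of records received */
}
```

Because the received records are keyed by device serial number and DPA, the guest can offline the same pages even though its logical memory map differs from the primary's.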
FIG. 7 is a schematic block diagram of a scalable memory system in accordance with one or more embodiments of the present disclosure.
Referring to fig. 7, the scalable memory system may include a first host device 102a, a second host device 102b, a third host device 102c, etc. each connected to the scalable memory pool 114. For example, the first host device 102a may be connected to the scalable memory pool 114 via the second memory interface 116. The first host device 102a may be the same or substantially the same as the host device 102 described above, and thus, redundant description thereof may not be repeated. Similarly, the scalable memory pool 114 and the second memory interface 116 may be the same or substantially the same as those described above, and thus, redundant descriptions thereof may not be repeated.
The second host device 102b and the third host device 102c may each have a configuration similar to that of the host device 102 described above. For example, in some embodiments, as with the host device 102, the second host device 102b and the third host device 102c may each include a host Operating System (OS)/kernel 104, a host processor 106, a host memory 108 connected via a first memory interface 112, and a storage device 110 connected via a storage interface, and thus, redundant descriptions thereof may not be repeated. In some embodiments, the scalable memory pool 114 may be a network-attached scalable memory pool with respect to the second host device 102b and the third host device 102 c. Accordingly, the second host device 102b and the third host device 102c may each be connected to the scalable memory pool 114 via a network interface (e.g., a network interface controller or a Network Interface Card (NIC)) through a suitable communication network (e.g., the internet, a wide area network, a local area network, a cellular network, etc.).
As shown in fig. 7, the second host device 102b may receive an error bit UE (e.g., indicating a 2-bit error on the same cache line) generated by the third scalable memory device 118c of the scalable memory pool 114. In this case, because the other host devices (e.g., the first host device 102a and the third host device 102c) may not be mapped to (or may not access) the target page of the third scalable memory device 118c, they may not be aware of the error occurring in that target page. Further, because the system maps (e.g., logical maps) of the first host device 102a and the third host device 102c may be different from the system map of the second host device 102b, the second host device 102b may send the updated FPL to at least one of the first host device 102a and the third host device 102c.
For example, as described in more detail below with reference to FIG. 8, in some embodiments, the second host device 102b (e.g., the AER handler registered thereon corresponding to the third scalable memory device 118c) may update its FPL 222 in response to the UE and may launch its FPL daemon 216 to take the failed page offline. The FPL daemon 216 of the second host device 102b may communicate with the first host device 102a to send the updated FPL 222 to the first host device 102a. The first host device 102a may update its FPL 222 according to the update received from the second host device 102b and may broadcast the update to the other host devices (e.g., the third host device 102c and the like) registered with the first host device 102a to access the scalable memory pool 114.
In another embodiment, if the second host device 102b communicates directly with the other host devices (e.g., the third host device 102c and the like), the second host device 102b may broadcast the update directly to the other host devices (e.g., the first host device 102a and the third host device 102c), rather than first sending the update to the first host device 102a and having the first host device 102a broadcast it to the remaining host devices (e.g., the third host device 102c and the like). However, other suitable modifications may be possible depending on the implementation of the scalable memory system and the communication configuration between the host devices.
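The two propagation patterns described above may be illustrated with the following sketch, in which a detecting host either relays the update through the first host device or broadcasts it directly to its peers. The JSON-over-TCP transport and the peer lists are purely illustrative assumptions; the disclosure does not specify a wire format.

import json
import socket
from dataclasses import asdict

def send_fpl_update(entries, peers):
    # Serialize new FPL entries and push them to peer FPL daemons.
    payload = json.dumps([asdict(e) for e in entries]).encode()
    for host, port in peers:
        with socket.create_connection((host, port)) as sock:
            sock.sendall(payload)

def propagate_update(entries, primary, registered_peers, is_primary):
    # Relay topology: a non-primary host sends the update only to the
    # primary host, which rebroadcasts it to every registered daemon.
    # A direct-broadcast variant would simply send to all peers.
    if is_primary:
        send_fpl_update(entries, registered_peers)
    else:
        send_fpl_update(entries, [primary])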
FIG. 8 is a flowchart of a method of updating fault page information of a scalable memory device in accordance with one or more embodiments of the present disclosure.
For example, the method 800 may be performed by the processing circuitry 204 of the first host device 102a shown in FIG. 7 (e.g., including the processor 206 and the memory 208 storing instructions for execution by the processor 206). The processing circuitry 204 may be the same or substantially the same as the processing circuitry 204 of the host device 102 described above with reference to FIG. 2, and thus, redundant description thereof may not be repeated. However, the present disclosure is not so limited, and the operations shown in the method 800 may be performed by any suitable one of, or any suitable combination of, the components and elements of the one or more embodiments described above. Furthermore, the present disclosure is not limited to the order or number of operations of the method 800 shown in FIG. 8, and the order or number of operations may be changed as recognized by one of ordinary skill in the art. For example, in some embodiments, the order may be changed, or the method 800 may include fewer or additional operations.
Referring to FIGS. 7 and 8, the method 800 may begin, and at block 805, an update to the FPL may be received from a second host device. For example, in some embodiments, as shown in FIG. 7, the second host device 102b may receive the UE from the target page of the third scalable memory device 118c. The first host device 102a and the third host device 102c may not receive the UE from the target page of the third scalable memory device 118c. For example, the first host device 102a and the third host device 102c may not be mapped to and/or may not access the target page of the third scalable memory device 118c. In response to receiving the UE from the target page of the third scalable memory device 118c, the second host device 102b may update its FPL 222 and may take the target page offline according to the updated FPL 222. The second host device 102b (e.g., its FPL daemon 216) may send the update to the first host device 102a through a suitable communication interface.
At block 810, the FPL of the first host device may be updated based on the received update. For example, the processor 206 of the first host device 102a (e.g., the FPL daemon 216) may update its FPL 222 based on the update received from the second host device 102b. At block 815, one or more failed pages may be taken offline in accordance with the updated FPL. For example, because the system map (e.g., logical map) of the first host device 102a may be different from the system map of the second host device 102b, the one or more failed pages identified from the updated FPL (e.g., based on their physical device information) may be taken offline in the system map of the first host device 102a based on the updated FPL.
At block 820, the updated FPL may be broadcast to other registered daemons, and the method 800 may end. For example, the updated FPL may be broadcast by the FPL daemon 216 of the first host device 102a. Because the system maps (e.g., logical maps) of the other host devices (e.g., the third host device 102c and the like) may be different from the system maps of the first host device 102a and the second host device 102b, the physical device information of the one or more failed pages may be broadcast, such that the system maps of the other host devices may be updated based on the updated FPL and the one or more failed pages may be taken offline. Accordingly, error logs may be reduced and user experience may be improved.
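Putting blocks 805 through 820 together, the receiving daemon's handler might merge the update, take the newly named pages offline, and rebroadcast only what is new, which also keeps the broadcast from looping between peers. This sketch builds on the hypothetical helpers introduced earlier.

def handle_fpl_update(raw_payload, fpl, local_map, registered_peers):
    # Handle an FPL update arriving from another host's FPL daemon.
    incoming = [FaultPageEntry(**rec) for rec in json.loads(raw_payload)]
    new_entries = [e for e in incoming if e not in fpl]
    fpl.extend(new_entries)                # block 810: update the local FPL
    offline_pages(new_entries, local_map)  # block 815: offline by physical info
    if new_entries and registered_peers:   # block 820: forward onward; a
        send_fpl_update(new_entries, registered_peers)  # repeated update adds
                                                        # nothing, ending loops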
According to one or more embodiments of the present disclosure described above, fault page information for each of the scalable memory devices of the scalable memory pool may be generated, persistently stored in a fault page log, and used to automatically take a failed page offline as needed or desired. According to one or more embodiments of the present disclosure described above, the fault page information may include at least physical device information (e.g., device serial number, device type, device physical address, etc.) of the failed page, such that the failed page may be taken offline even when the system mapping (e.g., logical address) of the failed page is changed or different. Thus, error logs may be reduced while extending the usable life of the scalable memory devices in the scalable memory pool.
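Because the FPL is meant to persist across reboots, a daemon could store it with an atomic write and replay it at boot before any pool pages are mapped. The file path and on-disk format below are assumptions for illustration only.

import json
import os
from dataclasses import asdict

FPL_PATH = "/var/lib/fpl/fault_page_log.json"  # hypothetical location

def persist_fpl(fpl):
    # Durably store the FPL so faulty pages stay offline after a reboot.
    tmp = FPL_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump([asdict(e) for e in fpl], f)
        f.flush()
        os.fsync(f.fileno())   # data reaches stable storage before the rename
    os.replace(tmp, FPL_PATH)  # atomic swap; a crash never truncates the log

def restore_fpl():
    # At boot, reload the FPL so its pages can be re-offlined before use.
    try:
        with open(FPL_PATH) as f:
            return [FaultPageEntry(**rec) for rec in json.load(f)]
    except FileNotFoundError:
        return []  # no faults recorded yet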
While particular embodiments may be implemented differently, the specific process order may be different from that described. For example, two consecutively described processes may be performed concurrently or substantially concurrently, or the processes may be performed in an order opposite to that described.
It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Accordingly, a first element, first component, first region, first layer, or first section discussed above could be termed a second element, second component, second region, second layer, or second section without departing from the spirit and scope of the present disclosure.
It will be understood that when an element or layer is referred to as being "on," "connected to," or "coupled to" another element or layer, it can be directly on, connected or coupled to the other element or layer, or one or more intervening elements or layers may be present. Similarly, when a layer, region, or element is referred to as being "electrically connected" to another layer, region, or element, it can be directly electrically connected to the other layer, region, or element and/or be indirectly electrically connected to the other layer, region, or element via one or more intervening layers, regions, or elements. In addition, it will also be understood that when an element or layer is referred to as being "between" two elements or layers, it can be the only element or layer between the two elements or layers, or one or more intervening elements or layers may also be present.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the present disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and "having," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. For example, the expression "A and/or B" means A, B, or A and B. An expression such as "at least one of," when it follows a list of elements, modifies the entire list of elements and does not modify the individual elements of the list. For example, the expressions "at least one of a, b, or c," "at least one of a, b, and c," and "at least one selected from the group consisting of a, b, and c" indicate: only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or variations thereof.
As used herein, the terms "substantially," "about," and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art. Furthermore, the use of "may" in describing embodiments of the present disclosure refers to "one or more embodiments of the present disclosure." As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilize," "utilizing," and "utilized," respectively. Furthermore, the term "exemplary" is intended to refer to an example or illustration.
An electronic or electrical device and/or any other related device or component in accordance with embodiments of the present disclosure described herein may be implemented using any suitable hardware, firmware (e.g., an application-specific integrated circuit), software, or a combination of software, firmware, and hardware. For example, the various components of these devices may be formed on one Integrated Circuit (IC) chip or on separate IC chips. In addition, the various components of these devices may be implemented on a flexible printed circuit film, a Tape Carrier Package (TCP), a Printed Circuit Board (PCB), or formed on one substrate. Further, the various components of these devices may be processes or threads running on one or more processors in one or more computing devices, executing computer program instructions and interacting with other system components to perform the various functions described herein. The computer program instructions are stored in a memory that may be implemented in a computing device using a standard memory device, such as, for example, a random-access memory (RAM). The computer program instructions may also be stored in other non-transitory computer-readable media, such as, for example, a CD-ROM, a flash drive, or the like. Further, a person of ordinary skill in the art should recognize that the functionality of the various computing devices may be combined or integrated into a single computing device, or the functionality of a particular computing device may be distributed across one or more other computing devices, without departing from the spirit and scope of the exemplary embodiments of the present disclosure.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Although a few embodiments have been described, those skilled in the art will readily appreciate that various modifications are possible in the embodiments without departing from the spirit and scope of the present disclosure. It will be understood that the description of features or aspects within each embodiment should generally be taken to be applicable to other similar features or aspects in other embodiments unless otherwise described. Thus, it will be apparent to one of ordinary skill in the art that the features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with the features, characteristics, and/or elements described in connection with other embodiments unless specifically indicated otherwise. It is therefore to be understood that the foregoing is illustrative of various example embodiments and is not to be construed as limiting the specific embodiments disclosed herein, and that various modifications to the disclosed embodiments, as well as other example embodiments, are intended to be included within the spirit and scope of the disclosure as defined in the appended claims and equivalents thereof.

Claims (20)

1. A system for processing a failed page, comprising:
a host processor;
a host memory connected to the host processor through a first memory interface; and
a scalable memory pool connected to the host processor through a second memory interface, the second memory interface being different from the first memory interface,
wherein the host memory includes instructions that, when executed by the host processor, cause the host processor to:
detect an error in a target page of a first memory device of the scalable memory pool;
generate an interrupt in response to detecting the error;
store fault page information corresponding to the target page of the first memory device in a fault page log; and
change a state of the target page of the first memory device from a first state to a second state according to the fault page log.
2. The system of claim 1, wherein the second memory interface comprises a peripheral component interconnect express (PCIe) interface and a compute express link (CXL) interconnect.
3. The system of claim 2, wherein the first memory interface comprises a dual in-line memory module (DIMM) interface.
4. The system of claim 1, wherein the scalable memory pool comprises at least two different types of compute express link (CXL) memory devices.
5. The system of claim 1, wherein the instructions further cause the host processor to:
perform a restart;
read the fault page log to identify one or more fault pages of the scalable memory pool; and
set the second state of the one or more fault pages according to the fault page log.
6. The system of claim 1, wherein the instructions further cause the host processor to:
receive a log request for the fault page log from a guest host processor, the guest host processor being configured to access the scalable memory pool; and
send, in response to the log request, the fault page log to the guest host processor,
wherein the guest host processor is configured to set a second state of one or more pages of the scalable memory pool based on the fault page log.
7. The system of claim 1, wherein the instructions further cause the host processor to:
receive an update from a first guest host processor, the first guest host processor having detected an error in a second memory device of the scalable memory pool;
identify a fault page of the second memory device based on the update;
update the fault page log; and
set a second state of the fault page of the second memory device based on the updated fault page log.
8. The system of claim 7, wherein the instructions further cause the host processor to:
the updated fault page log is broadcast to a second guest host processor configured to access the scalable memory pool.
9. The system of any one of claims 1 to 8, wherein the error in the target page of the first memory device of the scalable memory pool is a multi-bit error in the target page of the first memory device of the scalable memory pool, and the fault page information comprises physical device information of the target page of the first memory device.
10. A method of handling a failed page, comprising:
detecting, by a kernel of a first host device, an error in a target page of a first memory device of a scalable memory pool;
generating, by the kernel, an interrupt in response to detecting the error;
storing, by a device driver corresponding to the first memory device, fault page information corresponding to the target page of the first memory device in a fault page log; and
changing, by a fault page log (FPL) daemon, a state of the target page of the first memory device from a first state to a second state according to the fault page log.
11. The method of claim 10, wherein the first memory device of the scalable memory pool is connected to the first host device via a peripheral component interconnect express (PCIe) interface and a compute express link (CXL) interconnect.
12. The method of claim 10, wherein the scalable memory pool comprises at least two different types of compute express link (CXL) memory devices.
13. The method of claim 10, further comprising:
performing, by the kernel, a reboot;
reading, by the FPL daemon, the fault page log to identify one or more fault pages of the scalable memory pool; and
setting, by the FPL daemon, a second state of the one or more fault pages according to the fault page log.
14. The method of claim 10, further comprising:
receiving, by the FPL daemon of the first host device, a log request for the fault page log from an FPL daemon of a second host device, the second host device being configured to access the scalable memory pool; and
sending, by the FPL daemon of the first host device in response to the log request, the fault page log to the second host device,
wherein the FPL daemon of the second host device is configured to set a second state of one or more fault pages of the scalable memory pool based on the fault page log.
15. The method of claim 10, further comprising:
receiving, by the FPL daemon of the first host device, an update from an FPL daemon of a second host device in response to the second host device detecting an error in a second memory device of the scalable memory pool;
identifying, by the FPL daemon of the first host device, a fault page of the second memory device according to the update;
updating, by the FPL daemon of the first host device, the fault page log; and
setting, by the FPL daemon of the first host device, a second state of the fault page of the second memory device based on the updated fault page log.
16. The method of claim 15, further comprising: broadcasting, by the FPL daemon of the first host device, the updated fault page log to a third host device configured to access the scalable memory pool.
17. The method of any one of claims 10 to 16, wherein the error in the target page of the first memory device of the scalable memory pool is a multi-bit error in the target page of the first memory device of the scalable memory pool, and the fault page information comprises physical device information of the target page of the first memory device.
18. A host device, comprising:
a root complex connected to a scalable memory pool through a memory interface and configured to parse packets received from a memory device of the scalable memory pool;
a kernel configured to detect, from the parsed packets, an error bit corresponding to a fault page of a first memory device and to generate an interrupt in response to detecting the error bit;
a driver of the first memory device configured to store, in response to the interrupt, fault page information corresponding to the fault page of the first memory device in a fault page log; and
a fault page log (FPL) daemon configured to change a state of the fault page from a first state to a second state based on the fault page log.
19. The host device of claim 18, wherein the scalable memory pool comprises at least two different types of compute express link (CXL) memory devices.
20. The host device of claim 18, wherein, in response to a restart, the FPL daemon is configured to read the fault page log to identify one or more fault pages of the scalable memory pool, and to set a second state of the one or more fault pages according to the fault page log.
CN202310491026.XA, priority date 2022-05-18, filed 2023-05-04: System for processing faulty page, method for processing faulty page, and host device (published as CN117093390A, pending)

Applications Claiming Priority (3)

US63/343,410, priority date 2022-05-18
US17/845,679 (published as US12019503B2), filed 2022-06-21: Systems and methods for expandable memory error handling

Publications (1)

CN117093390A, published 2023-11-21

Family

ID=88772445

Family Applications (1)

CN202310491026.XA (CN117093390A, pending), priority date 2022-05-18, filed 2023-05-04: System for processing faulty page, method for processing faulty page, and host device



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination