CN115421948A - Method for detecting memory data fault and related equipment thereof - Google Patents
- Publication number
- CN115421948A CN202210912646.1A
- Authority
- CN
- China
- Prior art keywords
- data
- memory
- fault
- memory space
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0727—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Debugging And Monitoring (AREA)
Abstract
The application discloses a method for detecting memory data faults and related equipment, applied to the field of storage. The method comprises the following steps. Within a polling period, whether data stored in a first memory space has failed is detected, where the polling period is a period for checking a DDR5 memory, the first memory space belongs to the DDR5 memory, the data comprises N bits of data, and N is greater than or equal to 1. During polling, when a data fault is detected, fault information corresponding to the faulty data is obtained from a memory mode register, where the fault information comprises at least the memory address corresponding to the faulty data, the memory mode register is located in the DDR5 memory, and the memory mode register stores the mapping between the data stored in the first memory space and its memory addresses. The fault information is then sent to an operating system (OS) or a baseboard management controller (BMC). In this way, the system becomes aware of faults in the data stored in the DDR5 memory and can locate them accurately, without any additional hardware circuit, which reduces cost.
Description
Technical Field
The embodiments of the application relate to the field of storage, and in particular to a method for detecting memory data faults and related equipment.
Background
With the rapid growth of computing power across the technology industry, memory capacity keeps increasing. As double data rate synchronous dynamic random access memory (DDR SDRAM) moves to higher capacity, higher speed, and smaller process nodes, the probability of cell errors in the memory array rises, and so does the failure rate.
DDR5 uses an on-die error correction code (ECC) to reduce the defect rate of the memory chip and ensure data accuracy at high speed and high density. Specifically, extra ECC storage is provided so that errors are detected and corrected before data is sent to the central processing unit (CPU). However, because the system has no extra bus for transmitting ECC data, it cannot perceive ECC errors detected by the on-die ECC; that is, the system cannot perceive memory faults. One current technique adds a hardware circuit that raises an alarm signal to the memory controller once the number of ECC errors exceeds a threshold, so that the system perceives the memory fault. However, the alarm signal only reveals that a fault has occurred; it cannot pinpoint the fault location, and the added hardware circuit increases cost.
Disclosure of Invention
The application provides a method for detecting memory data faults and related equipment, applied to the field of DDR5 memory. The method can accurately determine the severity and exact location of a data fault in the memory without adding any hardware circuit, which reduces cost and suits more application scenarios.
In a first aspect, a method for detecting memory data faults is provided. The method includes the following steps.
Within a polling period, whether data stored in a first memory space has failed is detected. The polling period is a period for checking a DDR5 memory, the first memory space belongs to the DDR5 memory, the data in the first memory space comprises N bits of data, and N is greater than or equal to 1.
During polling, when a data fault is detected, fault information corresponding to the faulty data is obtained from a memory mode register. The fault information comprises at least the memory address corresponding to the faulty data, the memory mode register is used for managing the DDR5 memory, and the memory mode register stores the mapping between the data to be polled and its memory addresses.
Then, the fault information is sent to an operating system (OS) or a baseboard management controller (BMC).
In the embodiments of the application, when at least one piece of data in the first memory space is found to be faulty within the polling period, fault information comprising at least the memory address of the faulty data is obtained and sent to the OS or the BMC. The severity and exact location of the data fault in the memory can thus be determined accurately without any additional hardware circuit, which reduces cost and suits more application scenarios.
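The three steps of the first aspect can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: `poll_once`, `FaultInfo`, and the three callbacks are hypothetical names that do not appear in the application, and the callbacks stand in for the fault check, the mode-register read, and the OS/BMC reporting path.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FaultInfo:
    """Hypothetical container; the source only requires the memory address."""
    address: int


def poll_once(
    space: Iterable[int],
    is_faulty: Callable[[int], bool],
    read_mode_register: Callable[[int], FaultInfo],
    report: Callable[[FaultInfo], None],
) -> List[FaultInfo]:
    """Poll every address of the first memory space once.

    For each faulty address, fetch the fault information (modeled here as a
    callback standing in for the memory mode register) and report it to the
    OS or BMC (also a callback). Polling then continues with the rest.
    """
    faults = []
    for address in space:
        if is_faulty(address):
            info = read_mode_register(address)
            report(info)
            faults.append(info)
    return faults
```

A caller would invoke `poll_once` once per polling period, passing the configured address range as `space`.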
In a possible implementation manner of the first aspect, a value of the polling period and a first memory space are configured in a Basic Input Output System (BIOS) or a BMC, and the first memory space is less than or equal to a data storage space of the DDR5 memory.
In the embodiments of the application, the value of the polling period and the first memory space can be configured in multiple ways, which broadens the application scenarios of the solution and makes it more flexible.
In a possible implementation manner of the first aspect, when it is detected that the first data has a fault, the polling is stopped, and at least a first memory address corresponding to the first data is obtained from the memory mode register, where the first data is data stored in the first memory space.
In the embodiment of the application, the specific implementation mode that corresponding fault information is acquired when data has faults in the inspection process is specifically explained, and the reliability of the scheme is improved.
In a possible implementation of the first aspect, after at least the first memory address corresponding to the first data is obtained from the memory mode register, polling continues on second data in the first memory space, where the second data is the data in the first memory space that remains to be polled.
In the embodiments of the application, after the fault information of the faulty data is obtained, polling continues on the remaining data, which ensures that the data of the entire first memory space is polled and safeguards the reliability of the memory data.
In a possible implementation manner of the first aspect, in the polling period, one data in the first memory space and the first check code corresponding to the data are acquired each time. And then, obtaining a corresponding second check code based on the obtained data, and determining that the data has a fault based on that the first check code and the second check code corresponding to the data are different, or determining that the data has no fault based on that the first check code and the second check code corresponding to the data are the same.
In the embodiment of the application, a specific implementation mode for detecting whether the data fails is embodied, and the reliability of the scheme is improved.
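The check-code comparison can be sketched as follows. The application does not say at this point how the check code is computed, so the sketch uses a byte-wise XOR as a stand-in (an assumption chosen only because any single-bit change alters it); `check_code` and `has_fault` are hypothetical names.

```python
from functools import reduce
from operator import xor


def check_code(data: bytes) -> int:
    """Stand-in 8-bit check code (assumption): XOR of all data bytes."""
    return reduce(xor, data, 0)


def has_fault(data: bytes, first_code: int) -> bool:
    """Recompute the second check code from the data read back and compare
    it with the stored first check code; a mismatch means the data failed."""
    second_code = check_code(data)
    return second_code != first_code
```

For example, `has_fault(stored_data, stored_code)` returns `False` for unchanged data and `True` once a single bit of `stored_data` has flipped.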
In a possible implementation manner of the first aspect, when a fault is detected in data, error correction is performed on the faulty data to obtain target data, and the target data is written into a memory address of the faulty data.
In the embodiment of the application, error correction is performed on the failed data to obtain the target data, and the target data is written into the memory address of the failed data, so that the accuracy of data stored in the DDR5 memory is further ensured.
In a possible implementation of the first aspect, a fault count is obtained, where the fault count is the number of faults that occurred in the current polling period, and the value of the polling period is then adjusted based on the fault count.
In the embodiments of the application, the value of the polling period can be adjusted flexibly: when faults are frequent, the polling frequency is raised, so that data faults are monitored more accurately and closer to real time, which safeguards the reliability of the memory data and makes the solution more flexible.
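One way to picture this adjustment is the sketch below. The thresholds, the halving and doubling factors, and the clamping bounds are all assumptions made for illustration; the application only states that polling becomes more frequent when faults are numerous, and lengthening the period when no faults occur is an added assumption.

```python
def adjust_period(
    period_s: float,
    fault_count: int,
    high: int = 10,          # assumed threshold for "many faults"
    low: int = 0,            # assumed threshold for "few faults"
    min_s: float = 60.0,     # assumed lower bound on the period
    max_s: float = 86400.0,  # assumed upper bound on the period
) -> float:
    """Return the polling period (in seconds) to use next."""
    if fault_count >= high:
        period_s /= 2        # many faults: poll twice as often
    elif fault_count <= low:
        period_s *= 2        # no faults: poll half as often (assumption)
    return max(min_s, min(max_s, period_s))
```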
In a possible implementation of the first aspect, after at least two polling periods, the address intervals in the first memory space that were fault-free throughout those polling periods are obtained, and the part of the first memory space that excludes the fault-free address intervals is determined to be a second memory space; that is, the second memory space is smaller than the first memory space. In the next polling period, whether the data stored in the second memory space has failed is detected.
In the embodiments of the application, only the second memory space is checked in subsequent polling periods; that is, after a certain number of polling periods, data at fault-free addresses is no longer polled, which improves polling efficiency and reduces the use of device resources.
In a possible implementation manner of the first aspect, after the second memory space is patrolled in a certain number of patrol cycles, whether the data in the first memory space fails is detected again in a next patrol cycle.
In one implementation, the certain number may be at least two and can be chosen according to the actual situation, for example according to how often faults were detected in the preceding polling periods: when faults are frequent, all data in the first memory space may be checked again after three or five polling periods, and when faults are rare, after ten or more polling periods.
In the embodiment of the application, all the data in the first memory space are detected again, so that the whole DDR5 memory can be guaranteed to be free of faults as much as possible, and the reliability of the memory data is improved.
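The narrowing to a second memory space and the periodic return to the full first memory space can be sketched as below. The representation is an assumption: the first memory space is modeled as a list of half-open address intervals, and `second_space`, `space_for_period`, and `recheck_after` are hypothetical names.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # half-open address interval [start, end)


def second_space(
    first_space: List[Interval], faults_per_period: List[Set[int]]
) -> List[Interval]:
    """Drop the intervals that stayed fault-free in every recorded period.

    Narrowing only happens once at least two polling periods are recorded.
    """
    if len(faults_per_period) < 2:
        return list(first_space)
    seen = set().union(*faults_per_period)
    return [(s, e) for (s, e) in first_space if any(s <= a < e for a in seen)]


def space_for_period(
    first_space: List[Interval],
    faults_per_period: List[Set[int]],
    narrowed_periods_done: int,
    recheck_after: int,
) -> List[Interval]:
    """Pick the space to poll next: the narrowed second memory space, or the
    whole first memory space once `recheck_after` narrowed periods have run."""
    if narrowed_periods_done >= recheck_after:
        return list(first_space)
    return second_space(first_space, faults_per_period)
```

The value of `recheck_after` would be small (for example three or five) when faults are frequent and larger (ten or more) when they are rare, following the paragraph above.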
In a possible implementation manner of the first aspect, the failure information further includes a current row address error count value and/or a current column address error count value. The current row address error count value is used for indicating the number of times of errors occurring in a row of the memory address of the failed data, and the current column address error count value is used for indicating the number of times of errors occurring in a column of the memory address of the failed data.
In the embodiment of the application, the memory address of the data with the fault in the first memory space and the fault information such as the error count value of the current row address and/or the error count value of the current column address are obtained, so that the severity and the accurate position of the memory fault can be accurately positioned, an additional hardware circuit is not required to be added, the cost is reduced, and the method is suitable for more application scenarios.
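The shape of such fault information can be illustrated with a small software model. This is only a sketch: in the application the count values are part of the fault information obtained from the memory mode register, whereas here they are tallied in software, and the address layout (low `col_bits` bits select the column, the rest the row) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInfo:
    address: int
    row_error_count: int  # errors seen so far in the row of this address
    col_error_count: int  # errors seen so far in the column of this address


class FaultCounter:
    """Software model of per-row and per-column error counters."""

    def __init__(self, col_bits: int = 10):
        self.col_bits = col_bits  # assumed width of the column field
        self.rows = Counter()
        self.cols = Counter()

    def record(self, address: int) -> FaultInfo:
        row = address >> self.col_bits
        col = address & ((1 << self.col_bits) - 1)
        self.rows[row] += 1
        self.cols[col] += 1
        return FaultInfo(address, self.rows[row], self.cols[col])
```

A high row count with low column counts would point to a failing row, and the reverse to a failing column, which is the kind of severity and location hint the paragraph describes.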
In a possible implementation of the first aspect, after the fault information is obtained, it is sent to a target device so that the target device determines a target address based on the fault information, where the target device stores big data on faults of data in DDR5 memories.
In the embodiments of the application, sending the fault information to the target device enriches the fault big data, which can improve the precision and accuracy of fault-risk prediction.
In a possible implementation of the first aspect, after determining the target address based on the fault information, the target device may predict that data at the same target address in other DDR5 memories associated with the target device is also at risk of failure, and may synchronously perform memory repair isolation on the target address in those other DDR5 memories.
In the embodiments of the application, the target device uses the target address to predict that other DDR5 memories of the same type carry the same risk and takes measures accordingly. This broadens the application scenarios, improves efficiency across the whole deployment, reduces the fault-polling workload for each individual DDR5 memory, and safeguards the reliability of the memory data as far as possible.
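A possible shape for the target device's logic is sketched below. The selection rule is an assumption: an address becomes a target address once at least `min_dimms` distinct memories have reported a fault there. The application does not specify how the target address is derived from the fault big data, and all names here are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def predict_target_addresses(
    reports: Iterable[Tuple[str, int]], min_dimms: int = 2
) -> List[int]:
    """reports holds (memory id, faulty address) pairs collected from many
    DDR5 memories; return the addresses reported by enough distinct ones."""
    dimms_by_addr: Dict[int, Set[str]] = {}
    for dimm, addr in reports:
        dimms_by_addr.setdefault(addr, set()).add(dimm)
    return sorted(a for a, d in dimms_by_addr.items() if len(d) >= min_dimms)


def isolation_plan(
    target_addresses: List[int], all_dimms: Iterable[str]
) -> Dict[str, List[int]]:
    """Schedule repair isolation of every target address on every associated
    memory, including those that have not reported a fault yet."""
    return {d: list(target_addresses) for d in all_dimms}
```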
In a possible implementation manner of the first aspect, the memory repair isolation is a hard isolation repair or a soft isolation repair.
In the embodiment of the application, multiple memory repair isolation modes increase the application scenarios of the scheme, and reflect the selectivity and flexibility of the scheme.
In a second aspect, a processing apparatus is provided, comprising:
a processing unit, configured to detect, within a polling period, whether data stored in a first memory space has failed, where the polling period is a period for checking a DDR5 memory, the first memory space belongs to the DDR5 memory, the data in the first memory space comprises N bits of data, and N is greater than or equal to 1;
an obtaining unit, configured to obtain, when a data fault is detected during polling, fault information corresponding to the faulty data from a memory mode register, where the fault information comprises at least the memory address corresponding to the faulty data, the memory mode register is used for managing the DDR5 memory, and the memory mode register stores the mapping between the data to be polled and its memory addresses; and
a sending unit, configured to send the fault information to an operating system (OS) or a baseboard management controller (BMC).
In the embodiments of the application, when at least one piece of data in the first memory space is found to be faulty within the polling period, fault information comprising at least the memory address of the faulty data is obtained and sent to the OS or the BMC. The severity and exact location of the data fault in the memory can thus be determined accurately without any additional hardware circuit, which reduces cost and suits more application scenarios.
In a possible implementation manner of the second aspect, the processing unit is further configured to configure the value of the polling period and the first memory space in the BIOS or the BMC, where the first memory space is less than or equal to a data storage space of the DDR5 memory.
In the embodiments of the application, the value of the polling period and the first memory space can be configured in multiple ways, which broadens the application scenarios of the solution and makes it more flexible.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to stop the polling when it is detected that the first data has a fault, and obtain at least a first memory address corresponding to the first data from the memory mode register, where the first data is data stored in the first memory space.
In the embodiment of the application, the specific implementation mode that corresponding fault information is acquired when data has faults in the inspection process is specifically explained, and the reliability of the scheme is improved.
In a possible implementation of the second aspect, the processing unit is further configured to continue polling second data in the first memory space after at least the first memory address corresponding to the first data is obtained from the memory mode register, where the second data is the data in the first memory space that remains to be polled.
In the embodiments of the application, after the fault information of the faulty data is obtained, polling continues on the remaining data, which ensures that the data of the entire first memory space is polled and safeguards the reliability of the memory data.
In a possible implementation manner of the second aspect, the obtaining unit is further configured to obtain, in the polling period, one data in the first memory space and the first check code corresponding to the data each time.
And the processing unit is specifically configured to obtain a corresponding second check code based on the obtained data, and determine that the data has a fault based on that the first check code and the second check code corresponding to the data are different, or determine that the data has no fault based on that the first check code and the second check code corresponding to the data are the same.
In the embodiment of the application, a specific implementation mode for detecting whether the data fails is embodied, and the reliability of the scheme is improved.
In a possible implementation manner of the second aspect, the processing unit is further configured to correct errors of the faulty data to obtain the target data when detecting that the data has a fault.
And the sending unit is also used for writing the target data into the memory address of the failed data.
In the embodiment of the application, error correction is performed on the failed data to obtain target data, and the target data is written into the memory address of the failed data, so that the accuracy of data stored in the DDR5 memory is further ensured.
In a possible implementation of the second aspect, the obtaining unit is further configured to obtain a fault count, where the fault count is the number of faults that occurred in the current polling period.
The processing unit is further configured to adjust the value of the polling period based on the fault count.
In the embodiments of the application, the value of the polling period can be adjusted flexibly: when faults are frequent, the polling frequency is raised, so that data faults are monitored more accurately and closer to real time, which safeguards the reliability of the memory data and makes the solution more flexible.
In a possible implementation manner of the second aspect, the obtaining unit is further configured to obtain a fault-free address interval in the first memory space in the at least two polling periods after the at least two polling periods.
The processing unit is further configured to determine that the part of the first memory space that excludes the fault-free address intervals is a second memory space, where the second memory space is smaller than the first memory space, and to detect, in the next polling period, whether the data stored in the second memory space has failed.
In the embodiments of the application, only the second memory space is checked in subsequent polling periods; that is, after a certain number of polling periods, data at fault-free addresses is no longer polled, which improves polling efficiency and reduces the use of device resources.
In a possible implementation manner of the second aspect, the processing unit is further configured to detect whether the data in the first memory space fails again in a next polling cycle after polling the second memory space in a certain number of polling cycles.
In one implementation, the certain number may be at least two and can be chosen according to the actual situation, for example according to how often faults were detected in the preceding polling periods: when faults are frequent, all data in the first memory space may be checked again after three or five polling periods, and when faults are rare, after ten or more polling periods.
In the embodiment of the application, all the data in the first memory space are detected again, so that the whole DDR5 memory can be guaranteed to be free of faults as much as possible, and the reliability of the memory data is improved.
In a possible implementation manner of the second aspect, the failure information further includes a current row address error count value and/or a current column address error count value. The current row address error count value is used for indicating the number of times of errors occurring in a row of the memory address of the failed data, and the current column address error count value is used for indicating the number of times of errors occurring in a column of the memory address of the failed data.
In the embodiment of the application, the memory address of the data with the fault in the first memory space and the fault information such as the current row address error count value and/or the current column address error count value are obtained, so that the severity and the accurate position of the memory fault can be accurately positioned, an additional hardware circuit is not required to be added, the cost is reduced, and the method and the device are suitable for more application scenarios.
In a possible implementation of the second aspect, after the fault information is obtained, the sending unit is further configured to send the fault information to a target device so that the target device determines a target address based on the fault information, where the target device stores big data on faults of data in DDR5 memories.
In the embodiments of the application, sending the fault information to the target device enriches the fault big data, which can improve the precision and accuracy of fault-risk prediction.
In a possible implementation manner of the second aspect, the memory repair isolation is a hard isolation repair or a soft isolation repair.
In the embodiment of the application, multiple memory repair isolation modes increase the application scenarios of the scheme, and reflect the selectivity and flexibility of the scheme.
In a third aspect, a computing device is provided, which may include a processor coupled with a memory, the memory being configured to store instructions, where execution of the instructions by the processor causes the computing device to perform the method described in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, another computing device is provided, comprising a processor configured to execute a computer program (or computer-executable instructions) stored in a memory; when the program is executed, the method in the first aspect or any of its possible implementations is performed.
In one possible implementation, the processor and the memory are integrated together.
In another possible implementation, the memory is located outside the computing device.
The computing device also includes a communication interface for the computing device to communicate with other devices, such as the transmission or reception of data and/or signals. Illustratively, the communication interface may be a transceiver, circuit, bus, module, or other type of communication interface.
In a fifth aspect, a computer-readable storage medium is provided, comprising computer-readable instructions which, when run on a computer, cause the method described in the first aspect or any possible implementation of the first aspect to be performed.
In a sixth aspect, a computer program product is provided, comprising computer-readable instructions which, when run on a computer, cause the method described in the first aspect or any possible implementation of the first aspect to be performed.
Drawings
FIG. 1 is a schematic diagram of an on-die ECC module in DDR5;
FIG. 2a is a schematic architectural diagram of a computing device according to an embodiment of the present application;
FIG. 2b is a schematic structural diagram of a DDR5 memory and a memory mode register according to an embodiment of the present application;
FIG. 3a is a schematic diagram of a method for detecting a memory data fault according to an embodiment of the present application;
FIG. 3b is a schematic diagram of detecting a data fault according to an embodiment of the present application;
FIG. 4a is a schematic diagram of data stored in a DDR5 memory according to an embodiment of the present application;
FIG. 4b is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of another application scenario provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a method for detecting memory data faults and related equipment, applied to the field of storage. The method can accurately determine the severity and exact location of a memory fault without adding any hardware circuit, which reduces cost and suits more application scenarios.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Before introducing the embodiment of the present application, a simple description is first given of reducing the basic failure rate of the DDR5 memory chip and enabling the system to sense the failure of the memory chip in the current storage field, so as to facilitate subsequent understanding of the embodiment of the present application.
DDR5 memory provides an error correction function inside the memory chip through on-die ECC to ensure the accuracy of data at high speed and high density. Specifically, referring to fig. 1, fig. 1 is a schematic diagram of the on-die ECC module in DDR5, wherein the on-die ECC module of a DDR5 memory granule is composed of an ECC check bit generator, a syndrome decoder, and a correction module. When data is written into the memory, the ECC check bit generator calculates an 8-bit check code, namely ECC1, for every 128 bits of data, and the data and ECC1 are then written into the memory array. When data is read from the memory, the syndrome generator processes the read data to generate a new check code, ECC2, and the stored ECC1 is compared with ECC2. If no error exists, the data is sent out. If an error exists, the syndrome decoder determines the position of the erroneous bit and instructs the correction module to correct it, and the corrected data is sent out; the corrected data is not written back to the memory array.
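The write and read paths described above can be sketched as follows. This is a minimal illustration only: it assumes a Hamming-style single-error-correcting code in which the 8-bit check code is the XOR of the codeword positions of all set data bits. The text does not state which code a DDR5 die actually uses, and the names `data_positions`, `check_code`, and `read_word` are hypothetical.

```python
def data_positions(n_data: int = 128) -> list[int]:
    """1-based codeword positions that carry data bits (all non-powers of two)."""
    positions, p = [], 1
    while len(positions) < n_data:
        if p & (p - 1):          # non-zero only when p is not a power of two
            positions.append(p)
        p += 1
    return positions

POSITIONS = data_positions()
INDEX_OF = {pos: i for i, pos in enumerate(POSITIONS)}

def check_code(data: int) -> int:
    """8-bit check code (assumed scheme): XOR of the positions of all set data bits."""
    code = 0
    for i, pos in enumerate(POSITIONS):
        if (data >> i) & 1:
            code ^= pos
    return code

def read_word(stored_data: int, ecc1: int) -> tuple[int, bool]:
    """Return (data sent out, error detected). The stored copy is not modified."""
    ecc2 = check_code(stored_data)        # recompute the check code on read
    syndrome = ecc1 ^ ecc2                # compare ECC1 with ECC2
    if syndrome == 0:
        return stored_data, False         # no error: send the data out unchanged
    if syndrome in INDEX_OF:              # a single data bit flipped
        return stored_data ^ (1 << INDEX_OF[syndrome]), True
    return stored_data, True              # a check bit flipped; the data is intact

# Write path: compute ECC1 and store it alongside the 128-bit word.
original = 0x0123456789ABCDEF_FEDCBA9876543210
ecc1 = check_code(original)

# Read path with a simulated single-bit fault in the array.
corrupted = original ^ (1 << 77)
data_out, error = read_word(corrupted, ecc1)
```

Because `read_word` returns the corrected value without touching `corrupted`, the sketch mirrors the point made above: the error-corrected data is sent out, but the array still holds the faulty word, and nothing outside the die learns that a correction took place.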
Since the DDR5 system does not have an extra bus for transmitting ECC data, the system cannot sense an ECC error detected by the on-die ECC, that is, it cannot sense a memory failure. At present, to enable the system to sense a memory failure, a hardware circuit is added to a DDR5 dual inline memory module (DIMM) that uses on-die ECC to detect error correction code errors, namely ECC errors, in the memory. After the number of ECC errors exceeds a threshold, an alarm signal is provided to the memory controller, so that ECC error counting is used for prediction and alarm and the system senses the memory failure. However, prediction and alarm based on ECC error counting only enable the system to sense that the memory has failed; they cannot accurately determine the severity or the precise location of the memory failure, and the additional hardware circuit increases cost.
To solve the above problem, embodiments of the present application provide a method for detecting a memory data failure and a related device thereof, which are applied to devices in the storage field. The method includes: detecting, in a polling period, whether data stored in a first memory space fails, where the polling period is a period for detecting a DDR5 memory, the first memory space belongs to the DDR5 memory, the data of the first memory space includes N data bits, and N is greater than or equal to 1; in the polling process, when a data failure is detected, obtaining failure information corresponding to the failed data from a memory mode register, where the failure information includes at least a memory address corresponding to the failed data, the memory mode register is located in the DDR5 memory, and the memory mode register is used for storing the mapping relationship between the data to be polled and the memory address; and sending the failure information to the OS or the BMC. In the embodiments of the present application, when at least one piece of data in the first memory space is determined to have failed in the polling period, failure information including at least the memory address of the failed data is obtained and sent to the OS or the BMC, so that the severity and the precise location of the data failure in the memory can be accurately determined without adding an additional hardware circuit, which reduces cost and makes the method suitable for more application scenarios.
For better understanding of the embodiments of the present application, first, a method for detecting a memory data failure provided by the embodiments of the present application is described in detail below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
First, for the convenience of understanding the subsequent embodiments, a computing device architecture to which the method for detecting memory data failure provided in the embodiments of the present application is applied is briefly described. Specifically, referring to fig. 2a, fig. 2a is a schematic diagram of an architecture of a computing device according to an embodiment of the present application, which specifically includes:
a Central Processing Unit (CPU) 204, a BMC203 or an OS205, and at least one DDR5 memory 202, where the DDR5 memory 202 and the BMC203 are connected to the CPU204. The CPU204 is configured to detect whether data stored in the first memory space fails in a polling period, where the polling period is a period for detecting the DDR5 memory 202, the first memory space belongs to the DDR5 memory 202, the data includes N data bits, and N is greater than or equal to 1.
In the polling process, when a data failure is detected, the CPU204 obtains failure information corresponding to the failed data from the memory mode register, where the failure information at least includes a memory address corresponding to the failed data, the memory mode register is used to manage the DDR5 memory 202, and the memory mode register is used to store a mapping relationship between the data stored in the first memory space and the memory address. Specifically, please refer to fig. 2b, where fig. 2b is a schematic structural diagram of a DDR5 memory and a memory mode register according to an embodiment of the present application. The CPU204 includes at least one memory mode register 2041, the memory mode register 2041 is connected to the DDR5 memory, the memory mode register 2041 is used to support the CPU204 to manage the DDR5 memory, and specifically, the CPU204 performs read/write operation and other operations such as polling on the DDR5 memory through an instruction supported by the memory mode register 2041. In one possible implementation, the memory mode register 2041 is located within an Integrated Memory Controller (IMC) in the CPU 204.
After acquiring the failure information, the CPU204 is also configured to send the failure information to the BMC203 or the OS 205.
In the embodiments of the present application, the CPU determines in the polling period that data stored in the first memory space has failed, obtains the failure information, and sends it to the BMC or the OS, where the failure information includes a fault address indicating the position of the failed data in the first memory space. In this way, both the occurrence of a fault in the data stored in the DDR5 memory and the position where the fault occurs can be sensed accurately without an additional hardware circuit, which reduces cost and makes the method suitable for more application scenarios.
In one possible implementation, the computing device also includes a Basic Input Output System (BIOS) 201. The BIOS201 and the BMC203 are connected to the CPU 204.
Before polling the first memory space of the DDR5 memory, in one implementation, the value of the polling period and the first memory space are configured in the BIOS201 or the BMC203, and the first memory space is less than or equal to the data storage space of the DDR5 memory. It will be appreciated that the value of the polling period and the first memory space can also be configured in the OS205 or in an application program of the application layer of the computing device. Specifically, the first memory space may be configured by setting any two addresses in the DDR5 memory as the start address and the last address, respectively. In one implementation, the size of the first memory space is equal to the data storage space of the DDR5 memory, which is not limited herein.
In an implementation manner, the CPU204 may call the BMC203 or the BIOS201 to detect whether the data stored in the first memory space fails in the polling period. In the polling process, when a data failure is detected, the CPU204 obtains failure information corresponding to the data of the failure from the memory mode register 2041. In other implementation manners, the CPU204 may further call the OS205 or an application program of the application layer to detect whether the data stored in the first memory space has a fault in the polling period, and in the polling process, when it is detected that the data has the fault, the CPU204 obtains fault information corresponding to the faulty data from the memory mode register 2041. It is understood that the determination may be made according to specific requirements in practical situations, and is not limited herein.
In one implementation, the CPU204 also sends the fault information to the BMC203 or the OS205, so that a failure of data in the DDR5 memory is sensed. The BMC203 or the OS205 may then act as a fault diagnosis system, obtain a fault feature based on the fault information, and notify the CPU204 to perform memory isolation repair on the memory address of the faulty data based on the fault feature.
It should be noted that, in another possible implementation manner, the BIOS201 is configured to detect whether the data stored in the first memory space fails during the polling period. In the polling process, when a data failure is detected, the BIOS201 obtains failure information corresponding to the failed data from the memory mode register 2041. In an implementation manner, the BIOS201 further sends the obtained failure information to the BMC203 or the OS205, so that the failure of data in the DDR5 memory can be sensed externally. In one possible implementation, the BMC203 or the OS205 obtains a fault feature based on the fault information, and notifies the CPU204 to implement the memory isolation repair based on the fault feature.
In one possible implementation, the BMC203 is configured to detect whether the data stored in the first memory space fails during the polling period. In the polling process, when a data failure is detected, the BMC203 obtains the failure information corresponding to the failed data from the memory mode register 2041, so that the failure of data in the DDR5 memory can be sensed externally. In one implementation, the BMC203 further obtains a fault feature based on the fault information, and notifies the CPU204 to implement the memory isolation repair based on the fault feature.
In the foregoing multiple implementation manners, whether the data in the first memory space fails or not is detected in the polling cycle by different manners, and when the data fails, the failure information of the failed data is obtained. The application scenes suitable for the scheme are increased, and the diversity and the selectivity of the scheme are reflected.
Specifically, the following takes the computing device in fig. 2a to implement the method for detecting a memory data failure provided in the present application as an example for detailed description, specifically referring to fig. 3a, where fig. 3a is a schematic diagram of the method for detecting a memory failure provided in the embodiment of the present application, and specifically includes:
301. Detect whether the data stored in the first memory space fails in the polling period.
Specifically, the CPU detects whether the first memory space has a fault within the polling period. The polling period is a period configured to poll whether the whole first memory space has a fault.
Illustratively, the CPU calls the BIOS, BMC, OS, or application program to detect whether the data stored in the first memory space fails during the polling period. Or, the BIOS, the BMC, the OS or the application program detects whether the data stored in the first memory space fails in the polling period. It is understood that the actual situation may be determined according to specific requirements, and the details are not limited herein.
Specifically, the memory space of the DDR5 memory is cyclically detected based on the polling period (the memory space at least includes the first memory space), thereby realizing real-time detection of the fault of the DDR5 memory and improving the reliability of the DDR5 memory.
In one implementation, the polling period and the first memory space may be configured in the BIOS, the BMC, the OS, or the application program, where the first memory space is less than or equal to the data storage space of the DDR5 memory.
For example, the polling period may be a period of thirty minutes, one hour, six hours, twenty hours, or other time units, and is not limited herein.
The first memory space may be configured according to the address of the data storage space in the DDR5 memory in fig. 2 a. For example, any two addresses of the data storage space in the DDR5 memory are defined as a start address and a last address of a first memory space, respectively, in an implementation manner, the first memory space may be a data storage space of the entire DDR5 memory, and in other cases, may also be smaller than the data storage space in the DDR5 memory space, which is not limited herein.
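As a rough sketch of the configuration described above, the polling period and the first memory space can be held in one record and checked against the DDR5 data storage space. The class name `PatrolConfig`, its fields, and the use of seconds as the time unit are assumptions made for illustration; the text does not prescribe a data layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatrolConfig:
    period_seconds: int   # value of the polling period
    start_address: int    # start address of the first memory space
    last_address: int     # last address of the first memory space (inclusive)

    def validate(self, dram_start: int, dram_last: int) -> None:
        """Reject settings where the first memory space is not within the DDR5 data storage space."""
        if self.period_seconds <= 0:
            raise ValueError("polling period must be positive")
        if self.start_address > self.last_address:
            raise ValueError("start address must not exceed last address")
        if self.start_address < dram_start or self.last_address > dram_last:
            raise ValueError("first memory space must lie inside the DDR5 data storage space")

    @property
    def size(self) -> int:
        """Number of addresses covered by the first memory space."""
        return self.last_address - self.start_address + 1

# Six-hour polling period over addresses 0x00 to 0xDF of a space spanning 0x00 to 0xFF.
config = PatrolConfig(period_seconds=6 * 3600, start_address=0x00, last_address=0xDF)
config.validate(dram_start=0x00, dram_last=0xFF)
```

Setting `start_address` and `last_address` to the bounds of the data storage space gives the case in which the first memory space equals the whole DDR5 data storage space.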
In the embodiments of the present application, the polling period and the first memory space are configured in the BIOS, the BMC, the OS, or an application program in the application layer of the computing device, which reflects the diversity and selectivity of the solution. In addition, the solution is implemented in software, so the implementation cost can be reduced as much as possible.
For convenience of understanding the present solution, the following description will take an example of detecting whether the first memory space fails in the polling period by the BIOS.
In a possible implementation manner, in the polling period, one data in the first memory space and the first check code corresponding to the data are obtained each time. And then, obtaining a corresponding second check code based on the obtained data, and determining that the data has a fault based on that the first check code and the second check code corresponding to the data are different, or determining that the data has no fault based on that the first check code and the second check code corresponding to the data are the same.
Illustratively, the BIOS may detect whether the data stored in the first memory space fails by traversing all data of the entire first memory space through error check and scrub (ECS). ECS reads the memory internally, corrects single-bit errors, and writes the corrected data bits back to the memory array, so that the accuracy of the data stored in the memory is maintained.
For example, please refer to fig. 3b, which illustrates an example of BIOS, to understand the implementation process of the ECS, and fig. 3b is a schematic diagram of detecting data failure according to an embodiment of the present disclosure. The method comprises the following steps:
S1. Obtain data and the corresponding first check code.
Specifically, the BIOS obtains one piece of data and its first check code from the first memory space at a time. For example, the data is 128 data bits (that is, N = 128) and the first check code is 8 check bits. It is understood that in other application scenarios the data may be 64 data bits (that is, N = 64); in practice, the number N of data bits included in the data may be determined according to the specific situation, which is not limited herein.
S2. Obtain a second check code based on the data.
The BIOS then calculates a second check code, that is, a new set of 8 check bits, from the obtained 128 data bits. Specifically, the first check code was obtained from the data when the data was written into the DDR5 memory, so the BIOS obtains the second check code from the 128-bit data read from the first memory space in the same manner as the first check code was obtained. This is similar to the description of ECC in the foregoing example of fig. 1, and details are not repeated here.
S3. Determine that the data has failed based on the first check code and the second check code being different, or determine that the data has not failed based on the first check code and the second check code being the same.
Specifically, the BIOS compares the new 8 check bits, that is, the second check code, with the obtained 8 check bits, that is, the first check code. When the two check codes are identical, there is no ECS error, that is, the 128 data bits of the data have no fault; when the two check codes differ, there is an ECS error, that is, the data has failed. When there is an ECS error, information related to the ECS error, that is, the failure information, is recorded in a memory mode register (MR register) of the DDR5 memory. The failure information includes, for example, the memory address of the failed data, a current row address error count value (the number of times that an error has occurred in the current row of the memory address of the failed data), and/or a current column address error count value (the number of times that an error has occurred in the current column of the memory address of the failed data). It can be understood that other failure-related information may also be recorded, which is not described herein again. Taking the DDR5 memory as an example, the current DDR5 memory has 256 mode registers, MR0 to MR255, each of which is composed of eight operation bits.
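Steps S1 to S3 and the recording of failure information can be sketched as a simple loop over a simulated memory. Several things here are assumptions for illustration only: the check code is a stand-in (XOR of the sixteen bytes of a 128-bit word) that detects but does not locate errors, the address-to-row/column mapping via `columns_per_row` is hypothetical, and `FaultLog` merely plays the role of the failure information that the text says is recorded in the mode registers.

```python
from dataclasses import dataclass, field

def check_code(data: int) -> int:
    """Stand-in 8-bit check code: XOR of the sixteen bytes of a 128-bit word."""
    code = 0
    for shift in range(0, 128, 8):
        code ^= (data >> shift) & 0xFF
    return code

@dataclass
class FaultLog:
    addresses: list = field(default_factory=list)    # memory addresses of failed data
    row_errors: dict = field(default_factory=dict)   # row -> error count
    col_errors: dict = field(default_factory=dict)   # column -> error count

def patrol(memory: dict, log: FaultLog, columns_per_row: int = 16) -> None:
    """Traverse memory, which maps address -> (data, first check code)."""
    for address in sorted(memory):
        data, first_code = memory[address]        # S1: data and first check code
        second_code = check_code(data)            # S2: recompute the check code
        if first_code == second_code:             # S3: identical codes, no fault
            continue
        row, col = divmod(address, columns_per_row)   # hypothetical address mapping
        log.addresses.append(address)
        log.row_errors[row] = log.row_errors.get(row, 0) + 1
        log.col_errors[col] = log.col_errors.get(col, 0) + 1

# Simulated first memory space with two injected single-bit faults.
memory = {addr: (addr * 0x1111, check_code(addr * 0x1111)) for addr in range(0x10, 0x18)}
for faulty_address, bit in ((0x12, 5), (0x15, 100)):
    data, code = memory[faulty_address]
    memory[faulty_address] = (data ^ (1 << bit), code)

log = FaultLog()
patrol(memory, log)
```

After the run, `log.addresses` holds the two faulty addresses and both faults are counted against the same row, which is the kind of per-row and per-column evidence the count values are meant to provide.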
It should be noted that, in an implementation manner, in the polling period, the BIOS may traverse the entire first memory space one or more times, and in an actual situation, the BIOS may be determined according to a specific requirement, and is not limited herein.
It should be noted that, the foregoing example only takes the BIOS polling the first memory space as an example, and in other application scenarios, the specific implementation manner of polling the first memory space by using the BMC, the OS, or the CPU is similar to the specific implementation manner of polling the first memory space by using the BIOS, and details are not repeated here.
In the polling process, when a data failure is detected, failure information corresponding to the failed data is obtained from the memory mode register. The failure information includes at least a memory address corresponding to the failed data, where the memory mode register is located in the DDR5 memory and is used for storing a mapping relationship between the data stored in the first memory space and the memory address. This is described in step 302:
302. When a data failure is detected, obtain failure information corresponding to the failed data.
In a possible implementation manner, when it is detected that the first data has a fault, the polling is stopped, and at least a first memory address corresponding to the first data is obtained from the memory mode register, where the first data is data stored in the first memory space.
For example, to facilitate understanding, the following still takes BIOS as an example to obtain fault information corresponding to faulty data for description, specifically refer to fig. 4a, and fig. 4a is a schematic diagram of data stored in a DDR5 memory according to an embodiment of the present application. The first data and/or the second data are/is stored in the DDR5 memory, and the storage medium is not limited herein.
When the BIOS determines that the first data in the first memory space has failed, that is, an ECS error is detected, the BIOS triggers a System Management Interrupt (SMI) to stop polling and, based on the currently detected failed first data, obtains the failure information corresponding to the first data from the memory mode register, specifically at least the first memory address of the first data. For example, the BIOS may obtain, from the MR16 register to the MR19 register, the first memory address corresponding to the first data for which the ECS error was detected.
In the embodiment of the application, the specific implementation mode that corresponding fault information is acquired when data has faults in the inspection process is specifically explained, and the reliability of the scheme is improved.
In a possible implementation manner, after at least a first memory address corresponding to first data is obtained from the memory mode register, second data in the first memory space is continuously patrolled, where the second data is data remaining in the first memory space to be patrolled.
For example, after the first memory address of the first data is obtained, the BIOS continues to patrol for second data, where the second data may be data remaining in the first memory space to be polled. In an implementation manner, the second data may be data next to the first data in the first memory space, or may be other data to be inspected at intervals, which is not limited herein.
In the polling cycle, the BIOS detects whether all data in the first memory space have a fault through the ECS, acquires fault information of the faulty data when the faulty data exists, and then continuously detects the remaining data to be polled, and continuously detects the remaining data to be polled when the currently detected data does not have a fault. The data of the whole first memory space can be ensured to be patrolled and examined, and the reliability of the memory data is ensured.
In the embodiment of the present application, after the steps 301 and 302 are executed, the memory address of the data with the fault in the first memory space can be obtained, so that the precise location of the memory fault can be accurately located, an additional hardware circuit is not required, the cost is reduced, and the method is suitable for more application scenarios.
In one implementation, the failure information may further include at least one of a current row address error count value or a current column address error count value, in addition to a memory address corresponding to the failed data. For example, when an ECS error is detected, the BIOS may obtain a memory address corresponding to the data failing from the MR16 register to the MR19 register, and/or obtain a current row address error count value and/or a current column address error count value from the MR20 register. It is understood that the BIOS may also obtain failure information corresponding to other failed data, and is not limited herein.
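The register reads described above might look like the following sketch. The callback `read_mr` is a hypothetical stand-in for whatever mechanism the platform uses to read a mode register, and the byte order (MR16 as the least significant byte of the address) is an assumption, since the text does not describe how the address is split across MR16 to MR19 or how the row and column counts are encoded in MR20.

```python
from typing import Callable

def read_fault_info(read_mr: Callable[[int], int]) -> dict:
    """Collect failure information from MR16..MR20 through a caller-supplied register reader."""
    address = 0
    for offset, index in enumerate(range(16, 20)):   # MR16 to MR19
        value = read_mr(index) & 0xFF                # each mode register holds eight bits
        address |= value << (8 * offset)             # assumed: MR16 is the low byte
    error_count = read_mr(20) & 0xFF                 # MR20: raw error count value
    return {"memory_address": address, "error_count": error_count}

# Simulated register file used in place of real hardware access.
fake_registers = {16: 0x40, 17: 0x1F, 18: 0x00, 19: 0x00, 20: 3}
info = read_fault_info(fake_registers.__getitem__)
```

Passing the reader in as a parameter keeps the sketch independent of whether the BIOS, the BMC, the OS, or the CPU performs the access.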
In the embodiment of the application, the memory address of the data with the fault in the first memory space and the fault information such as the error count value of the current row address and/or the error count value of the current column address are obtained, so that the severity and the accurate position of the memory fault can be accurately positioned, an additional hardware circuit is not required to be added, the cost is reduced, and the method is suitable for more application scenarios.
In a possible implementation manner, when a data failure is detected, error correction is performed on the failed data to obtain target data, and the target data is written into a memory address of the failed data. For example, please refer to step S4 and step S5 in the example of fig. 3 b:
S4. Correct the faulty data to obtain target data.
The BIOS supports correcting single bit errors in the failed data through the ECS, thereby obtaining the target data. Specifically, the bits with determined errors can be flipped and corrected, so as to obtain correct and fault-free data.
S5. Write the target data into the memory address of the faulty data.
The ECS also supports the write back of error corrected data bits into the memory array so that the data stored in the memory maintains accuracy. Therefore, the BIOS can write the target data obtained after error correction into the memory address of the fault data in the first memory space through the ECS, and the accuracy of data stored in the DDR5 memory is guaranteed.
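Steps S4 and S5 can be illustrated with a very small sketch. It assumes that the position of the erroneous bit has already been determined by the decoding logic; the function name `scrub_word` and the dictionary used as a memory array are hypothetical.

```python
def scrub_word(memory: dict, address: int, error_bit: int) -> int:
    """Correct a single-bit error and write the result back to the same address."""
    faulty = memory[address]
    target = faulty ^ (1 << error_bit)   # S4: flip the erroneous bit to get the target data
    memory[address] = target             # S5: write the target data back to the array
    return target

# A 128-bit word whose bit 9 was flipped in the array.
good_word = 0x00FF00FF00FF00FF00FF00FF00FF00FF
array = {0x40: good_word ^ (1 << 9)}
scrub_word(array, 0x40, error_bit=9)
```

The write-back in S5 is what distinguishes ECS from the on-die ECC read path of fig. 1, where the corrected data is sent out but the array copy is left unchanged.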
In one possible implementation, the number of failures is obtained, where the number of failures is the number of failures that occur in the current polling period, and the value of the polling period is then adjusted based on the number of failures. Polling then continues with the adjusted polling period.
For example, the BIOS may adjust the value of the polling period according to the number or frequency of failures. Specifically, the BIOS determines the value of the polling period according to whether the number of failures detected in the current polling period satisfies a set threshold, where different thresholds of the number of failures correspond to different values of the polling period. For example, a threshold of 20 corresponds to a polling period of 24 hours, a threshold of 50 corresponds to a polling period of 6 hours, and so on.
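The threshold lookup in this example can be sketched as follows. The sketch assumes that "satisfies a threshold" means the number of failures is greater than or equal to it, that the highest satisfied threshold wins, and that the period is left unchanged when no threshold is satisfied; the names `PERIOD_TABLE` and `adjust_period` are hypothetical.

```python
# (threshold on the number of failures, polling period in hours), highest threshold first
PERIOD_TABLE = [(50, 6), (20, 24)]

def adjust_period(failure_count: int, current_hours: int) -> int:
    """Return the polling period to use for the next period."""
    for threshold, hours in PERIOD_TABLE:
        if failure_count >= threshold:   # assumed meaning of "satisfies the threshold"
            return hours
    return current_hours                 # no threshold reached: keep the current value

next_period = adjust_period(failure_count=57, current_hours=48)
```

With 57 failures the 50-failure threshold is reached, so the next polling period is 6 hours; with 25 failures it would be 24 hours.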
In one implementation, the BIOS may generate an alarm when the number of times of occurrence of the failure in the current inspection cycle satisfies a threshold, and prompt the user through the alarm, so that the user may adjust the value of the inspection cycle according to the current alarm. And/or the user may also take action to repair the failed data based on the alarm.
In the embodiments of the present application, the value of the polling period can be adjusted flexibly. Increasing the polling frequency when failures occur more often allows data failures to be monitored more accurately and in closer to real time, which improves the reliability of the memory data and the flexibility of the solution.
In a possible implementation manner, after at least two polling periods pass, a fault-free address interval in a first memory space in the at least two polling periods is obtained, and a memory space in the first memory space, which does not include the fault-free address interval, is determined to be a second memory space, that is, the second memory space is smaller than the first memory space. And in the next polling period, whether the data stored in the second memory space has a fault is detected.
For example, after the BIOS has traversed all data of the first memory space through the ECS in at least two polling periods, the BIOS determines, based on the results of the two ECS polling passes, the fault-free address interval, that is, the addresses whose data did not fail in any of the at least two polling periods. For example, the start address of the first memory space is 0x00 and the last address is 0xdf; the fault-free address interval in the first polling period includes 0x00 to 0x3f and 0x80 to 0xdf, and the fault-free address interval in the second polling period includes 0x00 to 0x4f and 0x90 to 0xdf. The addresses that were fault-free in both polling periods are therefore 0x00 to 0x3f and 0x90 to 0xdf, and the second memory space is determined to be 0x3f to 0x90, that is, the addresses lying between those two intervals. Then, whether the data stored in the second memory space fails is detected in a following polling period, which is similar to detecting whether the data stored in the first memory space fails, and details are not repeated here.
It should be noted that the foregoing example of determining the second memory space is only used to illustrate the embodiment of the present application, and does not substantially limit the present application, and it should be understood that in actual situations, the second memory space may be determined according to specific situations, and the specific situation is not limited herein.
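The address arithmetic in the foregoing example can be checked with a short sketch. It assumes inclusive (start, end) address intervals and takes the fault-free interval to be the intersection of the per-period fault-free intervals; the function names are hypothetical.

```python
def intersect(a: list, b: list) -> list:
    """Intersect two lists of inclusive (start, end) intervals."""
    result = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start <= end:
                result.append((start, end))
    return sorted(result)

def remaining(space: tuple, fault_free: list) -> list:
    """Parts of `space` not covered by the sorted `fault_free` intervals (assumed to lie inside it)."""
    start, end = space
    gaps, cursor = [], start
    for f_start, f_end in fault_free:
        if f_start > cursor:
            gaps.append((cursor, f_start - 1))
        cursor = max(cursor, f_end + 1)
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps

first_space = (0x00, 0xDF)
period_1 = [(0x00, 0x3F), (0x80, 0xDF)]   # fault-free in the first polling period
period_2 = [(0x00, 0x4F), (0x90, 0xDF)]   # fault-free in the second polling period

fault_free = intersect(period_1, period_2)
second_space = remaining(first_space, fault_free)
```

With inclusive bounds the result is the single interval 0x40 to 0x8f, which is the region the example describes as 0x3f to 0x90 when the two boundary addresses are read as the edges of the neighbouring fault-free intervals.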
In the embodiments of the present application, the second memory space is detected in the polling period; that is, after a certain number of polling periods, data at addresses where no fault has occurred is no longer polled, which improves polling efficiency and reduces the occupation of device resources.
In one implementation, after the second memory space has been polled for a certain number of polling periods, whether the data in the first memory space fails is detected again in the next polling period. The certain number may be at least two and may be determined according to the actual situation, for example according to the failure frequency in the detection results of the preceding polling periods: when many failures occur, all data in the first memory space may be detected again after three or five polling periods; when few failures occur, all data in the first memory space may be detected again after 10 or more polling periods. It is understood that this can be determined according to actual requirements and is not limited herein. Detecting all data in the first memory space again ensures, as far as possible, that the whole DDR5 memory is free of faults and improves the reliability of the memory data.
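One way to picture this schedule is the sketch below, which alternates between the narrowed second memory space and a full pass over the first memory space. The cut-off `many_failures` and the specific choice of 3 versus 10 narrowed periods are illustrative assumptions taken from the ranges mentioned above, and the function names are hypothetical.

```python
def full_scan_interval(failure_count: int, many_failures: int = 20) -> int:
    """Number of narrowed polling periods to run before re-checking the whole first memory space."""
    return 3 if failure_count >= many_failures else 10

def spaces_to_poll(narrow_periods: int, total_periods: int) -> list[str]:
    """Which memory space each successive polling period covers."""
    plan = []
    for period in range(total_periods):
        # After every `narrow_periods` narrowed periods, go back to the first memory space once.
        is_full = period % (narrow_periods + 1) == narrow_periods
        plan.append("first" if is_full else "second")
    return plan

plan = spaces_to_poll(full_scan_interval(failure_count=25), total_periods=8)
```

For 25 failures the plan covers the second memory space three times, then the first memory space once, and repeats.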
When the BMC or the OS executes polling and acquires fault information, the severity and the accurate position of a data fault in the memory are accurately positioned, the data fault of the DDR5 memory is sensed, an additional hardware circuit is not required to be added, the cost is reduced, and the method is suitable for more application scenes.
When the CPU or the BIOS acquires the failure information in the inspection, it is further required to send the failure information to the BMC or the OS, specifically as described in step 303:
303. Send the fault information to the BMC or the OS.
For example, after obtaining the fault information, the BIOS further sends the fault information to the BMC or the OS, that is, the system may sense that the data stored in the DDR5 memory has a fault, and may determine a specific location of the data having the fault.
In a possible implementation manner, the BMC or the OS may also serve as a fault diagnosis system and, after analyzing the received fault information, perform memory isolation repair on the memory address of the faulty data.
For example, please refer to the application scenario shown in fig. 4b, and fig. 4b is a schematic diagram of the application scenario provided in the embodiment of the present application. The BIOS sends the acquired fault information to the fault diagnosis system, and can also transmit the fault information to a target device for fault diagnosis and prediction to obtain fault feature information, where the target device includes a fault big data training center for data, or the fault diagnosis system may send the fault information to the target device, and the specific details are not limited herein. The fault big data training center in the target equipment completes fault feature research by an algorithm for realizing machine learning fault predictive reasoning based on a large amount of running data, defines a fault feature model and then outputs fault features such as fault severity and/or the fault feature model to the fault diagnosis system.
The target device may be, for example, a database, a server, or another computer device capable of running a large number of data algorithms, which is not limited herein. The fault diagnosis system then notifies the CPU to perform hard memory isolation repair based on the fault features. In one implementation, the memory isolation repair may be a hard isolation repair or a soft isolation repair, which is not limited herein. For example, in a hard isolation repair, the CPU cuts off the power or signal of the granule at the memory address of the failed data. Alternatively, the fault diagnosis system notifies the CPU, through the OS, to perform a soft isolation repair based on the fault features, for example by defining the granule at the memory address of the failed data as inaccessible or by identifying the granule as being in a fault state. It can be understood that other ways of achieving the same purpose are also possible, which are not limited herein.
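The soft isolation repair described above can be pictured with the following sketch, in which a granule is marked as being in a fault state and later accesses to it are refused. The mapping from an address to a granule by integer division with `granule_size` is a deliberate simplification and an assumption; real DIMMs map addresses to granules differently, and the class `IsolationTable` is hypothetical.

```python
class IsolationTable:
    """Tracks granules that have been soft-isolated."""

    def __init__(self, granule_size: int):
        self.granule_size = granule_size   # addresses per granule (simplified model)
        self.isolated = set()              # granules identified as being in a fault state

    def isolate(self, fault_address: int) -> int:
        """Mark the granule holding the faulty address as inaccessible."""
        granule = fault_address // self.granule_size
        self.isolated.add(granule)
        return granule

    def is_accessible(self, address: int) -> bool:
        """Report whether an address lies outside every isolated granule."""
        return address // self.granule_size not in self.isolated

table = IsolationTable(granule_size=0x100)
isolated_granule = table.isolate(0x1F40)
```

A hard isolation repair, by contrast, acts on the power or signal of the granule and has no software-only equivalent in this sketch.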
In a possible implementation manner, after the fault information is obtained, the BIOS further sends the fault information to the target device, so that the target device determines a target address based on the fault information, where fault big data of the DDR5 memory is stored on the target device.
For example, the target device determines a target address based on the fault information. Based on the target address, the target device may predict that data at the same target address in other DDR5 memories associated with the target device is also at risk of failure, and memory isolation repair may be performed synchronously on the target address of those DDR5 memories. The DDR5 memories covered by the prediction and the DDR5 memory that sent the fault information belong to the same product. In the embodiment of the application, the target device uses the target address to predict that other DDR5 memories of the same type carry the same risk and takes measures accordingly, which broadens the applicable scenarios, improves overall working efficiency, reduces the fault polling workload for each individual DDR5 memory, and ensures the reliability of the memory data as far as possible.
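As a rough illustration of the prediction described above, the following sketch selects the other DDR5 memories of the same product so that the same target address can be isolated on each of them. The inventory layout, the memory identifiers, and the function name `memories_to_isolate` are assumptions made for this example and do not come from the application.

```python
from typing import Dict, List, Tuple


def memories_to_isolate(reporting_id: str,
                        target_address: int,
                        inventory: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (memory_id, address) pairs for every other memory of the same product.

    `inventory` is a hypothetical mapping from memory identifier to product
    model, standing in for whatever records the target device keeps.
    """
    product = inventory[reporting_id]
    return sorted(
        (memory_id, target_address)
        for memory_id, model in inventory.items()
        if model == product and memory_id != reporting_id
    )
```

The memory that reported the fault is excluded because it is already handled by the isolation repair described earlier.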
In the embodiment of the application, the BMC or the OS serves as the fault diagnosis system, accurately determines the position of the failed data based on the fault information, and implements memory isolation repair, so that risky memory areas are avoided as far as possible and the reliability of data storage is ensured.
It should be noted that the foregoing describes the embodiment of the present application by taking only the BIOS as an example. It can be understood that the specific manner in which the CPU, the BMC, or the OS implements the present solution is similar to the foregoing BIOS example, and details are not described herein again.
In the embodiment of the application, whether the data stored in the first memory space fails is detected in the polling period. During polling, when the data fails, the fault information corresponding to the failed data is acquired from the memory mode register and sent to the BMC or the OS, where the fault information includes the memory address corresponding to the failed data. In this way, the exact position at which data stored in the DDR5 memory fails can be determined without an additional hardware circuit, which reduces cost and suits more application scenarios.
The following describes embodiments of the present application by taking an example of an application scenario in which the present solution is specifically implemented, so as to facilitate further understanding of the method provided by the present application.
For ease of understanding, the following description still uses the BIOS as an example. Refer to fig. 5, which is another schematic diagram of an application scenario in which the embodiment of the present application is implemented. The procedure is as follows:
First, step 501 is executed: the polling period is configured as 24 hours (h) in the BMC, BIOS, OS, or an application program of the computing device of fig. 2a. The BIOS then executes step 502 and, within the polling period, uses the ECS to detect whether the data stored in the first memory space has an ECS error. In step 503, the BIOS determines whether an ECS error exists. When there is no ECS error, step 502 continues to poll the first memory space until the entire first memory space has been traversed. When there is an ECS error, the BIOS executes step 504: it stops polling and reads the values of the MR16 to MR20 registers to obtain fault information, which includes the memory address corresponding to the failed data and a current row address error count value and/or a current column address error count value. Step 502 is then resumed to check the remaining data to be polled in the first memory space, until the data stored in the entire first memory space has been traversed. The BIOS further executes step 505 to send the obtained fault information to the BMC, so that the BMC executes step 506 to obtain fault features through the fault big data training center based on the fault information, and notifies the CPU to implement memory isolation repair based on the fault features.
In this application scenario, the BIOS polls the entire first memory space through the ECS, then acquires the fault information and sends it to the BMC. This safeguards data accuracy at high speed and high density in real time and locates the exact position of the failed data without an additional hardware circuit, which reduces cost. Memory isolation repair is also implemented, so that risky memory areas are avoided as far as possible and the reliability of data storage is ensured.
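The flow of steps 502 to 505 can be sketched in software as follows. This is only a minimal model under stated assumptions: the three callbacks (`has_ecs_error`, `read_mode_registers`, `send_to_bmc`) and the `FaultInfo` record are hypothetical stand-ins for the ECS check, the read of the MR16 to MR20 registers, and the BIOS-to-BMC channel, none of which are specified at this level of detail in the application. The sketch also assumes that fault information is sent after the traversal completes, which the description leaves open.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FaultInfo:
    address: int             # memory address of the failed data
    row_error_count: int     # current row address error count value
    column_error_count: int  # current column address error count value


def poll_first_memory_space(
    addresses: Iterable[int],
    has_ecs_error: Callable[[int], bool],
    read_mode_registers: Callable[[int], FaultInfo],
    send_to_bmc: Callable[[FaultInfo], None],
) -> List[FaultInfo]:
    """Traverse the first memory space once and report every fault found."""
    faults: List[FaultInfo] = []
    for address in addresses:                 # step 502: poll the next data
        if not has_ecs_error(address):        # step 503: is there an ECS error?
            continue                          # no error, keep polling
        info = read_mode_registers(address)   # step 504: pause and read fault info
        faults.append(info)                   # then resume with the remaining data
    for info in faults:                       # step 505: send fault info to the BMC
        send_to_bmc(info)
    return faults
```

Step 501 (configuring the 24 h polling period) and step 506 (the BMC obtaining fault features) sit outside this function and are not modeled here.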
As shown in fig. 6, an embodiment of the present application further provides a processing apparatus, which is applied in a computing device. Specifically, fig. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application. In a possible implementation, the processing apparatus 600 may include a module or a unit corresponding to one or more of the methods/operations/steps/actions implemented in fig. 3a in the foregoing method embodiment, where the unit may be a hardware circuit, software, or a combination of a hardware circuit and software. In one possible implementation, the processing apparatus 600 may include an obtaining unit 601, a processing unit 602, and a sending unit 603. The processing unit 602 may be configured to perform the step of detecting, in the polling period, whether the data stored in the first memory space fails; the obtaining unit 601 may be configured to perform the step of obtaining the fault information corresponding to the failed data from the memory mode register; and the sending unit 603 may be configured to perform the step of sending the fault information to the BMC or the OS, as described in the foregoing method embodiment.
In other possible designs, the obtaining unit 601, the processing unit 602, and the sending unit 603 may perform, in one-to-one correspondence, the methods/operations/steps/actions of the computing device in the foregoing method embodiments.
In a possible design, before detecting whether the data stored in the first memory space fails in the polling period, the processing unit 602 is further configured to configure the value of the polling period and the first memory space in the BIOS or the BMC, where the first memory space is less than or equal to the data storage space of the DDR5 memory.
In a possible design, the processing unit 602 is specifically configured to stop the polling when it is detected that the first data has a fault, and obtain at least a first memory address corresponding to the first data from the memory mode register, where the first data is data stored in the first memory space.
In a possible design, after stopping the polling upon detecting that the first data has a fault and obtaining at least the first memory address corresponding to the first data from the memory mode register, the processing unit 602 is further configured to continue polling second data in the first memory space, where the second data is data that remains to be polled.
In a possible design, the obtaining unit 601 is further configured to obtain, each time in the polling period, one piece of data in the first memory space and its corresponding first check code.
The processing unit 602 derives a corresponding second check code from the data, and then determines that the data has a fault if the first check code and the second check code corresponding to the data are different, or determines that the data has no fault if the first check code and the second check code are the same.
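The comparison of the two check codes can be illustrated with a short sketch. The application does not state which check code is used, so the XOR-of-bytes function below is purely a stand-in chosen for brevity; `compute_check_code` and `has_fault` are hypothetical names.

```python
from functools import reduce


def compute_check_code(data: bytes) -> int:
    """Stand-in check code: XOR of all bytes (an assumption, not the real ECC)."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


def has_fault(data: bytes, first_check_code: int) -> bool:
    """Recompute the check code from the data and compare it with the stored one."""
    second_check_code = compute_check_code(data)
    return second_check_code != first_check_code
```

A real memory would use a stronger code; a plain XOR misses some multi-bit errors, which is acceptable only because the point here is the compare-and-decide step.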
In a possible design, the processing unit 602 is further configured to, when it is detected that there is a failure in the data, perform error correction on the failed data to obtain target data.
The sending unit 603 is further configured to write the target data into the memory address of the failed data.
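To make the correct-then-write-back step concrete, the sketch below uses a Hamming(7,4) code, which can locate and flip a single erroneous bit. This code is an assumption picked because it is small enough to read at a glance; the application does not say which error-correcting code is applied, and the dictionary `memory` is a hypothetical stand-in for the memory addresses.

```python
from typing import Dict, List, Tuple


def encode(data_bits: List[int]) -> List[int]:
    """Encode 4 data bits as a Hamming(7,4) codeword (bit positions 1..7)."""
    c = [0] * 8  # index 0 is unused so that indices match bit positions
    c[3], c[5], c[6], c[7] = data_bits
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]


def correct(word: List[int]) -> Tuple[List[int], int]:
    """Return the corrected codeword and the position of the flipped bit (0 if none)."""
    c = [0] + list(word)
    syndrome = ((c[1] ^ c[3] ^ c[5] ^ c[7])
                + 2 * (c[2] ^ c[3] ^ c[6] ^ c[7])
                + 4 * (c[4] ^ c[5] ^ c[6] ^ c[7]))
    if syndrome:
        c[syndrome] ^= 1  # the syndrome is the position of the single-bit error
    return c[1:], syndrome


def scrub(memory: Dict[int, List[int]], address: int) -> bool:
    """Correct the word at `address` and write the target data back if it was faulty."""
    target_data, syndrome = correct(memory[address])
    if syndrome:
        memory[address] = target_data
    return bool(syndrome)
```

Calling `scrub` a second time on the same address returns `False`, since the corrected word has already been written back.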
In a possible design, the obtaining unit 601 is further configured to obtain the number of failures, where the number of failures is the number of failures occurring in the current polling period.
The processing unit 602 is further configured to adjust a value of the polling period based on the number of failures.
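The application does not give a rule for how the polling period follows the number of failures. One plausible policy, shown here only as an assumption, is to poll more often when many failures are seen and less often when none are seen, within fixed bounds. The threshold and the bounds are illustrative values, not values from the application.

```python
def adjust_period_hours(current_hours: float,
                        failures: int,
                        threshold: int = 10,
                        min_hours: float = 1.0,
                        max_hours: float = 168.0) -> float:
    """Return the next polling period given the failures seen in the current one."""
    if failures > threshold:
        proposed = current_hours / 2   # many failures: poll twice as often
    elif failures == 0:
        proposed = current_hours * 2   # no failures: poll half as often
    else:
        proposed = current_hours       # a few failures: keep the period
    # Clamp so the period never leaves the configured bounds.
    return max(min_hours, min(max_hours, proposed))
```

Starting from the 24 h period used in the fig. 5 scenario, 20 failures would shorten the period to 12 h under this policy, and a clean period would lengthen it to 48 h.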
In a possible design, the obtaining unit 601 is further configured to obtain, after at least two polling periods, a non-failure address interval in the first memory space, that is, an address interval in which no fault occurred during the at least two polling periods.
The processing unit 602 is further configured to determine that a memory space in the first memory space that does not include the non-failure address interval is a second memory space, and detect whether data stored in the second memory space fails in a next polling period.
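Determining the second memory space amounts to subtracting the non-failure address intervals from the first memory space. A minimal sketch follows, assuming that address ranges are represented as half-open `(start, end)` pairs; this representation and the function name `second_memory_space` are choices made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open range: start is included, end is not


def second_memory_space(first_space: Interval,
                        non_failure: List[Interval]) -> List[Interval]:
    """Return the parts of `first_space` not covered by any non-failure interval."""
    start, end = first_space
    remaining: List[Interval] = []
    cursor = start
    for lo, hi in sorted(non_failure):
        lo, hi = max(lo, start), min(hi, end)  # clip to the first memory space
        if lo >= hi:
            continue                           # interval lies outside the space
        if lo > cursor:
            remaining.append((cursor, lo))     # gap before this interval
        cursor = max(cursor, hi)
    if cursor < end:
        remaining.append((cursor, end))        # tail after the last interval
    return remaining
```

Only the returned ranges need to be checked in the next polling period, which shortens each traversal.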
In a possible design, the failure information further includes a current row address error count value and/or a current column address error count value, the current row address error count value is used for indicating the number of times of row errors of the memory address of the failed data, and the current column address error count value is used for indicating the number of times of column errors of the memory address of the failed data.
In a possible design, the sending unit 603 is further configured to send the fault information to a target device, so that the target device determines a target address based on the fault information, where the target device stores fault big data of the DDR5 memory.
For the beneficial effects of the processing apparatuses with various designs in the present application, please refer to the beneficial effects of the various implementation manners corresponding to one another in the method embodiment in fig. 3a, which are not described herein again in detail.
It should be noted that, the contents of information interaction, execution process, and the like between the modules/units in the processing apparatus corresponding to the embodiment in fig. 6 are based on the same concept as the method embodiment corresponding to fig. 3a in the present application, and specific contents may refer to the description in the foregoing method embodiment in the present application, and are not repeated herein.
In addition, the functional modules or units in the embodiments of the present application may be integrated into one processor, or may exist alone physically, or two or more modules or units may be integrated into one module or unit. The integrated modules or units may be implemented in the form of hardware, or may be implemented in the form of software functional modules.
Next, another computing device provided in the embodiment of the present application is introduced. Refer to fig. 7, which is a schematic structural diagram of the computing device provided in the embodiment of the present application. In particular, the computing device 700 includes a CPU 701, a memory 702, and a DDR5 memory 703, where the memory 702 may be transient or persistent storage. The program stored in the memory 702 may include one or more modules (not shown), each of which may include a series of instruction operations for the computing device 700. Further, the CPU 701 may be configured to communicate with the memory 702 and to execute, on the computing device 700, the series of instruction operations in the memory 702.
In this embodiment, the CPU 701 is configured to detect, in a polling period, whether data stored in a first memory space fails, where the polling period is a period for detecting the DDR5 memory, the first memory space belongs to the DDR5 memory, and the data comprises N bits of data, N being greater than or equal to 1. During polling, when a data fault is detected, the CPU 701 acquires fault information corresponding to the failed data from a memory mode register, where the fault information includes at least the memory address corresponding to the failed data, and the memory mode register is located in the DDR5 memory and is used for storing the mapping relation between the data to be polled and the memory address. The CPU 701 then sends the fault information to the BMC or the OS. In this way, the severity and exact position of a data fault in the DDR5 memory can be determined without an additional hardware circuit, which reduces cost and suits more application scenarios.
The method for detecting memory data failure provided in the embodiment of the present application is described in detail above, and a specific example is applied in the present application to explain the principle and the embodiment of the present application, and the description of the above embodiment is only used to help understand the method for detecting memory data failure and the core idea thereof in the present application. Meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
The embodiment of the present application further provides a computer-readable storage medium, which includes computer-readable instructions. When the computer-readable instructions are executed on a computer, the computer is caused to execute any implementation manner shown in the foregoing method embodiment.
The embodiments of the present application also provide a computer program product, which includes a computer program or instructions. When the computer program or instructions run on a computer, the computer is caused to execute any implementation manner shown in the foregoing method embodiments.
The present application also provides a chip or chip system, which may include a processor. The chip may further include or be coupled with a memory (or a storage module) and/or a transceiver (or a communication module), where the transceiver (or the communication module) may be used to support wired and/or wireless communication of the chip, and the memory (or the storage module) may be used to store a program or a set of instructions, which the processor may call to implement the operations performed by the terminal or the network device in the foregoing method embodiment or in any one of its possible implementations. The chip system may include the above chip, and may also include the above chip and other separate devices, such as a memory (or a storage module) and/or a transceiver (or a communication module).
It should be noted that the above-described embodiments of the apparatus are merely schematic, where units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units, may be located in one place, or may be distributed on multiple units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and can certainly also be implemented by special-purpose hardware, including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures implementing the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is preferable. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods of the embodiments of the present application.
Claims (10)
1. A method for detecting memory data failure, comprising:
detecting, in a polling period, whether data stored in a first memory space fails, wherein the polling period is a period for detecting a DDR5 memory, and the first memory space belongs to the DDR5 memory; the data comprises N bits of data, wherein N is greater than or equal to 1;
during polling, when a fault of the data is detected, obtaining fault information corresponding to the failed data from a memory mode register, wherein the fault information comprises at least a memory address corresponding to the failed data; the memory mode register is used for managing the DDR5 memory, and the memory mode register is used for storing the mapping relation between the data stored in the first memory space and the memory address;
and sending the fault information to an operating system (OS) or a baseboard management controller (BMC).
2. The method of claim 1, wherein before the detecting, in the polling period, whether the data stored in the first memory space fails, the method further comprises:
configuring the value of the polling period and the first memory space in a basic input/output system (BIOS) or the BMC; wherein the first memory space is smaller than or equal to the data storage space of the DDR5 memory.
3. The method according to claim 1 or 2, wherein the obtaining, during polling, when a fault of the data is detected, fault information corresponding to the failed data from the memory mode register comprises:
when a fault of first data is detected, stopping the polling, and obtaining at least a first memory address corresponding to the first data from the memory mode register; wherein the first data is data stored in the first memory space.
4. The method according to claim 3, wherein after the stopping the polling and obtaining at least the first memory address corresponding to the first data from the memory mode register when the fault of the first data is detected, the method further comprises:
continuing to poll second data in the first memory space, wherein the second data is data that remains to be polled.
5. The method of claim 1, wherein the detecting, in the polling period, whether the data stored in the first memory space fails comprises:
in the polling period, obtaining, each time, one piece of data in the first memory space and a corresponding first check code;
obtaining a corresponding second check code based on the data;
if the first check code and the second check code corresponding to the data are different, determining that the data has a fault;
or, if the first check code and the second check code corresponding to the data are the same, determining that the data has no fault.
6. The method according to any one of claims 1-5, further comprising:
obtaining a number of failures, wherein the number of failures is the number of failures occurring in the current polling period;
and adjusting the value of the polling period based on the number of failures.
7. The method according to any one of claims 1-6, wherein after at least two polling periods, the method further comprises:
obtaining a fault-free address interval in the first memory space over the at least two polling periods;
determining the memory space in the first memory space that excludes the fault-free address interval as a second memory space;
and in the next polling period, detecting whether the data stored in the second memory space fails.
8. The method of any of claims 1-7, wherein the fault information further comprises a current row address error count value and/or a current column address error count value, wherein the current row address error count value is used to indicate the number of row errors at the memory address of the failed data, and the current column address error count value is used to indicate the number of column errors at the memory address of the failed data.
9. The method of any of claims 1-8, wherein after the obtaining the fault information from the memory mode register, the method further comprises:
sending the fault information to a target device, so that the target device determines a target address based on the fault information; wherein the target device stores fault big data of the data.
10. A computing device, comprising: a DDR5 memory, and a processor coupled to a memory, wherein the memory is configured to store instructions, and the processor is configured to execute the instructions to perform the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210912646.1A CN115421948A (en) | 2022-07-30 | 2022-07-30 | Method for detecting memory data fault and related equipment thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115421948A true CN115421948A (en) | 2022-12-02 |
Family
ID=84195662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210912646.1A Pending CN115421948A (en) | 2022-07-30 | 2022-07-30 | Method for detecting memory data fault and related equipment thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115421948A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116069538A (en) * | 2023-02-21 | 2023-05-05 | 宁畅信息产业(北京)有限公司 | Fault repairing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11010273B2 (en) | Software condition evaluation apparatus and methods | |
US20200159635A1 (en) | Memory Fault Detection | |
US11163623B2 (en) | Serializing machine check exceptions for predictive failure analysis | |
US8732532B2 (en) | Memory controller and information processing system for failure inspection | |
US8856620B2 (en) | Dynamic graduated memory device protection in redundant array of independent memory (RAIM) systems | |
EP0032957B1 (en) | Information processing system for error processing, and error processing method | |
US10062451B2 (en) | Background memory test apparatus and methods | |
KR102378466B1 (en) | Memory devices and modules | |
CN111414268A (en) | Fault processing method and device and server | |
CN113223603A (en) | Memory refresh control method, device, control circuit and memory device | |
CN115421948A (en) | Method for detecting memory data fault and related equipment thereof | |
CN115640174A (en) | Memory fault prediction method and system, central processing unit and computing equipment | |
CN117909109A (en) | Memory error information processing method and computing device | |
KR101448013B1 (en) | Fault-tolerant apparatus and method in multi-computer for Unmanned Aerial Vehicle | |
CN117950895A (en) | Memory error information resetting method, computing device and baseboard management controller | |
CN117687823A (en) | Memory fault type determining method and server | |
US20200111539A1 (en) | Information processing apparatus for repair management of storage medium | |
US9753806B1 (en) | Implementing signal integrity fail recovery and mainline calibration for DRAM | |
WO2018010084A1 (en) | Esd testing device, integrated circuit, and method applicable in digital integrated circuit | |
US10846162B2 (en) | Secure forking of error telemetry data to independent processing units | |
CN118093293B (en) | Storage failure detection and repair method and device in vehicle gauge chip | |
CN115686901B (en) | Memory fault analysis method and computer equipment | |
US5418794A (en) | Error determination scan tree apparatus and method | |
JP5381151B2 (en) | Information processing apparatus, bus control circuit, bus control method, and bus control program | |
WO2008062511A1 (en) | Multiprocessor system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||