WO2023050927A1 - Memory detection method and apparatus - Google Patents

Memory detection method and apparatus

Info

Publication number
WO2023050927A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory
memory unit
detection
unit
Prior art date
Application number
PCT/CN2022/101243
Other languages
French (fr)
Chinese (zh)
Inventor
李玉伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023050927A1 publication Critical patent/WO2023050927A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of computer technology, and in particular to a memory detection method and apparatus in a computing device.
  • Memory is one of the more error-prone components of a server motherboard. With the doubling of memory capacity and memory speed, the probability of memory errors also doubles. How to better reduce memory errors is an important issue facing the new generation of servers.
  • A common memory detection scheme is to trigger memory detection when the server is powered on, so as to detect hard fail bits in the memory.
  • However, this method can only be executed when the server is powered on, so errors that occur at other times cannot be identified, which leads to low reliability of data processing in the server.
  • The present application provides a memory detection method and apparatus, which are used to improve the flexibility and timeliness of memory detection and to improve system stability.
  • The present application provides a memory detection method, which can be applied to a computing device.
  • The computing device includes a memory controller and a memory, and the memory includes a plurality of memory units.
  • The memory controller determines, based on the historical access record of the first memory unit, the access frequency of the first memory unit within a historical time period of a preset duration, where the historical access record can be used to record information such as access requests (such as read requests or write requests) to the first memory unit.
  • The memory controller judges whether the access frequency satisfies a preset condition; if it does, the memory controller performs memory detection on the first memory unit to determine the detection result of the first memory unit.
  • With the above method, whether to trigger memory detection can be determined based on the access frequency of the first memory unit, either at startup or during operation, so detection is no longer limited to a specific time. This provides flexibility in memory detection, allows memory faults to be detected in time, reduces the risk of system downtime, and improves system stability and reliability.
  • The preset condition includes that the access frequency of the first memory unit is not greater than a preset threshold.
  • When the access frequency of the first memory unit is not greater than the preset threshold, there is likely to be little or no new data access to the first memory unit in the near future.
  • Triggering memory detection on the first memory unit at this time reduces congestion of data access requests as much as possible, and improves the read and write performance of the system while memory detection is performed.
  • The memory controller performs memory detection on the first memory unit, and the memory detection may include data error detection. The detection process includes, for example:
  • The memory controller first reads the data stored in the first memory unit (that is, the first data) and stores the read first data in a first memory of the memory controller; the storage space in the first memory that holds the first data is called the first storage unit. The memory controller then checks the first data to determine whether it contains a data error. In the first case, there is no data error in the first data. In the second case, there is a correctable error in the first data; the memory controller corrects the correctable error to obtain error-corrected data (denoted as the second data) and writes the second data into the first storage unit. In the third case, there is an uncorrectable error in the first data; the memory controller sends error information, which indicates that there is an uncorrectable error in the first data.
  • Through the data error detection process, erroneous data in the first data can be detected, thereby improving the reliability of the data.
  • When performing data error detection, the memory controller may also perform hardware detection on the first memory unit. The detection process includes, for example:
  • The memory controller detects whether there is a hard fail bit in the first memory unit, where a hard fail bit is a bit whose read binary value differs from the binary value written to it. In the first case, there are one or more hard fail bits in the first memory unit, and the memory controller sends fault information, which indicates that there is a hard fail bit in the first memory unit or indicates the one or more hard fail bits detected in the first memory unit.
  • Through the hardware detection process, hard fail bits in the first memory unit can be detected, so that hardware faults of the memory are discovered in time, which reduces the probability of uncorrectable errors, reduces the risk of system downtime, and improves the stability and reliability of system operation.
  • The memory controller may also receive a read request for accessing the first memory unit. If a correctable error in the first data has been detected through the data error detection process, the error-corrected second data is sent to the processor in response to the read request. If no data error is detected in the first data, the first data stored in the first storage unit may be returned in response to the read request. If there is an uncorrectable error in the first data, error information indicating that there is an uncorrectable error in the first data may be returned.
  • The memory controller may also receive a write request for accessing the first memory unit. If no hard fail bit is detected in the first memory unit, the data carried in the write request can be written into the first memory unit. If a hard fail bit is detected in the first memory unit, the data carried in the write request is written into a second memory unit, which is a memory unit other than the first memory unit among the plurality of memory units included in the memory.
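  • The handling of read and write requests described above can be sketched as follows. This is a minimal illustration only: the names `DataStatus`, `UncorrectableDataError`, `serve_read`, and `choose_write_target` are hypothetical and do not appear in the application, and raising an exception merely stands in for returning error information.

```python
from enum import Enum


class DataStatus(Enum):
    """Outcome of data error detection on the first data (hypothetical names)."""
    NO_ERROR = 1
    CORRECTED = 2       # a correctable error was found and fixed
    UNCORRECTABLE = 3


class UncorrectableDataError(Exception):
    """Stands in for the error information returned to the requester."""


def serve_read(status: DataStatus, first_storage_unit: bytes) -> bytes:
    """Answer a read request from the controller-side copy of the data.

    The first storage unit holds the first data when no error was found,
    or the second (error-corrected) data after a correctable error.
    """
    if status is DataStatus.UNCORRECTABLE:
        raise UncorrectableDataError("first data contains an uncorrectable error")
    return first_storage_unit


def choose_write_target(first_unit: int, second_unit: int,
                        has_hard_fail_bit: bool) -> int:
    """Pick the memory unit that receives the data carried in a write request."""
    return second_unit if has_hard_fail_bit else first_unit
```

  • For example, `choose_write_target(3, 9, has_hard_fail_bit=True)` selects unit 9, which mirrors redirecting the write to a second memory unit when the first one has a hard fail bit.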
  • The memory controller performs memory detection on the first memory unit, where the memory detection may include hardware detection. The detection process includes, for example:
  • The memory controller first reads the data stored in the first memory unit (that is, the first data) and stores the read first data in a first memory of the memory controller; the storage space in the first memory that holds the first data is called the first storage unit. The memory controller then detects whether there is a hard fail bit in the first memory unit, where a hard fail bit is a bit whose read binary value differs from the binary value written to it. In the first case, there are one or more hard fail bits in the first memory unit, and the memory controller sends fault information indicating that there is a hard fail bit in the first memory unit, or indicating the one or more hard fail bits detected.
  • In the second case, the first memory unit has no hard fail bit. After it is determined through the data error detection process that the first data contains no error or only a correctable error, the first data, or the error-corrected second data, stored in the first storage unit is written back to the first memory unit.
  • Through the hardware detection process, hard fail bits in the first memory unit can be detected, so that hardware faults of the memory are discovered in time, which reduces the probability of uncorrectable errors, reduces the risk of system downtime, and improves the stability and reliability of system operation.
  • The method by which the memory controller detects whether there is a hard fail bit in the first memory unit includes:
  • The memory controller writes first detection data into the first memory unit and reads back the data stored in the first memory unit (the fourth data). For each bit position, it compares the bit value of the first detection data with the bit value at the same position in the fourth data; if they are the same, it determines that the bit of the first memory unit has no hard failure; otherwise, it determines that the bit has a hard failure.
  • The memory controller writes second detection data into the first memory unit and reads back the data stored in the first memory unit (the fifth data). It compares the bit values at the same positions in the second detection data and the fifth data to determine whether there is a hard fail bit in the first memory unit, where the second detection data is different from the first detection data.
  • Performing at least two hard fail bit detections on the same memory unit reduces the probability of a missed detection caused by the written value happening to equal the fixed output value of a hard fail bit.
  • the present application also provides a memory detection device, which has the function of realizing the behavior of the memory controller in the method example of the first aspect above.
  • the functions described above may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • The structure of the memory detection device includes a determination module and a detection module, and optionally a reading module and a communication module. These modules can perform the corresponding functions of the memory controller in the method example of the first aspect above; for details, refer to the detailed description in the method example, which is not repeated here.
  • the present application also provides a memory detection device, which has the function of realizing the behavior of the memory controller in the method example of the first aspect above, and the beneficial effects can be found in the description of the first aspect, which will not be repeated here.
  • the structure of the device includes a processor and a memory, and optionally, may also include a communication interface.
  • the processor is configured to support the memory detection device to execute the corresponding functions of the memory controller in the method of the first aspect above.
  • the memory is coupled to the processor and holds computer program instructions and data necessary for the communication device (eg at least one lock).
  • the structure of the memory detection device also includes a communication interface for communicating with other devices, such as receiving lock access requests.
  • the present application further provides a processor, where the processor includes a memory controller, and the memory controller is configured to realize the functions of the operation steps of the method in the first aspect or any possible implementation manner of the first aspect.
  • the present application also provides a computing device, the computing device includes a processor and a memory controller, and the memory controller has the function of realizing the behavior in the method example of the first aspect above, and the beneficial effects can be referred to in the first aspect The description will not be repeated here.
  • The present application also provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer executes the method described in the above first aspect and each possible implementation manner of the first aspect.
  • the present application further provides a computer program product including instructions, which, when run on a computer, cause the computer to execute the method described in the above first aspect and each possible implementation manner of the first aspect.
  • The present application also provides a computer chip. The chip is connected to a memory and is used to read and execute a software program stored in the memory, so as to implement the method described in the above first aspect and each possible implementation manner of the first aspect.
  • FIG. 1 is a schematic structural diagram of a server provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the architecture of a memory system;
  • FIG. 3 is a schematic flowchart corresponding to a memory detection method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of implementing a memory detection method provided in the embodiment of the present application in the time dimension;
  • FIG. 5 is a schematic flow diagram of a memory detection method provided in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a memory detection device provided in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another memory detection device provided by an embodiment of the present application.
  • The memory detection method provided by the embodiments of the present application can be applied to computing devices, such as servers, desktop computers, tablet computers, and mobile phones.
  • The embodiments of the present application do not limit the type of computing device; any device with memory is applicable to the embodiments of the present application.
  • The following uses a server as an example to introduce the technical solution of the embodiments of the present application; the server in the following text can be replaced by another computing device.
  • FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 10 includes at least a processor 110 and a memory 120 .
  • the processor 110 and the memory 120 are connected through a bus.
  • The processor 110 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits. Only one processor 110 is shown in FIG. 1. In practical applications, there are often multiple processors 110. When the processor 110 is a CPU, one CPU 110 may have one or more CPU cores. This embodiment does not limit the number of CPUs or the number of CPU cores.
  • the memory 120 can be used to temporarily store computer executable program codes and data.
  • the memory 120 has the characteristics of reading and writing data at any time, with a high speed, and can be used as a temporary data storage space for running applications.
  • Memory includes various types, such as dynamic random access memory (DRAM) and double data rate synchronous dynamic random access memory (DDR). In practical applications, multiple memories 120 and different types of memories 120 may be configured in the server 10. This embodiment does not limit the quantity and type of the memory 120.
  • a memory controller 1101 may also be provided in the processor 110 .
  • The memory controller 1101 is used for managing the memory 120 and communicating with the processor 110. In the server, data is exchanged between the processor 110 and the memory 120 through the memory controller 1101.
  • When the memory controller 1101 receives a data write request sent by the processor 110, it stores the data carried in the data write request in the memory 120.
  • When the memory controller 1101 receives a data read request sent by the processor 110, it may read data from the memory 120 according to the memory address carried in the data read request and return the read data to the processor 110.
  • FIG. 2 shows a schematic structural diagram of a memory system composed of a memory controller 1101 and a memory 120 .
  • the memory controller 1101 connects physical ranks through memory channels.
  • The granularity of data exchange between the memory controller 1101 and the memory 120 is called the minimum data unit, and the several memory blocks that together provide the minimum data unit form a rank. For example, when a rank includes 64 bits and each storage unit (or memory particle) in the memory 120 includes 8 bits, 8 storage units are required to form the 64 bits, that is, a rank includes 8 storage units. When the memory 120 includes 32 storage units, the memory 120 includes 4 ranks.
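  • The rank arithmetic in this example can be written out as a short calculation. The function names `units_per_rank` and `rank_count` are illustrative assumptions; the numbers are the ones used in the text.

```python
def units_per_rank(rank_bits: int, unit_bits: int) -> int:
    """Number of storage units needed to supply one rank-wide data unit."""
    if rank_bits % unit_bits != 0:
        raise ValueError("rank width must be a multiple of the storage unit width")
    return rank_bits // unit_bits


def rank_count(total_units: int, rank_bits: int, unit_bits: int) -> int:
    """Number of complete ranks formed by the available storage units."""
    return total_units // units_per_rank(rank_bits, unit_bits)


# Values from the example: a 64-bit rank built from 8-bit storage units,
# and a memory that contains 32 storage units.
print(units_per_rank(64, 8))   # storage units per rank
print(rank_count(32, 64, 8))   # ranks in the memory
```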
  • the memory controller 1101 may also be a device external to the processor 110 and connected to the processor 110 through a bus, and the memory controller 1101 may also implement the functions of the operation steps of the method described in this application.
  • the memory controller 1101 shown in FIG. 2 is located in the processor 110 as an example for illustration.
  • A cache 1102 can also be set in the memory controller 1101. The cache 1102 can be used to temporarily store data read from the memory 120, and can also be used to store the data carried in data write requests received by the memory controller 1101.
  • the cache 1102 can be a random access memory, static random access memory (static random access memory, SRAM) or DRAM, and the cache 1102 can also include other types of memory. The embodiment of the present application does not limit the type and quantity of the cache 1102.
  • the structure shown in FIG. 1 does not constitute a specific limitation on the server.
  • The server may include more or fewer components than shown in the figure; for example, the server may also include hard disks, BIOS components, etc., or some components may be combined, some components may be split, or the components may be arranged differently.
  • the memory controller 1101 may not be disposed in the processor 110, which is not limited in this embodiment of the present application.
  • FIG. 3 is a schematic flowchart corresponding to the memory detection method provided in the embodiment of the present application. The following takes execution of the method by the memory controller 1101 in FIG. 1 or FIG. 2 as an example. As shown in FIG. 3, the method includes:
  • Step 301: the memory controller 1101 monitors data access requests and records the monitored data access requests to obtain historical access records.
  • The data access requests include the aforementioned data write requests and data read requests.
  • The memory controller 1101 obtains data access requests from the processor 110, or obtains data access requests from other components such as BIOS components, which is not limited in this embodiment of the present application.
  • The memory controller 1101 generates or updates the historical access records based on the monitored data access requests.
  • The records can be used to record the access frequency of each memory unit.
  • The memory unit can be of a preset size; for example, the memory unit can be the aforementioned rank, or each memory unit can include multiple ranks. The embodiment of the present application does not limit the size of the memory unit. The following uses a rank as an example to introduce the historical access records.
  • the historical access record may include, but not limited to: a memory identifier (such as a memory address) for uniquely identifying the memory unit, and an access frequency of the memory unit. Referring to Table 1, Table 1 is an example of a historical access record provided by the embodiment of the present application.
  • The access frequency here may refer to the total access frequency within a historical time window, whose length may be a preset length (called, for example, the first preset length); alternatively, the access frequency may refer to the total cumulative access frequency counted so far, which is not limited in this embodiment of the present application.
  • The historical access record may also include one or more of the following: the latest access time, the type of data access request (such as data read request or data write request), and so on. It is worth noting that if the type of data access request is not recorded in the historical access records, as shown in Table 1, the access frequency in Table 1 refers to the total frequency with which the memory unit is read and written. That is, no distinction is made as to whether a monitored data access request is a data read request or a data write request; whenever the memory unit is accessed, the access frequency is increased by 1. In another implementation, the frequency with which the memory unit is read and/or the frequency with which the memory unit is written may be recorded separately.
  • The read access frequency is counted based on the monitored data read requests to the memory unit; similarly, the write access frequency is counted based on the monitored data write requests to the memory unit.
  • For ease of description, the following uses the total access frequency shown in Table 1, counted over both data read requests and data write requests without distinguishing between them, as an example.
  • the historical access record may also be used to record the access time of each data access request of the memory unit, as shown in Table 2, which is another example of the historical access record.
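  • One possible in-memory shape for such a historical access record is sketched below. It is an assumption for illustration, not the layout of Table 1 or Table 2: the class names, field names, and the use of an integer as the memory identifier are all hypothetical. It keeps separate read and write counts, from which the total access frequency is derived, plus the latest access time.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AccessRecord:
    """Per-memory-unit entry (hypothetical field names)."""
    reads: int = 0
    writes: int = 0
    latest_access_time: Optional[float] = None

    @property
    def total(self) -> int:
        # Total access frequency: reads and writes are not distinguished.
        return self.reads + self.writes


class HistoricalAccessRecords:
    """Maps a memory identifier (for example an address) to its record."""

    def __init__(self) -> None:
        self._records: Dict[int, AccessRecord] = {}

    def record(self, unit_id: int, is_write: bool, timestamp: float) -> None:
        """Update the record for one monitored data access request."""
        entry = self._records.setdefault(unit_id, AccessRecord())
        if is_write:
            entry.writes += 1
        else:
            entry.reads += 1
        entry.latest_access_time = timestamp

    def get(self, unit_id: int) -> AccessRecord:
        """Return the record for a unit, or an empty record if never accessed."""
        return self._records.get(unit_id, AccessRecord())
```

  • Keeping the read and write counters separate covers both variants the text mentions: the combined total and the separately recorded read and write frequencies.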
  • Step 302: the memory controller 1101 determines the access frequency of the first memory unit within a preset historical time period based on the historical access record.
  • The memory 120 includes a plurality of memory units, and the first memory unit is any one of the plurality of memory units. For ease of description, one memory unit (i.e., the first memory unit) is taken as an example below.
  • The memory controller 1101 can determine the access frequency of the first memory unit in a historical time period according to the historical access records, where the length of the historical time period can be a preset length (called, for example, the second preset length). As shown in FIG. 4, the historical time period may be a time period of the second preset length before T0, and the access frequency of the first memory unit within this historical time period is determined according to the historical access records.
  • The first preset length used in the aforementioned historical access records and the second preset length can be equal or, of course, different; for example, the first preset length may be longer than the second preset length, which is not limited in this embodiment of the present application.
  • The embodiment of the present application does not limit the duration of the period of time in the future.
  • Step 303: the memory controller 1101 judges whether the access frequency of the first memory unit satisfies a preset condition; if so, step 304 is executed; otherwise, the process exits.
  • The preset condition includes but is not limited to: the access frequency is not greater than a preset threshold.
  • If the access frequency determined in step 302 is less than or equal to the preset threshold, it is determined that there will be no new data access request for the first memory unit for a period of time in the future. As shown in FIG. 4, if the access frequency of the first memory unit in the historical time period whose length is the second preset length is less than or equal to the preset threshold, it is determined that there will be no new data access request for the first memory unit in a period of time after the current time T0, and memory detection on the first memory unit may be triggered at time T0 or at a certain time within a preset time range after T0 (see step 304).
  • The preset threshold may be an integer greater than or equal to 0. It can be understood that when the preset threshold is 0, the probability of collisions between memory detection and data access requests is reduced as much as possible.
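  • Steps 302 and 303 can be sketched as a windowed count followed by a threshold comparison. The sketch assumes the historical access record keeps a sorted list of access timestamps for the memory unit; the function names, the closed interval [T0 - window, T0], and the default threshold of 0 are illustrative assumptions.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def access_frequency(access_times: Sequence[float], t0: float, window: float) -> int:
    """Count accesses whose timestamp lies in [t0 - window, t0].

    `access_times` must be sorted in ascending order.
    """
    first = bisect_left(access_times, t0 - window)
    last = bisect_right(access_times, t0)
    return last - first


def should_trigger_detection(access_times: Sequence[float], t0: float,
                             window: float, threshold: int = 0) -> bool:
    """Preset condition: access frequency is not greater than the threshold."""
    return access_frequency(access_times, t0, window) <= threshold


times = [1.0, 2.0, 50.0]
print(should_trigger_detection(times, t0=100.0, window=10.0))  # idle window
print(should_trigger_detection(times, t0=55.0, window=10.0))   # one recent access
```

  • With a threshold of 0, detection is triggered only when the unit was completely idle during the window, which corresponds to the case the text describes as minimizing collisions with data access requests.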
  • Step 304: the memory controller 1101 performs memory detection on the first memory unit to determine the detection result of the first memory unit.
  • The memory detection methods include data detection and hardware detection. Data detection can be used to detect whether there is a data error in the data read from the first memory unit.
  • Hardware detection can be used to detect whether there is a hard fail bit in the first memory unit.
  • A hard fail bit is a memory bit with a hardware fault: regardless of whether the binary value written to the bit is 0 or 1, the output value of the bit is a fixed value, such as always 0 or always 1.
  • The three memory detection methods are introduced as follows:
  • Detection method 1: performing data detection and hardware detection on the first memory unit
  • FIG. 5 is a schematic flowchart of a method corresponding to the first detection mode.
  • FIG. 5 shows the flow of the data detection method (step 502-step 505) and the flow of the hardware detection method (step 506-step 514).
  • the process includes:
  • Step 501: the memory controller 1101 reads the data in the first memory unit (denoted as the first data) and stores the first data in the cache 1102 of the memory controller 1101.
  • The first data consists of the bit values stored in each bit of the first memory unit.
  • A cache line is the smallest unit of cached data in a cache (such as the cache 1102). The data stored in a rank can occupy one or more cache lines; for convenience of description, the following assumes that the data stored in the first memory unit occupies one cache line.
  • The cache space used for storing the first data in the cache 1102 is called the first cache line.
  • Step 502: the memory controller 1101 detects whether there is a data error in the first data; if yes, step 503 is executed; otherwise, step 512 is executed.
  • The first data includes information bits and check bits. Based on a check algorithm, the memory controller 1101 can use the check bits of the first data to check the information bits of the first data to detect whether there is a data error. Depending on whether the data error can be corrected, data errors are divided into correctable errors (CE) and uncorrectable errors (UCE).
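  • The text does not name the check algorithm. As a stand-in, the sketch below uses an extended Hamming (8,4) code, a textbook single-error-correcting, double-error-detecting scheme, to show how check bits can separate the no-error, CE, and UCE cases. The code choice, the bit layout, and the names `ErrorKind`, `encode`, and `check` are assumptions made for illustration only.

```python
from enum import Enum
from functools import reduce
from operator import xor
from typing import List, Tuple


class ErrorKind(Enum):
    NONE = "no error"
    CE = "correctable error"
    UCE = "uncorrectable error"


def encode(nibble: int) -> List[int]:
    """Encode 4 information bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2, 4 hold the Hamming
    check bits; indices 3, 5, 6, 7 hold the information bits.
    """
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = ((nibble >> i) & 1 for i in range(4))
    for p in (1, 2, 4):
        bits[p] = reduce(xor, (bits[i] for i in range(1, 8) if i & p and i != p))
    bits[0] = reduce(xor, bits[1:])
    return bits


def check(bits: List[int]) -> Tuple[ErrorKind, List[int]]:
    """Classify the codeword and return it with any single-bit error fixed."""
    # XOR of the indices of set bits is 0 for a valid Hamming codeword.
    syndrome = reduce(xor, (i for i in range(1, 8) if bits[i]), 0)
    overall = reduce(xor, bits)  # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        return ErrorKind.NONE, list(bits)
    if overall == 1:
        # Odd number of flipped bits: treat as a single-bit error at `syndrome`
        # (index 0 means the overall parity bit itself was flipped).
        fixed = list(bits)
        fixed[syndrome] ^= 1
        return ErrorKind.CE, fixed
    # Parity is consistent but the syndrome is not: two bits were flipped.
    return ErrorKind.UCE, list(bits)
```

  • In this toy code a single flipped bit is reported as a CE and repaired, while two flipped bits are reported as a UCE; production memory controllers use wider codes, but the three-way classification is the same idea.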
  • Step 503: the memory controller 1101 judges whether the data error is a CE; if yes, step 504 is executed; otherwise, step 505 is executed.
  • Step 504: the memory controller 1101 corrects the erroneous data in the first data to obtain the second data, and writes the second data back to the cache 1102.
  • As described above, the first data is stored in the first cache line, and the error-corrected second data can be written back into the first cache line.
  • Step 505: the memory controller 1101 reports the UCE information of the first data to an intelligent management unit (IMU).
  • The IMU is a physical core used to process the reported UCE information.
  • Step 506: the memory controller 1101 writes test data (such as first test data) into the first memory unit.
  • Step 507: the memory controller 1101 reads the data stored in the first memory unit (such as third data).
  • The third data can be read after a preset interval (called, for example, the third preset duration) following the write of the first test data, so as to verify the ability of the first memory unit to retain data correctly over time. Alternatively, the time at which the third data is read back may not be limited; for example, the third data may be read immediately after the first test data is written, which is not limited in this embodiment of the present application.
  • The same applies to similar steps below and will not be repeated.
  • Step 508: the memory controller 1101 compares the first test data with the third data to determine whether there is a hard fail bit in the first memory unit.
  • The bit values at each same position in the first test data and the third data are compared; if they are consistent, it is determined that the bit has no fault; otherwise, it is determined that the bit is a hard fail bit.
  • For example, the bit value of the first bit of the first test data is compared with the bit value of the first bit of the third data. Assuming that the bit value of the first bit of the first test data is 0, if the bit value of the first bit of the third data is also 0, the two are the same, and it is determined that the first bit of the first memory unit has no fault. If the bit value of the first bit of the first test data is 0 and the bit value of the first bit of the third data is 1, the two are different, and it is determined that the first bit of the first memory unit is a hard fail bit.
  • Step 509: the memory controller 1101 writes the second test data into the first memory unit.
  • The second test data is different from the first test data, where "different" may mean that the bit values at the same positions are different.
  • For example, the first test data is 001101011010 and the second test data is 110010100101. It should be understood that this is only an example, and the test data is not limited to this. For another example, when the size of the first memory unit is 8 bytes, the first test data may be 0x5A5A5A and the second test data may be 0xA5A5A5.
  • step 510 the memory controller 1101 reads data stored in the first memory unit (such as fourth data).
  • Step 511 the memory controller 1101 compares the second test data with the fourth data to determine whether there is a hard failure location in the first memory unit.
  • FIG. 5 shows that two hard fail bit detections are performed on the first memory unit, but this is not limited in this embodiment of the present application.
  • at least two hard fail bit detections can be performed on the same memory unit to determine the hard fail bits in the first memory unit. This reduces the probability of a missed detection caused by the value written happening to match the value that a hard fail bit outputs.
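Why a second pass with different test data reduces missed detections can be shown with a toy model. The sketch below assumes a hypothetical stuck-at fault (a bit that always reads back a fixed value); the class `StuckAtUnit` and the function `detect_hard_fail_bits` are illustrative names, not taken from the application. The two patterns are the example test data given above.

```python
class StuckAtUnit:
    """Toy memory unit in which some bits always read back a fixed value."""

    def __init__(self, width, stuck_at):
        self.width = width
        self.stuck_at = dict(stuck_at)  # bit position -> forced read value (0 or 1)
        self.value = 0

    def write(self, data):
        self.value = data & ((1 << self.width) - 1)

    def read(self):
        out = self.value
        for pos, forced in self.stuck_at.items():
            if forced:
                out |= 1 << pos
            else:
                out &= ~(1 << pos)
        return out


def detect_hard_fail_bits(unit, patterns):
    """Write each pattern, read it back, and collect every mismatching bit position."""
    failed = set()
    for pattern in patterns:
        unit.write(pattern)
        diff = pattern ^ unit.read()
        failed.update(pos for pos in range(unit.width) if (diff >> pos) & 1)
    return sorted(failed)


FIRST_TEST_DATA = 0b001101011010
SECOND_TEST_DATA = 0b110010100101  # every bit differs from FIRST_TEST_DATA

# Bit 3 is stuck at 1; FIRST_TEST_DATA also has a 1 there, so one pass misses it.
one_pass = detect_hard_fail_bits(StuckAtUnit(12, {3: 1}), [FIRST_TEST_DATA])
two_pass = detect_hard_fail_bits(StuckAtUnit(12, {3: 1}),
                                 [FIRST_TEST_DATA, SECOND_TEST_DATA])
```

Because the second pattern inverts every bit of the first, a bit stuck at either value disagrees with at least one of the two patterns in this model.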
  • Step 512 the memory controller 1101 judges whether there is a hard fail bit in the first memory unit, if not, execute step 513 , otherwise, execute step 514 .
  • step 513 the memory controller 1101 writes the data stored in the first cache line back to the first memory unit.
  • Step 514 the memory controller 1101 sends the detection result of the hard fail bit in the first memory unit to the IMU.
  • data detection and hardware detection are two independent detection methods, so there is no strict timing constraint between the data detection process (step 502 to step 505) and the hardware detection process (step 506 to step 514). For example, the data detection process and the hardware detection process can be executed in parallel, or the data detection process can be executed first and then the hardware detection process, or the hardware detection process can be executed first and then the data detection process. This is not limited in this embodiment of the present application. It can be understood that when the data detection process and the hardware detection process are executed in parallel, the total time consumed by memory detection can be shortened.
  • the above takes the first memory unit as an example to introduce the process of performing memory detection on a memory unit.
  • the remaining memory units may be all memory units in the memory 120 except the first memory unit, or may be specified part of memory units except the first memory unit, which is not limited in this embodiment of the present application.
  • during the memory detection process, a data access request (including a data read request and a data write request) for the first memory unit may also be received.
  • Taking a data access request from the processor 110 as an example, the following separately introduces the processing flows for a data read request and a data write request received during the memory detection process:
  • a data read request is received during the memory detection process
  • the data read request may be received at any timing of the memory detection process shown in FIG. 5 , which is not limited in this embodiment of the present application. No matter at what time the data read request is received, there are the following response methods based on whether there is a data error in the first data:
  • If it is determined in step 502 that the first data has no data error, then in response to the data read request, the memory controller 1101 sends the first data stored in the first cache line of the cache 1102 to the processor 110.
  • the data read request may be received before step 502 or after step 502. If it is received before step 502, the memory controller may wait until step 502 has been executed and respond to the data read request after determining that the first data has no data error, so as to ensure that correct data is returned to the processor 110 and improve data reliability. Similar points are not repeated below.
  • If it is determined in step 503 that there is a correctable error in the first data, then in response to the data read request, after writing the error-corrected second data into the first cache line in step 504, the memory controller 1101 sends the second data stored in the first cache line to the processor 110.
  • If it is determined in step 503 that there is an uncorrectable error in the first data, then in response to the data read request, the memory controller 1101 sends an error response.
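The three read-request responses above, plus the case where the request arrives before the check has finished, can be sketched as a small dispatch function. All names here (`CheckResult`, `respond_to_read`, the dictionary-shaped response) are hypothetical and chosen only for illustration; the application does not define a response format.

```python
from enum import Enum


class CheckResult(Enum):
    NO_ERROR = "no error"
    CORRECTABLE = "CE"       # the cache line already holds the corrected second data
    UNCORRECTABLE = "UCE"


def respond_to_read(check_result, cache_line):
    """Decide how to answer a read request that arrives during memory detection.

    check_result is None while the data check has not finished yet.
    cache_line is the content of the first cache line (first data, or second
    data once a correctable error has been fixed).
    """
    if check_result is None:
        # Check still running: hold the request until the result is known.
        return {"status": "deferred", "data": None}
    if check_result is CheckResult.UNCORRECTABLE:
        return {"status": "error", "data": None}
    # No error or corrected error: the cache line content is safe to return.
    return {"status": "ok", "data": cache_line}
```

In this sketch the request is served from the cache line rather than from the memory unit itself, which matches the description that the first cache line holds the data while the unit is under test.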
  • a data write request is received during the memory detection process
  • the data write request may be received at any timing of the memory detection process shown in FIG. 5 , which is not limited in this embodiment of the present application. Regardless of when the data write request is received, there are the following response methods based on whether the first memory unit has a hard fail bit:
  • If it is determined in step 512 that the first memory unit does not have a hard fail bit, the memory controller 1101 writes the data carried in the data write request (for example, fifth data) into the first memory unit.
  • the fifth data carried in the data write request may be stored in the cache 1102 first. In an optional implementation, the fifth data may be stored in the first cache line, so that when step 513 is executed, the fifth data can be directly written into the first memory unit.
  • the fifth data may also be stored in another cache line, such as a second cache line. After step 512 determines that the first memory unit has no hard fail bit, the fifth data is obtained from the second cache line and written into the first memory unit, and step 513 is not executed.
  • the fifth data carried in the data write request may also be directly written into the first memory unit, and step 510b is not performed.
  • If it is determined in step 512 that the first memory unit has a hard fail bit, then in response to the data write request, the memory controller 1101 writes the fifth data in the data write request into a new free memory unit in the memory 120 other than the first memory unit. The new memory unit may be a memory unit determined through memory detection to have no hard fail bit, or may be a spare memory unit corresponding to the first memory unit. This is not limited.
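The write-request handling above can be sketched as follows. The function `handle_write`, the dictionary that stands in for the memory, and the way free units are supplied are all assumptions made for illustration; the application does not specify how a replacement unit is chosen.

```python
def handle_write(units, target, free_units, hard_fail_bits, data):
    """Write data to the target unit, or to a replacement if the target is faulty.

    units          -- dict mapping a unit id to its stored value (stand-in for memory)
    target         -- id of the first memory unit
    free_units     -- ids of free units believed to have no hard fail bit
    hard_fail_bits -- positions of hard fail bits found in the target (empty if none)
    Returns the id of the unit that actually received the data.
    """
    if not hard_fail_bits:
        units[target] = data
        return target
    for candidate in free_units:
        if candidate != target:
            units[candidate] = data
            return candidate
    raise RuntimeError("no free memory unit available for the write")


memory = {"unit0": 0, "unit1": 0}
healthy_destination = handle_write(memory, "unit0", ["unit1"], [], 0x5A)
faulty_destination = handle_write(memory, "unit0", ["unit1"], [3], 0xA5)
```

A real controller would also have to remap the address so that later reads reach the replacement unit; that bookkeeping is outside this sketch.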
  • the first detection method can not only detect data errors in the data stored in the first memory unit and correct them in time when they exist, but also perform hardware detection on the first memory unit and find hard fail bits in time, reducing the risk of system downtime and improving the stability and reliability of system operation. It also provides a way to respond to data access requests during the memory detection process, reducing the delay of data access requests as much as possible.
  • Detection method two: only data detection is performed on the first memory unit.
  • during the data detection process, a data access request to the first memory unit may also be received. For the response to a data read request received during the memory detection process, refer to the description of the response to the data read request in detection method one, which is not repeated here.
  • for a data write request received during the data detection process, the memory controller 1101 may directly write the fifth data carried in the data write request into the first memory unit, and may terminate the data detection process.
  • the second detection method can correct the data stored in the first memory unit and provides a way to respond to data access requests during the data detection process, which reduces the delay of data access requests during memory detection as much as possible and improves data reliability.
  • Detection method three: only hardware detection is performed on the first memory unit.
  • For the process of performing hardware detection on the first memory unit, refer to the description of step 501 and step 506 to step 514 in FIG. 5 above; details are not repeated here.
  • during the hardware detection process, a data access request to the first memory unit may be received. For the response to a data write request received during the memory detection process, refer to the description of the response to the data write request in detection method one, which is not repeated here.
  • for a data read request received during the hardware detection process, the memory controller 1101 may send the first data stored in the first cache line to the processor 110.
  • the third detection method can perform hardware detection on the first memory unit to detect its hard fail bits, and provides a way to respond to data access requests during the hardware detection process, which reduces the delay of data access requests during memory detection as much as possible and improves data reliability.
  • in the above manner, memory detection of the first memory unit may be triggered to determine the detection result of the first memory unit. Whether to trigger memory detection can be determined based on the access frequency of the first memory unit during startup or operation, so as to reduce the delay of data access requests to the first memory unit as much as possible. Memory detection is no longer limited to a specific time, which provides flexibility in memory detection, allows memory faults to be detected in time, reduces the risk of system downtime, and improves the stability and reliability of system operation.
  • the embodiment of the present application further provides a memory detection device, which is used to execute the method performed by the memory controller in the method embodiment shown in FIG. 3 or FIG. 5 .
  • the memory detection device 600 includes a determination module 601 and a detection module 602.
  • it may also include a reading module 603 and a communication module 604.
  • the modules are connected to each other through communication paths.
  • a determination module 601, configured to determine the access frequency of the first memory unit within a historical time period of a preset duration based on the historical access record of the first memory unit, where the first memory unit is any one of the plurality of memory units and the historical access record is used to record information about access requests for the first memory unit. The determination module 601 is also configured to judge whether the access frequency satisfies the preset condition. For the specific implementation, refer to the description of step 302, which is not repeated here.
  • the detection module 602 is configured to perform memory detection on the first memory unit to determine a detection result of the first memory unit when the access frequency satisfies a preset condition.
  • the preset condition includes that the access frequency of the first memory unit is not greater than a preset threshold.
  • the reading module 603 is configured to read the first data stored in the first memory unit and store the first data in the first storage unit of the first memory. For the specific implementation, refer to the description of step 501 in FIG. 5, which is not repeated here.
  • the detection module 602 is configured to check whether there is a data error in the first data. If there is no error in the first data, the first data stored in the first storage unit is kept; or if the first data has a correctable error (CE), the detection module is also used to correct the first data to obtain the second data and write the second data into the first storage unit; or if the first data has an uncorrectable error (UCE), error information indicating the UCE is sent through the communication module.
  • For the specific implementation, refer to the description of steps 502 to 505 in FIG. 5, which is not repeated here.
  • the detection module 602 is also used to detect whether there is a hard fail bit in the first memory unit, where a hard fail bit is a bit for which the binary value written differs from the binary value read. If a hard fail bit exists, fault information is sent through the communication module, and the fault information is used to indicate one or more detected hard fail bits in the first memory unit; or if no hard fail bit exists and the first data has a CE, the detection module is further configured to write the second data stored in the first storage unit into the first memory unit; or if no hard fail bit exists and there is no error in the first data, the detection module is further configured to write the first data stored in the first storage unit into the first memory unit.
  • For the specific implementation, refer to the description of steps 506 to 514 in FIG. 5, which is not repeated here.
  • the communication module 604 is configured to receive a read request for requesting to read the data stored in the first memory unit. If the first data has a CE, the determination module 601 is also used to obtain the second data from the first storage unit in response to the read request and send the second data through the communication module 604; or if there is no error in the first data, the determination module 601 is also used to obtain the first data from the first storage unit in response to the read request and send the first data through the communication module 604.
  • the communication module 604 is configured to receive a write request, and the write request is used to request to write the third data carried in the write request into the first memory unit. If the first memory unit does not have a hard fail bit, the determination module 601 is also used to write the third data into the first memory unit; or if the first memory unit has a hard fail bit, the determination module 601 is also used to write the third data into a second memory unit, where the second memory unit is a memory unit other than the first memory unit among the plurality of memory units included in the memory.
  • FIG. 7 is a schematic diagram of a computing device 700 provided in an embodiment of the present application.
  • the computing device 700 includes a processor 701 (the processor 701 is provided with a memory controller 2), a communication interface 704, a storage medium 705, and a bus 706.
  • the processor 701 , the communication interface 704 and the storage medium 705 communicate through the bus 706 , or communicate through other means such as wireless transmission.
  • the memory medium 702 may be used to store computer-executable instructions
  • the memory controller 2 is used to execute the computer-executable instructions stored in the memory medium 702 .
  • the memory medium 702 stores computer-executable instructions, and the memory controller 2 can call the computer-executable instructions stored in the memory medium 702 to perform the following operations:
  • determining, based on the historical access record of the first memory unit, the access frequency of the first memory unit within a historical period of a preset duration, where the first memory unit is any one of the plurality of memory units and the historical access record is used to record information about access requests for the first memory unit;
  • when the access frequency satisfies a preset condition, performing memory detection on the first memory unit to determine a detection result of the first memory unit.
  • the processor 701 integrates a memory controller.
  • the processor 701 may be a CPU, for example, a processor of an X86 architecture or a processor of an ARM architecture.
  • the processor 701 can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, a system on chip (SoC), a graphics processing unit (GPU), an artificial intelligence (AI) chip, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the memory controller may also be a controller off-chip of the processor 701, configured to implement the same function as the above memory controller.
  • the memory medium 702 may include read-only memory and random-access memory, and provides instructions and data to the processor 701 .
  • Memory medium 702 may also include non-volatile random access memory.
  • Memory medium 702 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory. Many forms of RAM can be used, for example static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus RAM.
  • the memory medium 702 can also be a storage class memory SCM, and the SCM includes at least one of phase change memory PCM, magnetic random access memory MRAM, resistive random access memory RRAM, ferroelectric memory FRAM, fast NAND or nano random access memory NRAM .
  • the bus 706 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity of illustration, the various buses are all labeled as bus 706 in the figure.
  • the bus 706 can be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), etc.
  • the bus 706 can be divided into address bus, data bus, control bus and so on.
  • processor 701 is taken as an example in the computing device 700 shown in FIG.
  • the number of cores is not limited.
  • the memory controller 2 in the computing device 700 may correspond to the memory detection device 600 in the embodiment of the present application, and may correspond to the entity that performs the method in the embodiment shown in FIG. 3 or FIG. 5 of the present application. The above and other operations and/or functions of each module in the memory medium 702 are respectively intended to implement the corresponding processes of each method, and for the sake of brevity, details are not repeated here.
  • the present application further provides a memory controller, which is used to implement the operation steps of the methods shown in FIG. 3 to FIG. 5 , and details are not described here for brevity.
  • the present application provides a computer-readable storage medium in which computer programs or instructions are stored. When the computer programs or instructions are executed by a computing device, the computing device performs the method executed by the memory controller in the above method embodiments.
  • the present application provides a computer program product, the computer program product includes a computer program or instruction, when the computer program or instruction is executed by a computing device, the method performed by the memory controller in the above method embodiment is implemented .
  • the present application provides a chip, including at least one processor and an interface; the interface is used to provide program instructions or data for the at least one processor; the at least one processor is used to execute the program instructions to implement the method executed by the memory controller in the above method embodiment.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • When implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.
  • the various illustrative logic units and circuits described in the embodiments of the present application can be implemented by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, Discrete gate or transistor logic, discrete hardware components, or any combination of the above designed to implement or operate the described functions.
  • the general-purpose processor may be a microprocessor, and optionally, the general-purpose processor may also be any conventional processor, controller, microcontroller or state machine.
  • a processor may also be implemented by a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
  • the steps of the method or algorithm described in the embodiments of the present application may be directly embedded in hardware, a software unit executed by a processor, or a combination of both.
  • the software unit may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM or any other storage medium in the art.
  • the storage medium can be connected to the processor, so that the processor can read information from the storage medium, and can write information to the storage medium.
  • the storage medium can also be integrated into the processor.
  • the processor and storage medium can be provided in an ASIC.

Abstract

The present application provides a memory detection method and apparatus. The method can be applied to a computing device; the computing device comprises a memory controller and a memory; and in the method, the memory controller determines the access frequency of a first memory unit within a historical period of time on the basis of a historical access record of the first memory unit; and when the access frequency satisfies a preset condition, the memory controller performs memory detection on the first memory unit to determine a detection result of the first memory unit. According to the method, whether to trigger memory detection or not can be determined in combination with the access frequency of the first memory unit during startup or operation of the computing device, and the memory detection is no longer limited to a specific time, such that the flexibility of the memory detection is provided, a memory failure can be found in time, the risk of system crash is reduced, and the stability and reliability of system operation are improved.

Description

Memory detection method and apparatus
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202111162543.X, filed with the China Patent Office on September 30, 2021 and entitled "Memory detection method and apparatus", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of computer technologies, and in particular, to a memory detection method and apparatus in a computing device.
Background
Memory is one of the more error-prone components on a server mainboard. As memory capacity and memory speed double, the probability of memory errors also multiplies. How to better reduce memory errors is an important problem facing the new generation of servers.
Currently, a common memory detection solution is to trigger memory detection when the server is powered on, to detect hard fail bits in the memory. However, this approach can be performed only when the server is powered on, and memory errors that occur while the server is running cannot be identified. As a result, the reliability of data processing in the server is low.
Summary
This application provides a memory detection method and apparatus, to provide flexibility in memory detection, improve the timeliness of memory detection, and improve system stability.
In a first aspect, this application provides a memory detection method. The method may be applied to a computing device that includes a memory controller and a memory, where the memory includes a plurality of memory units. In the method, the memory controller determines, based on a historical access record of a first memory unit, the access frequency of the first memory unit within a historical time period of a preset duration, where the historical access record may be used to record information such as access requests (for example, read requests or write requests) for the first memory unit. The memory controller judges whether the access frequency satisfies a preset condition; if the access frequency satisfies the preset condition, the memory controller performs memory detection on the first memory unit to determine a detection result of the first memory unit.
With the above design, when the access frequency of the first memory unit satisfies the preset condition, the memory controller performs memory detection on the first memory unit to determine its detection result. With this method, whether to trigger memory detection can be decided based on the access frequency of the first memory unit at startup or during operation, so memory detection is no longer restricted to a specific time. This provides flexibility in memory detection, allows memory faults to be found in time, reduces the risk of system downtime, and improves the stability and reliability of system operation.
In a possible implementation, the preset condition includes that the access frequency of the first memory unit is not greater than a preset threshold.
With the above design, when the access frequency of the first memory unit is not greater than the preset threshold, data accesses to the first memory unit in the near future are also likely to be few, or there may be no new data access at all. Triggering memory detection on the first memory unit in this case minimizes congestion of data access requests, so that system read and write performance is taken into account alongside memory detection.
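The access-frequency trigger can be sketched with a sliding window over recorded access times. This is a minimal sketch under stated assumptions: the class `AccessHistory`, the function `should_trigger_detection`, the use of timestamps in seconds, and the requirement that accesses are recorded in time order are all illustrative choices, not details given in the application.

```python
from collections import deque


class AccessHistory:
    """Historical access record for one memory unit (timestamps in seconds)."""

    def __init__(self, window_seconds):
        self.window = window_seconds   # the preset duration of the historical period
        self.timestamps = deque()      # assumed to be appended in increasing order

    def record(self, timestamp):
        self.timestamps.append(timestamp)

    def frequency(self, now):
        """Number of recorded accesses inside the window that ends at `now`."""
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()  # drop accesses older than the window
        return len(self.timestamps)


def should_trigger_detection(history, now, threshold):
    """Preset condition: the access frequency is not greater than the threshold."""
    return history.frequency(now) <= threshold


history = AccessHistory(window_seconds=10.0)
for t in (1.0, 2.0, 3.0):
    history.record(t)

busy = should_trigger_detection(history, now=5.0, threshold=2)    # 3 accesses in window
quiet = should_trigger_detection(history, now=12.5, threshold=2)  # 1 access in window
```

Here detection is skipped while the unit is being accessed frequently and triggered once the recent access count falls to the threshold or below.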
In a possible implementation, the memory controller performs memory detection on the first memory unit. The memory detection may include data error detection, and the detection procedure includes the following:
The memory controller may first read the data stored in the first memory unit (the first data) and store the read first data in a first memory of the memory controller; the storage space in the first memory that stores the first data is referred to as the first storage unit. The memory controller then checks the first data to determine whether it contains a data error. Case 1: the first data has no data error. Case 2: the first data has a correctable error; the memory controller may correct the correctable error in the first data to obtain error-corrected data (denoted as second data) and write the second data into the first storage unit. Case 3: the first data has an uncorrectable error; the memory controller may send error information, which may be used to indicate that the first data has an uncorrectable error.
With the above design, erroneous data in the first data can be detected through the data error detection procedure, improving data reliability.
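The three outcomes of the data check (no error, correctable error, uncorrectable error) can be illustrated with a toy error-correcting code. The application does not say which code is used; the sketch below assumes an extended Hamming (8,4) code purely for illustration, and the names `encode`, `check`, and `Status` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    NO_ERROR = 0
    CE = 1    # correctable error (single bit flipped)
    UCE = 2   # uncorrectable error (two bits flipped)


DATA_POSITIONS = (3, 5, 6, 7)
PARITY_POSITIONS = (1, 2, 4)


def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    syndrome = 0
    for pos in DATA_POSITIONS:
        if bits[pos]:
            syndrome ^= pos
    # Choose the parity bits so that the XOR of all set positions becomes 0.
    for pos in PARITY_POSITIONS:
        bits[pos] = 1 if syndrome & pos else 0
    bits[0] = sum(bits[1:]) % 2  # overall parity bit
    return bits


def check(codeword):
    """Return (status, corrected codeword or None)."""
    syndrome = 0
    for pos in range(1, 8):
        if codeword[pos]:
            syndrome ^= pos
    overall = sum(codeword) % 2
    if syndrome == 0 and overall == 0:
        return Status.NO_ERROR, list(codeword)
    if overall == 1:
        corrected = list(codeword)
        corrected[syndrome] ^= 1  # syndrome 0 means the overall parity bit flipped
        return Status.CE, corrected
    return Status.UCE, None       # nonzero syndrome with even parity: double error


first_data = encode(0b1011)
single_flip = list(first_data)
single_flip[5] ^= 1
double_flip = list(single_flip)
double_flip[6] ^= 1
```

In terms of the description, `check(single_flip)` returning the corrected codeword plays the role of producing the second data, and a `Status.UCE` result is where the error information would be sent.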
In a possible implementation, after the memory controller stores the first data in the first storage unit, it may also perform hardware detection on the first memory unit while performing data error detection. The detection procedure includes the following:
The memory controller detects whether the first memory unit has a hard fail bit, a hard fail bit being a bit for which the binary value written differs from the binary value read. Case 1: one or more hard fail bits exist in the first memory unit; the memory controller sends fault information, which is used to indicate that the first memory unit has a hard fail bit, or to indicate the one or more hard fail bits detected in the first memory unit. Case 2: the first memory unit has no hard fail bit; after the data error detection procedure determines that the first data has no error or has a correctable error, the first data stored in the first storage unit, or the error-corrected second data, is written back to the first memory unit.
With the above design, hard fail bits in the first memory unit can be detected through the hardware detection procedure, so that hardware faults in the memory are found in time, the probability of uncorrectable errors is lowered, the risk of system downtime is reduced, and the stability and reliability of system operation are improved.
In a possible implementation, after storing the first data in the first storage unit, the memory controller may also receive a read request for accessing the first memory unit.
If the data error detection flow detects a correctable error in the first data, the error-corrected second data is sent to the processor. If the data error detection flow detects no data error in the first data, the first data stored in the first storage unit may be returned in response to the read request. If the first data contains an uncorrectable error, error information indicating that the first data contains an uncorrectable error may be returned.
With this design, the memory detection method can respond flexibly to data read requests, and data reliability is improved.
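The three responses to a read request that arrives during detection can be sketched as follows. This is only an illustration: the names `CheckResult` and `respond_to_read`, and the tuple-shaped return value, are assumptions made for the example and do not appear in the application.

```python
from enum import Enum


class CheckResult(Enum):
    """Outcome of the data error detection flow (hypothetical names)."""
    NO_ERROR = "no error"
    CORRECTABLE = "correctable error"
    UNCORRECTABLE = "uncorrectable error"


def respond_to_read(result, first_data, second_data=None):
    """Choose the response to a read request for the first memory unit.

    first_data is the copy held in the first storage unit; second_data is
    the error-corrected copy, available only when result is CORRECTABLE.
    """
    if result is CheckResult.NO_ERROR:
        return ("data", first_data)
    if result is CheckResult.CORRECTABLE:
        return ("data", second_data)
    # Uncorrectable error: return error information instead of data.
    return ("error", "first data contains an uncorrectable error")
```

In this sketch the request is served from the controller's own storage in the first two cases, so the memory unit under test does not need to be accessed to answer the read.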
In a possible implementation, after storing the first data in the first storage unit, the memory controller may also receive a write request for accessing the first memory unit.
If the hardware detection flow detects no hard fail bit in the first memory unit, the data carried in the write request may be written into the first memory unit. If a hard fail bit is detected in the first memory unit, the data carried in the write request is written into a second memory unit, which is one of the plurality of memory units included in the memory other than the first memory unit.
With this design, the memory detection method can respond flexibly to data write requests, and system stability and data reliability are improved.
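The write-request handling above amounts to choosing a destination unit. The sketch below is a minimal illustration with hypothetical names (`route_write`, `hard_fail_units`). The application only requires that the second memory unit be a unit other than the first one; picking the first unit that is not known to have hard fail bits is an assumption added here.

```python
def route_write(target_unit, hard_fail_units, all_units):
    """Return the memory unit that should receive the write data.

    target_unit:     unit addressed by the write request (first memory unit)
    hard_fail_units: set of units in which hard fail bits were detected
    all_units:       all memory units of the memory, in a fixed order
    """
    if target_unit not in hard_fail_units:
        return target_unit
    # Assumption: redirect to the first other unit with no known hard fail bit.
    for unit in all_units:
        if unit != target_unit and unit not in hard_fail_units:
            return unit
    raise RuntimeError("no healthy memory unit available")
```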
In a possible implementation, the memory controller performs memory detection on the first memory unit. The memory detection may include hardware detection, and the detection flow includes, for example:
The memory controller may first read the data stored in the first memory unit (the first data) and store it in a first memory of the memory controller; the storage space in the first memory that holds the first data is called the first storage unit. The memory controller then detects whether the first memory unit contains a hard fail bit, that is, a bit for which the binary value read back differs from the binary value written. Case 1: the first memory unit contains one or more hard fail bits, in which case the memory controller sends fault information indicating that the first memory unit contains a hard fail bit, or indicating the one or more hard fail bits detected in the first memory unit. Case 2: the first memory unit contains no hard fail bit, in which case, once the data error detection flow has determined that the first data contains no error or only a correctable error, the first data or the error-corrected second data held in the first storage unit is written back to the first memory unit.
With this design, the hardware detection flow can detect hard fail bits in the first memory unit, so that memory hardware faults are discovered in time. This lowers the probability of uncorrectable errors, reduces the risk of system downtime, and improves the stability and reliability of system operation.
In a possible implementation, the method by which the memory controller detects whether the first memory unit contains a hard fail bit includes:
The memory controller writes first detection data into the first memory unit and reads back fourth data stored in the first memory unit. For each bit position, it compares the bit values of the first detection data and the fourth data at that position: if they are the same, it determines that this bit of the first memory unit has no hard failure; otherwise, it determines that this bit has a hard failure. The memory controller then writes second detection data into the first memory unit and reads back fifth data stored in the first memory unit, and compares the bit values of the second detection data and the fifth data at the same positions to determine whether the first memory unit contains a hard fail bit. The second detection data differs from the first detection data.
With this design, hard fail bit detection is performed at least twice on the same memory unit to determine the hard fail bits in the first memory unit. This reduces the probability of a missed detection that arises when the test value happens to match the value that a hard fail bit outputs.
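Why a second pass with different detection data helps can be seen in a small simulation. The sketch below is illustrative only: `StuckAtUnit` is a hypothetical software model of a memory unit whose faulty bits always read back a fixed value, and using all-zeros followed by all-ones as the two detection patterns is an assumption (the application only requires that the two patterns differ).

```python
class StuckAtUnit:
    """Simulated memory unit in which some bits always read a fixed value."""

    def __init__(self, width, stuck=None):
        self.width = width
        self.stuck = dict(stuck or {})  # bit position -> fixed value (0 or 1)
        self.value = 0

    def write(self, value):
        self.value = value & ((1 << self.width) - 1)

    def read(self):
        out = self.value
        for pos, bit in self.stuck.items():
            out = (out | (1 << pos)) if bit else (out & ~(1 << pos))
        return out


def find_hard_fail_bits(unit, patterns):
    """Write each pattern, read it back, and collect mismatching positions."""
    failed = set()
    for pattern in patterns:
        unit.write(pattern)
        diff = unit.read() ^ pattern  # 1 wherever read-back differs
        failed.update(pos for pos in range(unit.width) if (diff >> pos) & 1)
    return sorted(failed)


unit = StuckAtUnit(8, stuck={2: 0, 5: 1})
print(find_hard_fail_bits(unit, [0x00]))        # [5]: bit 2 is missed
print(find_hard_fail_bits(unit, [0x00, 0xFF]))  # [2, 5]
```

With only the all-zeros pattern, the bit stuck at 0 returns exactly what was written and goes unnoticed; the second pattern exposes it.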
In a second aspect, this application further provides a memory detection apparatus that has the function of implementing the behavior of the memory controller in the method examples of the first aspect. For beneficial effects, refer to the description of the first aspect; details are not repeated here. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function. In a possible design, the structure of the memory detection apparatus includes a determining module and a detection module and, optionally, a reading module and a communication module. These modules can perform the corresponding functions of the memory controller in the method examples of the first aspect; for details, refer to the detailed description in the method examples, which is not repeated here.
In a third aspect, this application further provides a memory detection apparatus that has the function of implementing the behavior of the memory controller in the method examples of the first aspect. For beneficial effects, refer to the description of the first aspect; details are not repeated here. The structure of the apparatus includes a processor and a memory and, optionally, a communication interface. The processor is configured to support the memory detection apparatus in performing the corresponding functions of the memory controller in the method of the first aspect. The memory is coupled to the processor and stores the computer program instructions and data (for example, at least one lock) necessary for the communication apparatus. The structure of the memory detection apparatus further includes a communication interface for communicating with other devices, for example to receive a lock access request.
In a fourth aspect, this application further provides a processor. The processor includes a memory controller, and the memory controller is configured to implement the functions of the operation steps of the method in the first aspect or in any possible implementation of the first aspect.
In a fifth aspect, this application further provides a computing device. The computing device includes a processor and a memory controller, and the memory controller has the function of implementing the behavior in the method examples of the first aspect. For beneficial effects, refer to the description of the first aspect; details are not repeated here.
In a sixth aspect, this application further provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method described in the first aspect and in each possible implementation of the first aspect.
In a seventh aspect, this application further provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method described in the first aspect and in each possible implementation of the first aspect.
In an eighth aspect, this application further provides a computer chip. The chip is connected to a memory and is configured to read and execute a software program stored in the memory, so as to perform the method described in the first aspect and in each possible implementation of the first aspect.
On the basis of the implementations provided in the foregoing aspects, this application may further combine them to provide additional implementations.
For the beneficial effects of the second to seventh aspects and their implementations, refer to the description of the beneficial effects of the method of the first aspect and its implementations.
Brief Description of Drawings
FIG. 1 is a schematic structural diagram of a server according to an embodiment of this application;
FIG. 2 is a schematic architectural diagram of a memory system;
FIG. 3 is a schematic flowchart of a memory detection method according to an embodiment of this application;
FIG. 4 is a schematic diagram of how a memory detection method according to an embodiment of this application is carried out over time;
FIG. 5 is a schematic flowchart of a memory detection method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a memory detection apparatus according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of another memory detection apparatus according to an embodiment of this application.
Detailed Description
The memory detection method provided in the embodiments of this application can be applied to computing devices such as servers, desktop computers, tablet computers, and mobile phones. The embodiments of this application do not limit the type of computing device; any device that has memory is suitable. The following uses a server as an example to describe the technical solutions of the embodiments, and every mention of a server below can be replaced with a computing device.
FIG. 1 is a schematic structural diagram of a server according to an embodiment of this application. As shown in FIG. 1, the server 10 includes at least a processor 110 and a memory 120. The processor 110 and the memory 120 are connected through a bus.
The processor 110 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more suitably configured integrated circuits. FIG. 1 shows only one processor 110; in practice there are often multiple processors 110. When the processor 110 is a CPU, one CPU 110 may have one or more CPU cores. This embodiment does not limit the number of CPUs or the number of CPU cores.
The memory 120 may be used to temporarily store computer-executable program code and data. The memory 120 can be read and written at any time and is fast, so it can serve as temporary data storage for running applications. The memory may be of various types, for example dynamic random access memory (DRAM) or double data rate synchronous dynamic random access memory (DDR). In practice, the server 10 may be configured with multiple memories 120 and with memories 120 of different types. This embodiment does not limit the number or type of the memories 120.
In an embodiment, a memory controller 1101 may also be provided in the processor 110. The memory controller 1101 manages the memory 120 and communicates with the processor 110; within the server, the processor 110 and the memory 120 exchange data through the memory controller 1101. For example, when the memory controller 1101 receives data write requests sent by the processor 110, it stores the data carried in those requests in the memory 120. As another example, when the memory controller 1101 receives a data read request sent by the processor 110, it may read data from the memory 120 according to the memory address carried in the request and return the data to the processor 110.
Refer to FIG. 2, which is a schematic structural diagram of the memory system formed by the memory controller 1101 and the memory 120. In hardware, the memory controller 1101 connects to physical ranks through memory channels. The granularity at which the memory controller 1101 exchanges data with the memory 120 is called the minimum data unit, and the group of memory blocks that together supply one minimum data unit is a rank. For example, if a rank is 64 bits wide and each storage unit (or memory chip) in the memory 120 is 8 bits wide, eight storage units are needed to form the 64 bits, that is, one rank includes eight storage units. If the memory 120 includes 32 storage units, the memory 120 includes four ranks.
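The rank arithmetic in this example can be written out as a short calculation. The function names below are hypothetical and serve only to restate the numbers given in the text.

```python
def units_per_rank(rank_width_bits, unit_width_bits):
    """Number of storage units needed to supply one rank-wide data unit."""
    if rank_width_bits % unit_width_bits:
        raise ValueError("rank width must be a multiple of the unit width")
    return rank_width_bits // unit_width_bits


def rank_count(total_units, rank_width_bits, unit_width_bits):
    """Number of ranks formed by the storage units of one memory."""
    return total_units // units_per_rank(rank_width_bits, unit_width_bits)


print(units_per_rank(64, 8))  # 8 storage units per 64-bit rank
print(rank_count(32, 64, 8))  # 4 ranks in a memory with 32 storage units
```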
In a possible implementation, the memory controller 1101 may instead be a component outside the processor 110 that is connected to the processor 110 through a bus; the memory controller 1101 can likewise implement the functions of the operation steps of the method described in this application. For ease of description, the following embodiments assume that the memory controller 1101 is located in the processor 110, as shown in FIG. 2.
In an implementation, a cache 1102 may also be provided in the memory controller 1101. The cache 1102 may be used to temporarily store data read from the memory 120, and may also be used to store the data carried in data write requests received by the memory controller 1101. The cache 1102 may be a random access memory, for example a static random access memory (SRAM) or a DRAM, and may also include other types of memory. The embodiments of this application do not limit the type or number of the caches 1102.
Note that: (1) The structure shown in FIG. 1 does not constitute a specific limitation on the server. In other embodiments of this application, the device may include more or fewer components than shown (for example, the server may further include a hard disk, a BIOS component, and so on), may combine some components, may split some components, or may arrange the components differently. (2) The memory controller 1101 need not be located in the processor 110; this is not limited in the embodiments of this application.
The following describes the memory detection method provided in the embodiments of this application with reference to FIG. 3, using the server architecture shown in FIG. 1 and FIG. 2 as an example. FIG. 3 is a schematic flowchart of the memory detection method. The following assumes that the method is performed by the memory controller 1101 in FIG. 1 or FIG. 2. As shown in FIG. 3, the method includes:
Step 301: The memory controller 1101 monitors data access requests and records the monitored data access requests to obtain a historical access record.
Specifically, data access requests include the data write requests and data read requests described above. The memory controller 1101 can obtain data access requests in several ways, for example from the processor 110 or from other components such as a BIOS component. This is not limited in the embodiments of this application.
The memory controller 1101 generates or updates the historical access record based on the monitored data access requests. For example, the record may hold the access frequency of each memory unit. A memory unit may have a preset size: it may be a rank as described above, or each memory unit may include multiple ranks, and so on. The embodiments of this application do not limit the size of a memory unit; the following uses a rank as an example to describe the historical access record. For any memory unit, the historical access record may include, but is not limited to, a memory identifier that uniquely identifies the memory unit (such as a memory address) and the access frequency of the memory unit. Table 1 shows an example of a historical access record provided in an embodiment of this application.
Table 1 Historical access record
Memory unit    Access frequency
rank_0         20
rank_1         8
rank_2         1
rank_3         n (n is an integer greater than or equal to 0)
The access frequency here may be the total access frequency within a historical time window, whose length may be a preset length (referred to as the first preset length), or it may be the total cumulative access frequency counted so far. This is not limited in the embodiments of this application.
Optionally, the historical access record may further include one or more of the following: the most recent access time, the type of data access request (data read request or data write request), and so on. Note that if the historical access record does not record the type of data access request, as in Table 1, the access frequency in Table 1 is the total frequency with which the memory unit is read and written. That is, no distinction is made between data read requests and data write requests: whenever the memory unit is accessed, its access frequency is incremented by 1. In another implementation, the frequency with which the memory unit is read and/or the frequency with which it is written may be recorded separately. The read frequency is counted from the monitored data read requests for the memory unit, and the write frequency is likewise counted from the monitored data write requests for the memory unit. For ease of description, the following does not distinguish data read requests from data write requests and uses the total access frequency counted from both, as in Table 1.
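A historical access record of the kind shown in Table 1, with the optional per-type split, could be kept with two counters. The sketch below is one possible software model; the class name `AccessHistory` and the string labels "read" and "write" are assumptions made for the example.

```python
from collections import Counter


class AccessHistory:
    """Per-memory-unit access counts, in total and split by request type."""

    def __init__(self):
        self.total = Counter()    # unit -> total access frequency
        self.by_type = Counter()  # (unit, "read" | "write") -> frequency

    def record(self, unit, kind):
        """Register one monitored data access request for a memory unit."""
        if kind not in ("read", "write"):
            raise ValueError("kind must be 'read' or 'write'")
        self.total[unit] += 1
        self.by_type[(unit, kind)] += 1


history = AccessHistory()
history.record("rank_0", "read")
history.record("rank_0", "write")
history.record("rank_1", "read")
print(history.total["rank_0"])              # 2
print(history.by_type[("rank_0", "read")])  # 1
print(history.total["rank_3"])              # 0 (never accessed)
```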
In another optional approach, the historical access record may also record the access time of each data access request for a memory unit. Table 2 shows another example of a historical access record.
Table 2 Historical access record
Memory unit    Access time
rank_0         10:01
rank_0         10:02
rank_0         10:03
rank_0         10:08
Note that the historical access records shown in Table 1 and Table 2 are only examples and do not limit the embodiments of this application; any form is acceptable as long as the access frequency of a memory unit within a period of time can be derived from it.
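With per-request timestamps as in Table 2, the access frequency within a period can be derived by counting the entries that fall inside a window. The sketch below is illustrative: the function name, the half-open window `[now - window, now)`, and the calendar date attached to the Table 2 times are all assumptions.

```python
from datetime import datetime, timedelta


def access_frequency(records, unit, now, window):
    """Count accesses to `unit` in the window [now - window, now)."""
    start = now - window
    return sum(1 for u, t in records if u == unit and start <= t < now)


day = datetime(2023, 1, 1)  # arbitrary date; Table 2 gives only clock times
records = [
    ("rank_0", day.replace(hour=10, minute=1)),
    ("rank_0", day.replace(hour=10, minute=2)),
    ("rank_0", day.replace(hour=10, minute=3)),
    ("rank_0", day.replace(hour=10, minute=8)),
]
now = day.replace(hour=10, minute=10)
print(access_frequency(records, "rank_0", now, timedelta(minutes=5)))   # 1
print(access_frequency(records, "rank_0", now, timedelta(minutes=10)))  # 4
```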
Step 302: The memory controller 1101 determines, based on the historical access record, the access frequency of the first memory unit within a historical time period of preset length.
The memory 120 includes a plurality of memory units, and the first memory unit is any one of them. A single memory unit (the first memory unit) is used here as an example.
The memory controller 1101 can determine, from the historical access record, the access frequency of the first memory unit within a historical time period. The length of this historical time period may be a preset length (referred to as the second preset length). As shown in FIG. 4, the historical time period may be the period of the second preset length immediately before T0, and the access frequency of the first memory unit within that period is determined from the historical access record.
Note: (1) To make it easy to determine the access frequency of a memory unit within a historical time period of the second preset length, the first preset length of the historical access record and the second preset length may be equal. They may also differ, for example with the first preset length greater than the second preset length. This is not limited in the embodiments of this application. (2) The embodiments of this application do not limit the length of the future period of time.
Step 303: Determine whether the access frequency of the first memory unit satisfies a preset condition. If it does, perform step 304; otherwise, exit the flow.
For example, the preset condition includes, but is not limited to, the access frequency being no greater than a preset threshold.
If the access frequency determined in step 302 is less than or equal to the preset threshold, it is determined that there will be no new data access request for the first memory unit for a period of time in the future. In FIG. 4, for example, if the access frequency of the first memory unit within the historical time period of the second preset length is less than or equal to the preset threshold, it is determined that no new data access request for the first memory unit will arrive during a period of time after the current moment T0, and memory detection on the first memory unit can be triggered at T0 or at some moment within a preset time range after T0 (see step 304).
Specifically, the preset threshold may be an integer greater than or equal to 0. When the preset threshold is 0, the probability that memory detection collides with a data access request is reduced as far as possible.
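The preset condition of step 303 is a simple comparison against the threshold. The sketch below applies it to the numeric rows of Table 1; the function name `select_idle_units` and the dictionary representation of the record are assumptions made for illustration.

```python
def select_idle_units(frequencies, threshold=0):
    """Return the memory units whose access frequency is <= threshold."""
    if threshold < 0:
        raise ValueError("threshold must be an integer >= 0")
    return [unit for unit, freq in frequencies.items() if freq <= threshold]


table_1 = {"rank_0": 20, "rank_1": 8, "rank_2": 1}
print(select_idle_units(table_1, threshold=1))  # ['rank_2']
print(select_idle_units(table_1, threshold=0))  # []
```

A threshold of 0 selects only units that were not accessed at all in the window, which matches the remark that this setting minimizes collisions between detection and data access requests.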
Step 304: The memory controller 1101 performs detection on the first memory unit and determines the detection result for the first memory unit.
Specifically, memory detection includes data detection and hardware detection. Data detection can be used to detect whether the data read from the first memory unit contains a data error. Hardware detection can be used to detect whether the first memory unit contains a hard fail bit, that is, a memory bit whose hardware is faulty. For example, when a bit in the first memory unit has suffered a hard failure, the bit outputs a fixed value, either always 0 or always 1, regardless of whether the binary value written to it is 0 or 1.
Three memory detection modes are described below:
Detection mode 1: perform data detection and hardware detection on the first memory unit.
FIG. 5 is a schematic flowchart of the method corresponding to detection mode 1. It shows the flow of the data detection method (steps 502 to 505) and the flow of the hardware detection method (steps 506 to 514).
As shown in FIG. 5, the flow includes:
Step 501: The memory controller 1101 reads the data in the first memory unit (the first data) and stores the first data in the cache 1102 of the memory controller 1101.
The first data consists of the bit values stored in each bit of the first memory unit.
As a person skilled in the art will appreciate, a cache line (cacheline) is the smallest unit in which a cache (such as the cache 1102) caches data, and the amount of data stored in one rank may amount to one or more cache lines. For ease of description, the following assumes that the amount of data stored in the first memory unit is one cache line; correspondingly, the cache space in the cache 1102 that stores the first data is called the first cache line.
步骤502,检测第一数据是否存在数据错误,若存在,则执行步骤503,否则,执行步骤512。 Step 502 , detect whether there is a data error in the first data, if yes, execute step 503 , otherwise, execute step 512 .
在一种实施方式中,第一数据包括信息位和校验位,内存控制器1101可以基于校验算法,使用第一数据的校验位对第一数据的信息位进行校验,以检测是否存在数据错误。具体的,基于数据错误是否可以被纠正,数据错误包括可纠正错误(correctable error,CE)和不可纠正错误(uncorrectable error,UCE)。In one embodiment, the first data includes information bits and check bits, and the memory controller 1101 can use the check bits of the first data to check the information bits of the first data based on a check algorithm to detect whether There is a data error. Specifically, based on whether the data error can be corrected, the data error includes a correctable error (correctable error, CE) and an uncorrectable error (uncorrectable error, UCE).
步骤503,内存控制器1101判断该数据错误是否为CE,如果是,则执行步骤504,否则,执行步骤505。 Step 503 , the memory controller 1101 judges whether the data error is CE, if yes, execute step 504 , otherwise, execute step 505 .
Step 504: correct the erroneous data in the first data to obtain second data, and write the second data back to the cache 1102.
As described above, the first data is stored in the first cache line; the error-corrected second data can be written back to that same first cache line.
Step 505: the memory controller 1101 reports the UCE information of the first data to an intelligent management unit (IMU).
The IMU is a physical core that processes reported UCE information.
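The data detection flow of steps 502 to 505 can be sketched as follows. The application does not name the check algorithm, so this sketch swaps in an extended Hamming (8,4) code (single-error correction, double-error detection) purely for illustration. The names `encode`, `check`, `data_detection`, and `imu_reports` are hypothetical, and the cache is modeled as a plain dictionary.

```python
from enum import Enum


class ErrorKind(Enum):
    NONE = "no error"
    CE = "correctable error"
    UCE = "uncorrectable error"


def encode(nibble):
    """Encode 4 information bits into 8 bits; index 0 holds the overall parity."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Steps 502 and 503: classify the error; return corrected bits for a CE.

    Assumes at most two bits have flipped, which is the limit of this code.
    """
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return ErrorKind.NONE, list(code)
    if overall == 1:
        fixed = list(code)
        fixed[syndrome] ^= 1  # syndrome 0 means the overall parity bit flipped
        return ErrorKind.CE, fixed
    return ErrorKind.UCE, None


def data_detection(cache, line, imu_reports):
    """Run steps 502 to 505 on the data held in cache[line]."""
    kind, fixed = check(cache[line])
    if kind is ErrorKind.CE:
        cache[line] = fixed        # step 504: write the second data back
    elif kind is ErrorKind.UCE:
        imu_reports.append(line)   # step 505: report the UCE to the IMU
    return kind
```

A single flipped bit is classified as a CE and repaired in the cache line, while two flipped bits are classified as a UCE and only reported.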
The data detection flow is described above; the hardware detection flow is described below.
Step 506: the memory controller 1101 writes test data (referred to as first test data) into the first memory unit.
Step 507: the memory controller 1101 reads the data stored in the first memory unit (referred to as third data).
For example, the third data may be read after a preset duration (referred to as a third preset duration) has elapsed since the first test data was written, which verifies how long the first memory unit can retain data correctly. Alternatively, the time at which the third data is read back may be left unrestricted; for example, the third data may be read immediately after the first test data is written. This embodiment of the present application does not limit this. Similar cases below are not described again.
Step 508: the memory controller 1101 compares the first test data with the third data to determine the locations of any hard-fail bits in the first memory unit.
For example, the bit values at each identical bit position of the first test data and the third data are compared. If they match, the bit is determined to be fault-free; otherwise, the bit is determined to be a hard-fail bit.
For example, the bit value of the first bit of the first test data is compared with the bit value of the first bit of the third data. If the first bit of the first test data is 0 and the first bit of the third data is also 0, the two match, and the first bit of the first memory unit is determined to be fault-free. If the first bit of the first test data is 0 and the first bit of the third data is 1, the two differ, and the first bit of the first memory unit is determined to be a hard-fail bit. The second bit of the first test data is then compared with the second bit of the third data, and so on. Every bit is checked in this way to determine all hard-fail bits in the first memory unit; the first memory unit may, of course, contain no hard-fail bits.
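The bit-by-bit comparison of step 508 can be sketched with an exclusive OR, which yields 1 exactly at the positions where the two values differ. The function name `find_hard_fail_bits` is hypothetical, and numbering bits from 0 at the least significant end is an assumption of this sketch, not something the application specifies.

```python
def find_hard_fail_bits(test_data, read_back, width):
    """Return the bit positions at which read_back differs from test_data."""
    mask = (1 << width) - 1
    diff = (test_data ^ read_back) & mask
    return [bit for bit in range(width) if (diff >> bit) & 1]


# The written value has bit 0 equal to 0, but the value read back has it equal to 1.
suspect_bits = find_hard_fail_bits(0b001101011010, 0b001101011011, 12)
```

An empty list means that no hard-fail bit was observed for this particular test pattern.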
Step 509: the memory controller 1101 writes second test data into the first memory unit.
Note that the second test data differs from the first test data, where the difference may mean that the bit values at the same bit positions differ. For example, the first test data is 001101011010 and the second test data is 110010100101, which is its bitwise complement. This is only an example and does not limit the number of bits of the test data. As another example, when the size of the first memory unit is 8 bytes, the first test data may be 0x5A5A5A and the second test data may be 0xA5A5A5.
Note that the above examples are illustrative only; this embodiment of the present application does not limit the values of the first test data and the second test data.
Step 510: the memory controller 1101 reads the data stored in the first memory unit (referred to as fourth data).
Step 511: the memory controller 1101 compares the second test data with the fourth data to determine the locations of any hard-fail bits in the first memory unit.
For the comparison method, see the description of step 508 above; details are not repeated here.
Note that FIG. 5 shows hard-fail bit detection being performed twice on the first memory unit, but this embodiment of the present application is not limited to this. Hard-fail bit detection may be performed at least twice on the same memory unit to determine the hard-fail bits in the first memory unit. This reduces the probability of a missed detection, which arises when the test value written happens to equal the value that a hard-fail bit outputs, so that the value read back appears correct.
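Why a second pass with a different pattern reduces missed detections can be illustrated with a small simulation. This is a sketch under stated assumptions: the `FaultyUnit` class, which models hard-fail bits as bits stuck at a fixed value, and the functions `scan_once` and `hard_fail_scan` are hypothetical, and the second pattern is taken to be the bitwise complement of the first, as in the 12-bit example above.

```python
class FaultyUnit:
    """Toy memory unit whose listed bits always read back a fixed value."""

    def __init__(self, width, stuck_at):
        self.width = width
        self.stuck_at = dict(stuck_at)  # bit position -> stuck value (0 or 1)
        self.value = 0

    def write(self, value):
        self.value = value & ((1 << self.width) - 1)

    def read(self):
        value = self.value
        for bit, level in self.stuck_at.items():
            if level:
                value |= 1 << bit
            else:
                value &= ~(1 << bit)
        return value


def scan_once(unit, test_data):
    """One write, read, and compare pass (steps 506 to 508)."""
    unit.write(test_data)
    diff = unit.read() ^ test_data
    return {bit for bit in range(unit.width) if (diff >> bit) & 1}


def hard_fail_scan(unit, pattern):
    """Two passes: the pattern itself, then its bitwise complement."""
    mask = (1 << unit.width) - 1
    first = pattern & mask
    second = ~pattern & mask
    return sorted(scan_once(unit, first) | scan_once(unit, second))


unit = FaultyUnit(12, {0: 0, 1: 1, 5: 1})
single_pass = scan_once(unit, 0b001101011010)
two_pass = hard_fail_scan(unit, 0b001101011010)
```

In this run the first pattern writes 0 to bit 0 and 1 to bit 1, which are exactly the values those bits are stuck at, so a single pass reports only bit 5. The complementary pass exposes bits 0 and 1 as well.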
Step 512: the memory controller 1101 determines whether the first memory unit contains a hard-fail bit. If it does not, perform step 513; otherwise, perform step 514.
Step 513: the memory controller 1101 writes the data stored in the first cache line back to the first memory unit.
Step 514: the memory controller 1101 sends the hard-fail bit detection result for the first memory unit to the IMU.
Note that data detection and hardware detection are two independent detection methods, so there is no strict ordering between the data detection flow (steps 502 to 505) and the hardware detection flow (steps 506 to 514). The two flows may be performed in parallel, the data detection flow may be performed before the hardware detection flow, or the hardware detection flow may be performed before the data detection flow; this embodiment of the present application does not limit this. Performing the two flows in parallel shortens the total time taken by memory detection.
The above uses the first memory unit as an example to describe the memory detection flow. The other memory units in the memory 120 can be detected in the same way, and details are not repeated here. These other memory units may be all memory units in the memory 120 other than the first memory unit, or a specified subset of them; this embodiment of the present application does not limit this.
During the detection process shown in FIG. 5, a data access request (a data read request or a data write request) for the first memory unit may also be received. Taking a data access request from the processor 110 as an example, the following describes how data read requests and data write requests received during memory detection are handled:
1. A data read request is received during memory detection.
The data read request may be received at any point in the memory detection flow shown in FIG. 5; this embodiment of the present application does not limit this. Regardless of when the data read request is received, it is answered in one of the following ways, depending on whether the first data contains a data error:
1) If step 502 determines that the first data contains no data error, then in response to the data read request, the memory controller 1101 sends the first data stored in the first cache line of the cache 1102 to the processor 110.
Note that the data read request may be received before or after step 502. If it is received before step 502, the memory controller 1101 may wait until step 502 has completed and the first data has been determined to be free of data errors before responding, which ensures that correct data is returned to the processor 110 and improves data reliability. Similar cases below are not described again.
2) If step 503 determines that the first data contains a correctable error, then in response to the data read request, after the error-corrected second data has been written to the first cache line in step 504, the memory controller 1101 sends the second data stored in the first cache line to the processor 110.
3) If step 503 determines that the first data contains an uncorrectable error, then in response to the data read request, the memory controller 1101 sends an error response.
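The three read responses above can be summarized in a small dispatch function. This is only a sketch: the state names (`pending`, `clean`, `corrected`, `uncorrectable`) and the function `respond_to_read` are hypothetical labels for the outcomes of steps 502 to 504, not terms from the application.

```python
def respond_to_read(check_state, first_cache_line):
    """Decide how to answer a read request that arrives during memory detection.

    check_state is one of:
      "pending"       - steps 502 to 504 have not finished yet
      "clean"         - step 502 found no data error (line holds the first data)
      "corrected"     - step 504 has written the second data to the line
      "uncorrectable" - step 503 found a UCE
    """
    if check_state == "pending":
        return ("wait", None)            # respond after the check completes
    if check_state == "uncorrectable":
        return ("error", None)           # case 3: error response
    if check_state in ("clean", "corrected"):
        return ("data", first_cache_line)  # cases 1 and 2: serve from the cache
    raise ValueError("unknown check state: " + str(check_state))
```

In both data cases the response is served from the first cache line rather than from the first memory unit, which may be holding test data at that moment.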
2. A data write request is received during memory detection.
Similarly, the data write request may be received at any point in the memory detection flow shown in FIG. 5; this embodiment of the present application does not limit this. Regardless of when the data write request is received, it is handled in one of the following ways, depending on whether the first memory unit contains a hard-fail bit:
1) If step 512 determines that the first memory unit contains no hard-fail bit, the memory controller 1101 writes the data carried in the data write request (referred to as fifth data) into the first memory unit.
If the data write request is received before step 512, the fifth data carried in the request may first be held in the cache 1102. In one optional implementation, the fifth data is stored in the first cache line, so that step 513 writes the fifth data directly into the first memory unit. In another optional implementation, the fifth data is stored in a different cache line, referred to as a second cache line; after step 512 determines that the first memory unit contains no hard-fail bit, the fifth data is fetched from the second cache line and written into the first memory unit, and step 513 is not performed. Alternatively, if the data write request is received after step 512, the fifth data carried in the request may be written directly into the first memory unit, and step 513 is likewise not performed.
2) If step 512 determines that the first memory unit contains a hard-fail bit, then in response to the data write request, the memory controller 1101 writes the fifth data into a new, free memory unit in the memory 120 other than the first memory unit. The new memory unit may be a memory unit that memory detection has determined to be free of hard-fail bits, or a spare memory unit corresponding to the first memory unit; this embodiment of the present application does not limit this.
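The write handling described above can be sketched as follows, using the first optional implementation, in which a write that arrives before step 512 is parked in the first cache line. The function `handle_write`, the dictionaries standing in for the cache and the memory 120, and the key names are all hypothetical.

```python
def handle_write(hard_fail_state, fifth_data, cache, memory, first_unit, spare_unit):
    """Route a write request that arrives during memory detection.

    hard_fail_state is None before step 512 has run, False if step 512 found
    no hard-fail bit, and True if it found at least one.
    """
    if hard_fail_state is None:
        # Park the data in the first cache line; step 513 writes it back later.
        cache["first_line"] = fifth_data
        return "cached"
    if not hard_fail_state:
        memory[first_unit] = fifth_data   # case 1: the unit is healthy
        return "first_unit"
    memory[spare_unit] = fifth_data       # case 2: redirect to another unit
    return "spare_unit"
```

Redirecting to `spare_unit` stands in for either choice named in the text: a unit already verified to be free of hard-fail bits, or a spare unit reserved for the first memory unit.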
Detection method 1 detects data errors in the data stored in the first memory unit and corrects them promptly when they exist, and it also tests the hardware of the first memory unit so that hard-fail bits are found promptly. This lowers the risk of system downtime and improves system stability and reliability. It also provides a way to respond to data access requests during memory detection, which keeps the latency of those requests as low as possible.
Detection method 2: perform only data detection on the first memory unit.
For the flow of detecting data errors in the first memory unit, see the description of steps 501 to 505 in FIG. 5 above; details are not repeated here.
Similarly, a data access request for the first memory unit may be received while data error detection is being performed on the data stored in the first memory unit. A data read request received during memory detection is answered as described for data read requests in detection method 1; details are not repeated here.
When a data write request is received during memory detection, the memory controller 1101 may, in response, write the fifth data carried in the request directly into the first memory unit and may abort the data detection flow.
Detection method 2 corrects errors in the data stored in the first memory unit and provides a way to respond to data access requests during data detection, which keeps the latency of data access requests during memory detection as low as possible and improves data reliability.
Detection method 3: perform only hardware detection on the first memory unit.
For the flow of performing hardware detection on the first memory unit, see the description of step 501 and steps 506 to 514 in FIG. 5 above; details are not repeated here.
Similarly, a data access request for the first memory unit may be received while hard-fail bit detection is being performed on the first memory unit. A data write request received during memory detection is handled as described for data write requests in detection method 1; details are not repeated here.
When a data read request is received during memory detection, the memory controller 1101 may, in response, send the first data stored in the first cache line to the processor 110.
Detection method 3 performs hardware detection on the first memory unit to find its hard-fail bits and provides a way to respond to data access requests during hardware detection, which keeps the latency of data access requests during memory detection as low as possible and improves data reliability.
In the manner described above, when the access frequency of the first memory unit within a historical period is less than or equal to a preset threshold, memory detection of the first memory unit can be triggered to determine its detection result. Whether to trigger memory detection can be decided from the access frequency of the first memory unit at power-on or during operation, which keeps the latency of data access requests to the first memory unit as low as possible. Memory detection is no longer tied to specific occasions, which makes it more flexible, allows memory faults to be found promptly, lowers the risk of system downtime, and improves system stability and reliability.
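The trigger condition can be sketched as a count over a sliding history window. This is a minimal illustration: representing the historical access record as a list of timestamps, and the names `access_count` and `should_trigger_detection`, are assumptions of the sketch, and the window length and threshold values are arbitrary.

```python
def access_count(access_times, now, window):
    """Count recorded accesses that fall within the last `window` time units."""
    return sum(1 for t in access_times if now - window < t <= now)


def should_trigger_detection(access_times, now, window, threshold):
    """Trigger memory detection when the unit has been accessed rarely enough."""
    return access_count(access_times, now, window) <= threshold


# Hypothetical access record for one memory unit, in arbitrary time units.
record = [1, 50, 95, 99]
recent = access_count(record, now=100, window=10)
trigger = should_trigger_detection(record, now=100, window=10, threshold=2)
```

Here only the accesses at times 95 and 99 fall inside the window, so a threshold of 2 allows detection to start, while a threshold of 1 would defer it.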
Based on the same concept as the method embodiments, an embodiment of the present application further provides a memory detection apparatus configured to perform the method performed by the memory controller in the method embodiment shown in FIG. 3 or FIG. 5. As shown in FIG. 6, the memory detection apparatus 600 includes a determination module 601 and a detection module 602 and, optionally, a reading module 603 and a communication module 604. Within the memory detection apparatus 600, the modules are connected to one another directly through communication paths.
The determination module 601 is configured to determine, based on a historical access record of the first memory unit, the access frequency of the first memory unit within a historical time period of a preset duration. The first memory unit is any one of the plurality of memory units, and the historical access record records information about access requests for the first memory unit. The determination module 601 is further configured to determine whether the access frequency satisfies a preset condition. For the specific implementation, see the description of step 302 in FIG. 3; details are not repeated here.
The detection module 602 is configured to perform memory detection on the first memory unit to determine the detection result of the first memory unit when the access frequency satisfies the preset condition. For the specific implementation, see the description of step 303 in FIG. 3; details are not repeated here.
In a possible implementation, the preset condition includes that the access frequency of the first memory unit is not greater than a preset threshold.
In a possible implementation, the reading module 603 is configured to read the first data stored in the first memory unit and store the first data in a first storage unit of a first memory. For the specific implementation, see the description of step 501 in FIG. 5; details are not repeated here.
The detection module 602 is configured to check whether the first data contains a data error. If the first data contains no error, the first data stored in the first storage unit is retained. If the first data contains a correctable error (CE), the detection module is further configured to correct the first data to obtain second data and write the second data into the first storage unit. If the first data contains an uncorrectable error (UCE), error information indicating the UCE is sent through the communication module. For the specific implementation, see the description of steps 502 to 505 in FIG. 5; details are not repeated here.
In a possible implementation, the detection module 602 is further configured to detect whether the first memory unit contains a hard-fail bit, a hard-fail bit being a bit whose read binary value differs from its written binary value. If a hard-fail bit exists, fault information indicating the one or more hard-fail bits detected in the first memory unit is sent through the communication module. If none exists and the first data contains a CE, the detection module is further configured to write the second data stored in the first storage unit into the first memory unit. If none exists and the first data contains no error, the detection module is further configured to write the first data stored in the first storage unit into the first memory unit. For the specific implementation, see the description of steps 506 to 514 in FIG. 5; details are not repeated here.
In a possible implementation, the communication module 604 is configured to receive a read request that requests reading of the data stored in the first memory unit. If the first data contains a CE, the determination module 601 is further configured to obtain the second data from the first storage unit in response to the read request and send the second data through the communication module 604. If the first data contains no error, the determination module 601 is further configured to obtain the first data from the first storage unit in response to the read request and send the first data through the communication module 604.
In a possible implementation, the communication module 604 is configured to receive a write request that requests writing of third data carried in the write request into the first memory unit. If the first memory unit contains no hard-fail bit, the determination module 601 is further configured to write the third data into the first memory unit. If the first memory unit contains a hard-fail bit, the determination module 601 is further configured to write the third data into a second memory unit, the second memory unit being one of the plurality of memory units included in the memory other than the first memory unit.
Based on the above content and the same concept, the present application provides a computing device. FIG. 7 is a schematic diagram of a computing device 700 according to an embodiment of the present application. As shown in FIG. 7, the computing device 700 includes a processor 701 (in which a memory controller 2 is provided), a communication interface 704, a storage medium 705, and a bus 706. The processor 701, the communication interface 704, and the storage medium 705 communicate through the bus 706, or may communicate by other means such as wireless transmission. The memory medium 702 may be used to store computer-executable instructions, and the memory controller 2 is configured to execute the computer-executable instructions stored in the memory medium 702.
The memory medium 702 stores computer-executable instructions, and the memory controller 2 can invoke the computer-executable instructions stored in the memory medium 702 to perform the following operations:
determining, based on the historical access record of the first memory unit, the access frequency of the first memory unit within a historical time period of a preset duration, where the first memory unit is any one of the plurality of memory units and the historical access record records information about access requests for the first memory unit;
when the access frequency satisfies a preset condition, performing memory detection on the first memory unit to determine the detection result of the first memory unit.
In this embodiment of the present application, the processor 701 integrates the memory controller. Specifically, the processor 701 may be a CPU, for example an x86-architecture processor or an ARM-architecture processor. The processor 701 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, a system on chip (SoC), a graphics processing unit (GPU), an artificial intelligence (AI) chip, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
In a possible embodiment, the memory controller may instead be a controller located off the chip of the processor 701 that implements the same functions as the memory controller described above.
The memory medium 702 may include read-only memory and random access memory, and provides instructions and data to the processor 701. The memory medium 702 may also include non-volatile random access memory. The memory medium 702 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM). Optionally, the memory medium 702 may also be storage class memory (SCM), where the SCM includes at least one of phase-change memory (PCM), magnetic random access memory (MRAM), resistive random access memory (RRAM), ferroelectric memory (FRAM), fast NAND, or nano random access memory (NRAM).
In addition to a data bus, the bus 706 may include a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are all labeled as the bus 706 in the figure. The bus 706 may be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a Compute Express Link (CXL), a Cache Coherent Interconnect for Accelerators (CCIX), or the like. The bus 706 may be divided into an address bus, a data bus, a control bus, and so on.
Note that although the computing device 700 shown in FIG. 7 uses one processor 701 as an example, in a specific implementation the computing device 700 may include multiple processors, and the number of processor cores in each processor is not limited.
It should be understood that the memory controller 2 in the computing device 700 according to this embodiment of the present application may correspond to the memory detection apparatus 600 in the embodiments of the present application, and to the entity that performs the method of the embodiment shown in FIG. 3 or FIG. 5. The above and other operations and/or functions of the modules in the memory medium 702 implement the corresponding flows of the respective methods; for brevity, details are not repeated here.
In a possible embodiment, the present application further provides a memory controller configured to implement the operation steps of the methods shown in FIG. 3 to FIG. 5; for brevity, details are not repeated here.
Based on the above content and the same concept, the present application provides a computer-readable storage medium that stores a computer program or instructions. When the computer program or instructions are executed by a computer, the computer performs the method performed by the memory controller in the above method embodiments.
Based on the above content and the same concept, the present application provides a computer program product that includes a computer program or instructions. When the computer program or instructions are executed by a computing device, the method performed by the memory controller in the above method embodiments is implemented.
Based on the above content and the same concept, the present application provides a chip that includes at least one processor and an interface. The interface is configured to provide program instructions or data to the at least one processor, and the at least one processor is configured to execute the program instructions to implement the method performed by the memory controller in the above method embodiments.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
The various illustrative logic units and circuits described in the embodiments of this application may implement or perform the described functions by means of a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. The general-purpose processor may be a microprocessor; optionally, it may also be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors together with a digital signal processor core, or any other similar configuration.
The steps of the methods or algorithms described in the embodiments of this application may be embedded directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium in the art. For example, the storage medium may be connected to the processor so that the processor can read information from and write information to the storage medium. Optionally, the storage medium may be integrated into the processor. The processor and the storage medium may be disposed in an ASIC.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although this application has been described with reference to specific features and embodiments, it is apparent that various modifications and combinations may be made without departing from the spirit and scope of this application. Accordingly, the specification and drawings are merely illustrative of this application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Clearly, a person skilled in the art may make various changes and variations to this application without departing from its scope. If such changes and variations fall within the scope of the claims of this application and their equivalents, this application is intended to cover them as well.

Claims (19)

  1. A memory detection method, applied to a computing device, wherein the computing device comprises a memory controller and a memory, the memory comprises a plurality of memory units, and the method comprises:
    determining, by the memory controller based on a historical access record of a first memory unit, an access frequency of the first memory unit within a historical time period of a preset duration, wherein the first memory unit is any one of the plurality of memory units, and the historical access record records information about access requests for the first memory unit; and
    when the access frequency satisfies a preset condition, performing, by the memory controller, memory detection on the first memory unit to determine a detection result of the first memory unit.
  2. The method according to claim 1, wherein the preset condition comprises that the access frequency of the first memory unit is not greater than a preset threshold.
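For illustration only, the selection step of claims 1 and 2 might be sketched in Python as follows. The class name `AccessHistory`, the function `should_detect`, the use of timestamps as the recorded access information, and the sliding-window bookkeeping are all assumptions made for this sketch; the claims do not prescribe a data structure.

```python
from collections import deque


class AccessHistory:
    """Hypothetical per-unit access log kept over a sliding time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.records = {}  # memory unit id -> deque of access timestamps

    def record(self, unit, timestamp):
        # Log one access request for the given memory unit.
        self.records.setdefault(unit, deque()).append(timestamp)

    def frequency(self, unit, now):
        # Count accesses that fall inside (now - window, now].
        q = self.records.get(unit, deque())
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)


def should_detect(history, unit, now, threshold):
    # Preset condition of claim 2: access frequency not greater than threshold.
    return history.frequency(unit, now) <= threshold
```

Under this sketch, rarely accessed units are the ones selected for detection, which is consistent with a unit that is never read receiving no error checking through normal traffic.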
  3. The method according to claim 1 or 2, wherein performing memory detection on the first memory unit to determine the detection result of the first memory unit comprises:
    reading, by the memory controller, first data stored in the first memory unit, and storing the first data in a first storage unit of a first storage of the memory controller;
    checking, by the memory controller, whether the first data contains a data error; and
    if the first data contains no error, keeping the first data stored in the first storage unit; or
    if the first data contains a correctable error (CE), performing error correction on the first data to obtain second data, and writing the second data into the first storage unit; or
    if the first data contains an uncorrectable error (UCE), sending error information indicating the UCE.
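The three branches of claim 3 can be illustrated with a small Python sketch. The claims do not name an error-correcting code, so the Hamming(8,4) SECDED code used here is purely an assumption chosen because it distinguishes single-bit (correctable) from double-bit (uncorrectable) errors; `encode`, `check`, `scrub_read`, and the dictionary-based `memory` and `buffer` are hypothetical names.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions with parity bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Return ("OK" | "CE" | "UCE", corrected codeword or None)."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "OK", list(code)
    if overall == 1:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "CE", fixed
    return "UCE", None  # even number of flips with nonzero syndrome


def scrub_read(memory, unit, buffer):
    """Read a unit into the controller buffer and classify its error state."""
    first_data = list(memory[unit])
    buffer[unit] = first_data              # store first data in the buffer
    status, corrected = check(first_data)
    if status == "CE":
        buffer[unit] = corrected           # second data replaces first data
    elif status == "UCE":
        return status, "uncorrectable error in unit %d" % unit
    return status, None
```

Note that in this sketch the corrected data is written only to the controller-side buffer, matching claim 3; writing it back to the memory unit is a separate step.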
  4. The method according to claim 3, wherein after the memory controller stores the first data in the first storage unit, the method further comprises:
    detecting, by the memory controller, whether a hard fail bit exists in the first memory unit, wherein a hard fail bit is a bit whose read binary value differs from the binary value written to it; and
    if a hard fail bit exists, sending fault information, wherein the fault information indicates one or more hard fail bits detected in the first memory unit; or
    if no hard fail bit exists and the first data contains a CE, writing the second data stored in the first storage unit into the first memory unit; or, if no hard fail bit exists and the first data contains no error, writing the first data stored in the first storage unit into the first memory unit.
  5. The method according to claim 3 or 4, wherein after the memory controller stores the first data in the first storage unit, the method further comprises:
    receiving a read request, wherein the read request requests reading of the data stored in the first memory unit; and
    if the first data contains a CE, obtaining the second data from the first storage unit in response to the read request, and returning the second data; or
    if the first data contains no error, obtaining the first data from the first storage unit in response to the read request, and returning the first data.
  6. The method according to claim 3 or 4, wherein after the memory controller stores the first data in the first storage unit, the method further comprises:
    detecting a write request for the first memory unit, wherein the write request requests writing of third data into the first memory unit; and
    if no hard fail bit exists in the first memory unit, writing the third data into the first memory unit; or
    if a hard fail bit exists in the first memory unit, writing the third data into a second memory unit, wherein the second memory unit is a memory unit, other than the first memory unit, among the plurality of memory units comprised in the memory.
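A rough Python sketch of how read and write requests might be served under claims 5 and 6 is given below. The `Controller` class, its `remap` table, the list of spare units, and the decision to drop the buffered copy after a new write are assumptions for illustration; the claims only state that reads are answered from the first storage unit and that writes are redirected to a second memory unit when a hard fail bit exists.

```python
class Controller:
    """Hypothetical controller model for requests arriving around detection."""

    def __init__(self, memory, spare_units):
        self.memory = memory            # unit id -> stored data
        self.buffer = {}                # first storage: unit id -> buffered data
        self.hard_failed = set()        # units with detected hard fail bits
        self.remap = {}                 # failed unit id -> replacement unit id
        self.spares = list(spare_units)

    def handle_read(self, unit):
        # Claim 5: serve the (possibly corrected) copy held in the buffer.
        if unit in self.buffer:
            return self.buffer[unit]
        return self.memory[self.remap.get(unit, unit)]

    def handle_write(self, unit, data):
        # Claim 6: redirect the write if the target unit has a hard fail bit.
        if unit in self.hard_failed:
            if unit not in self.remap:
                # Raises IndexError if no spare unit is left (not handled here).
                self.remap[unit] = self.spares.pop(0)
            target = self.remap[unit]
        else:
            target = unit
        self.memory[target] = data
        # Assumption: a newer write makes the buffered copy stale.
        self.buffer.pop(unit, None)
```

For example, after marking unit 1 as hard-failed with unit 2 as the only spare, `handle_write(1, "new")` stores the data in unit 2 and leaves unit 1 untouched, and a later `handle_read(1)` follows the remap to unit 2.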
  7. The method according to claim 1 or 2, wherein performing memory detection on the first memory unit to determine the detection result of the first memory unit comprises:
    reading, by the memory controller, first data stored in the first memory unit, and storing the first data in a first storage unit of a first storage of the memory controller;
    detecting, by the memory controller, whether a hard fail bit exists in the first memory unit, wherein a hard fail bit is a bit whose read binary value differs from the binary value written to it; and
    if a hard fail bit exists, sending fault information, wherein the fault information indicates one or more hard fail bits detected in the first memory unit; or if no hard fail bit exists, writing the first data stored in the first storage unit back to the first memory unit.
  8. The method according to claim 4 or 7, wherein detecting, by the memory controller, whether a hard fail bit exists in the first memory unit comprises:
    writing, by the memory controller, first detection data into the first memory unit, and reading back fourth data stored in the first memory unit;
    comparing the bit values at the same bit positions in the first detection data and the fourth data to determine whether a hard fail bit exists in the first memory unit; and
    writing, by the memory controller, second detection data into the first memory unit, and reading back fifth data stored in the first memory unit; and comparing the bit values at the same bit positions in the second detection data and the fifth data to determine whether a hard fail bit exists in the first memory unit;
    wherein the second detection data is different from the first detection data.
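The write-and-read-back check of claims 7 and 8 can be sketched as follows. The simulated `FaultyCell` with stuck-at bits, the 8-bit width, and the complementary patterns 0x55 and 0xAA are assumptions for this example; claim 8 only requires that the second detection data differ from the first.

```python
class FaultyCell:
    """Simulated memory unit whose listed bits are stuck at a fixed value."""

    def __init__(self, width=8, stuck_at=None):
        self.width = width
        self.stuck_at = stuck_at or {}  # bit index -> forced value (0 or 1)
        self.value = 0

    def write(self, data):
        for bit, forced in self.stuck_at.items():
            data = (data | (1 << bit)) if forced else (data & ~(1 << bit))
        self.value = data & ((1 << self.width) - 1)

    def read(self):
        return self.value


def find_hard_fail_bits(cell, patterns=(0x55, 0xAA)):
    """Write each detection pattern, read it back, and compare bit by bit."""
    failed = set()
    for pattern in patterns:
        cell.write(pattern)
        diff = pattern ^ cell.read()    # 1 wherever written != read
        failed |= {b for b in range(cell.width) if (diff >> b) & 1}
    return sorted(failed)


def detect_and_restore(cell):
    """Claim 7 flow: buffer the data, test the unit, write back if healthy."""
    first_data = cell.read()            # held in the controller's buffer
    failed = find_hard_fail_bits(cell)
    if failed:
        return failed                   # would be reported as fault information
    cell.write(first_data)              # no hard fail bit: restore the data
    return []
```

Using two complementary patterns means every bit is tested with both a 0 and a 1, so a bit stuck at either value produces a mismatch in one of the two passes.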
  9. A memory detection apparatus, wherein the memory detection apparatus comprises:
    a determining module, configured to determine, based on a historical access record of a first memory unit, an access frequency of the first memory unit within a historical time period of a preset duration, wherein the first memory unit is any one of a plurality of memory units, and the historical access record records information about access requests for the first memory unit; and further configured to determine whether the access frequency satisfies a preset condition; and
    a detection module, configured to, when the access frequency satisfies the preset condition, perform memory detection on the first memory unit to determine a detection result of the first memory unit.
  10. The apparatus according to claim 9, wherein the preset condition comprises that the access frequency of the first memory unit is not greater than a preset threshold.
  11. The apparatus according to claim 9 or 10, wherein the apparatus further comprises a reading module and a communication module:
    the reading module is configured to read first data stored in the first memory unit, and store the first data in a first storage unit of a first storage;
    the detection module is configured to check whether the first data contains a data error; and
    if the first data contains no error, the first data stored in the first storage unit is kept; or
    if the first data contains a correctable error (CE), the detection module is further configured to perform error correction on the first data to obtain second data, and write the second data into the first storage unit; or
    if the first data contains an uncorrectable error (UCE), error information indicating the UCE is sent through the communication module.
  12. The apparatus according to claim 11, wherein the apparatus further comprises a detection module and a communication module:
    the detection module is configured to detect whether a hard fail bit exists in the first memory unit, wherein a hard fail bit is a bit whose read binary value differs from the binary value written to it; and
    if a hard fail bit exists, fault information is sent through the communication module, wherein the fault information indicates one or more hard fail bits detected in the first memory unit; or
    if no hard fail bit exists and the first data contains a CE, the detection module is further configured to write the second data stored in the first storage unit into the first memory unit; or, if no hard fail bit exists and the first data contains no error, the detection module is further configured to write the first data stored in the first storage unit into the first memory unit.
  13. The apparatus according to claim 11 or 12, wherein the apparatus further comprises a communication module:
    the communication module is configured to receive a read request, wherein the read request requests reading of the data stored in the first memory unit; and
    if the first data contains a CE, the determining module is further configured to obtain the second data from the first storage unit in response to the read request, and send the second data through the communication module; or
    if the first data contains no error, the determining module is further configured to obtain the first data from the first storage unit in response to the read request, and send the first data through the communication module.
  14. The apparatus according to claim 11 or 12, wherein the apparatus further comprises a communication module:
    the communication module is configured to receive a write request, wherein the write request requests writing of third data carried in the write request into the first memory unit; and
    if no hard fail bit exists in the first memory unit, the determining module is further configured to write the third data into the first memory unit; or
    if a hard fail bit exists in the first memory unit, the determining module is further configured to write the third data into a second memory unit, wherein the second memory unit is a memory unit, other than the first memory unit, among the plurality of memory units comprised in the memory.
  15. The apparatus according to claim 9 or 10, wherein the apparatus further comprises a reading module and a communication module:
    the reading module is configured to read first data stored in the first memory unit, and store the first data in a first storage unit of a first storage;
    the detection module is configured to detect whether a hard fail bit exists in the first memory unit, wherein a hard fail bit is a bit whose read binary value differs from the binary value written to it; and
    if a hard fail bit exists, the detection module is further configured to send fault information through the communication module, wherein the fault information indicates one or more hard fail bits detected in the first memory unit; or if no hard fail bit exists, the detection module is further configured to write the first data stored in the first storage unit back to the first memory unit.
  16. The apparatus according to claim 12 or 15, wherein the detection module is specifically configured to: write first detection data into the first memory unit, and read back, through the reading module, fourth data stored in the first memory unit; compare the bit values at the same bit positions in the first detection data and the fourth data to determine whether a hard fail bit exists in the first memory unit; and
    write second detection data into the first memory unit, and read back, through the reading module, fifth data stored in the first memory unit; and compare the bit values at the same bit positions in the second detection data and the fifth data to determine whether a hard fail bit exists in the first memory unit;
    wherein the second detection data is different from the first detection data.
  17. A memory detection apparatus, wherein the apparatus comprises a processor and a storage;
    the storage is configured to store computer program instructions; and
    the processor invokes the computer program instructions in the storage to perform the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, comprising a computer-readable storage medium storing program code, wherein the program code comprises instructions for performing the method according to any one of claims 1 to 8.
  19. A computing device, wherein the computing device comprises a processor and a memory controller;
    the processor is configured to send a data access request for a first memory unit to the memory controller; and
    the memory controller is configured to perform the method according to any one of claims 1 to 8.
PCT/CN2022/101243 2021-09-30 2022-06-24 Memory detection method and apparatus WO2023050927A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111162543.XA CN115904828A (en) 2021-09-30 2021-09-30 Memory detection method and device
CN202111162543.X 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023050927A1 true WO2023050927A1 (en) 2023-04-06

Family

ID=85733886

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101243 WO2023050927A1 (en) 2021-09-30 2022-06-24 Memory detection method and apparatus

Country Status (2)

Country Link
CN (1) CN115904828A (en)
WO (1) WO2023050927A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6934903B1 (en) * 2001-12-17 2005-08-23 Advanced Micro Devices, Inc. Using microcode to correct ECC errors in a processor
US7334159B1 (en) * 2003-09-29 2008-02-19 Rockwell Automation Technologies, Inc. Self-testing RAM system and method
CN102222025A (en) * 2011-06-17 2011-10-19 华为数字技术有限公司 Method and device for eliminating memory failure
CN110888821A (en) * 2019-09-30 2020-03-17 华为技术有限公司 Memory management method and device
CN112667422A (en) * 2019-10-16 2021-04-16 华为技术有限公司 Memory fault processing method and device, computing equipment and storage medium
WO2021185279A1 (en) * 2020-03-20 2021-09-23 华为技术有限公司 Memory failure processing method and related device

Also Published As

Publication number Publication date
CN115904828A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
US10824499B2 (en) Memory system architectures using a separate system control path or channel for processing error information
US10002043B2 (en) Memory devices and modules
KR102378466B1 (en) Memory devices and modules
US8806285B2 (en) Dynamically allocatable memory error mitigation
CN105373443B (en) Data system with memory system architecture and data reading method
US9785570B2 (en) Memory devices and modules
US20160004587A1 (en) Method, apparatus and system for handling data error events with a memory controller
US20210286667A1 (en) Cloud scale server reliability management
US20230053582A1 (en) System and method for error reporting and handling
US20220107752A1 (en) Data access method and apparatus
US20190140660A1 (en) Die-wise residual bit error rate (rber) estimation for memories
US10521113B2 (en) Memory system architecture
US20210173632A1 (en) Technologies for providing remote out-of-band firmware updates
WO2023050927A1 (en) Memory detection method and apparatus
US20230325276A1 (en) Error correction method and apparatus
WO2023124333A1 (en) Firmware refreshing method and apparatus, wireless module and storage medium
US20120166686A1 (en) Method, apparatus and system for aggregating interrupts of a data transfer
CN111625199B (en) Method, device, computer equipment and storage medium for improving reliability of solid state disk data path
WO2023056687A1 (en) Solid state disk and data manipulation method and apparatus therefor, and electronic device
US20190042364A1 (en) Technologies for maintaining data integrity during data transmissions
US20210311833A1 (en) Targeted repair of hardware components in a computing device
WO2021195979A1 (en) Data storage method and related device
WO2021249046A1 (en) Data access method, controller, memory, and storage medium
US10255986B2 (en) Assessing in-field reliability of computer memories
CN116483600A (en) Memory fault processing method and computer equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874324

Country of ref document: EP

Kind code of ref document: A1