WO2017215377A1 - Method and device for processing hard memory error - Google Patents

Publication number
WO2017215377A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
memory
error
information
access instruction
Prior art date
Application number
PCT/CN2017/083815
Other languages
French (fr)
Chinese (zh)
Inventor
张晔
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2017215377A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38 Response verification devices
    • G11C29/42 Response verification devices using error correcting codes [ECC] or parity check

Definitions

  • the present application relates to, but is not limited to, the field of communications, and in particular, to a method and apparatus for processing hard memory errors.
  • a system's physical memory generally consists of several memory chips. Each chip contains many storage cells, and each cell stores one bit of data. A memory error may affect a single bit or multiple bits. Memory errors are generally classified by cause into soft errors and hard errors. Soft errors occur randomly; for example, sudden electronic interference near the memory may cause a soft error.
  • ECC: Error Checking and Correcting
  • a hard error is caused by hardware damage or defects, so the data is always incorrect, and such errors cannot be corrected.
  • Memory usually supports ECC checking and error correction, which can automatically correct soft errors. Hard errors can be detected but not corrected.
  • core carrier-class routers/switches always carry a large amount of user services. If a memory failure occurs in the system (here, mainly a memory hard error), there is generally no alternative: the system must be powered off and the memory replaced. However, this interrupts service for a period of time, and the consequences are serious.
  • the embodiment of the invention provides a method and a device for processing a memory hard error, so as to correct a hard error of the memory without interrupting the service.
  • a method for processing a memory hard error, including: determining that a hard error fault occurs at a first address of the memory; performing error correction on the memory information in the first address, and storing the error-corrected memory information in a fault-free second address in the memory; and inserting a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction to the first address to an access instruction to the second address.
  • determining that a hard error fault occurs at the first address of the memory comprises: receiving an error checking and correcting (ECC) interrupt signal reported by the memory; and looking up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal.
  • the method further includes: performing a predetermined number of read and write tests on the memory information of the first address to determine whether the first address is faulty; and when the test result indicates that the first address is faulty, determining that a hard error fault has occurred at the first address.
  • jumping from the access instruction to the first address to the access instruction to the second address comprises: receiving breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address; and performing at least one of the following: when the breakpoint exception information indicates that the memory information of the first address is instruction information, jumping from an access instruction that reads the instruction information of the first address to an access instruction to the second address; when the breakpoint exception information indicates that the memory information of the first address is data information, jumping from an access instruction that reads the data information of the first address to an access instruction to the second address; when the breakpoint exception information indicates that the memory information of the first address is data information, jumping from an access instruction that writes the data information of the first address to an access instruction to the second address.
  • jumping from the access instruction to the first address to the access instruction to the second address comprises: receiving an access instruction to the first address; calculating the deviation from the first address to the second address; and correcting the first address by the deviation to obtain the second address, and jumping to an access instruction to the second address.
  • the method further comprises: jumping to the instruction that follows the access instruction to the first address.
  • a memory hard error processing apparatus, including: a determining module configured to determine that a hard error fault occurs at a first address of a memory; an error correction module configured to perform error correction on the memory information in the first address and to store the error-corrected memory information in a fault-free second address in the memory;
  • the jump module is configured to insert a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction to the first address to an access instruction to the second address.
  • the determining module includes: a first receiving unit configured to receive the error checking and correcting (ECC) interrupt signal reported by the memory; and a searching unit configured to look up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal.
  • the determining module further includes: a testing unit configured to, after the searching unit has looked up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal, perform a predetermined number of read and write tests on the memory information of the first address to determine whether the first address is faulty; and a determining unit configured to determine that a hard error fault has occurred at the first address when the test result indicates that the first address is faulty.
  • the jump module further includes: a second receiving unit configured to receive breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address; and a first jump unit configured to, when the breakpoint exception information indicates that the memory information of the first address is instruction information, jump from an access instruction that reads the instruction information of the first address to an access instruction to the second address.
  • a second jump unit configured to, when the breakpoint exception information indicates that the memory information of the first address is data information, perform at least one of the following: jumping from an access instruction that reads the data information of the first address to an access instruction to the second address; jumping from an access instruction that writes the data information of the first address to an access instruction to the second address.
  • a storage medium is also provided.
  • the storage medium is arranged to store program code for performing the following steps:
  • the embodiment of the present invention first determines that a hard error fault occurs at the first address of the memory, then performs error correction on the memory information in the first address and stores the error-corrected memory information in a fault-free second address in the memory, and finally inserts a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction to the first address to an access instruction to the second address.
  • the memory information of the first address, where the hard error fault occurred, is transferred to the second address of the memory, and an access instruction for the original first address jumps to an access instruction for the second address, so that the memory information can still be accessed.
  • FIG. 1 is a flow chart of a method for processing a memory hard error according to an embodiment of the present invention
  • FIG. 2 is a block diagram showing the structure of a memory hard error processing apparatus according to an embodiment of the present invention
  • FIG. 3 is a structural block diagram 1 of a memory hard error processing apparatus according to an embodiment of the present invention.
  • FIG. 4 is a structural block diagram 2 of a memory hard error processing apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a process of generating a device according to an embodiment of the present invention.
  • FIG. 6 is a flow chart of creating a special hardware data breakpoint exception handling function Vector1 in FIG. 5;
  • FIG. 7 is a flow chart of creating a special hardware instruction breakpoint exception handling function Vector2 in FIG. 5;
  • FIG. 8 is a flow chart of modifying the system's original hardware breakpoint exception handling function Vector in FIG. 5;
  • FIG. 9 is a flowchart of creating a memory ECC interrupt processing function vector_ecc in FIG. 5;
  • FIG. 10 is a flow chart of an ECC check error occurring in accordance with an embodiment of the present invention.
  • FIG. 11 is a process flow diagram of a hardware breakpoint exception occurring in accordance with an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for processing a memory hard error according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
  • Step S102 determining that a hard error fault occurs at the first address of the memory;
  • Step S104 performing error correction on the memory information in the first address, and storing the error-corrected memory information in a fault-free second address in the memory;
  • Step S106 inserting a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from the access instruction to the first address to the access instruction to the second address.
  • through the above steps, an access instruction to the first address jumps to an access instruction to the second address, so that the memory information of the first address where the hard error occurred can be accessed without interrupting the service or replacing the memory. The uncorrectable storage unit is avoided, serious consequences such as program errors or system crashes are prevented, and the stability of the system is improved.
  • the execution body of the above steps may be a processor, a CPU, a memory management unit, or the like, but is not limited to these.
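Steps S102 to S106 can be pictured with a small software model. The following is only a sketch under stated assumptions: `RemappingMemory`, `retire`, and the addresses used are hypothetical names chosen for illustration, and a Python dictionary stands in for the hardware breakpoint, which on a real CPU would raise an exception instead.

```python
class RemappingMemory:
    """Toy word-addressed memory that redirects accesses away from a faulty address."""

    def __init__(self, size):
        self.cells = [0] * size
        # Maps a faulty first address to its fault-free second address.
        # This dictionary plays the role of the hardware breakpoint.
        self.breakpoints = {}

    def retire(self, first, second, corrected):
        """S104: store the corrected data at `second`; S106: trap `first`."""
        self.cells[second] = corrected
        self.breakpoints[first] = second

    def _resolve(self, addr):
        # An access to a trapped address is redirected to the second address.
        return self.breakpoints.get(addr, addr)

    def read(self, addr):
        return self.cells[self._resolve(addr)]

    def write(self, addr, value):
        self.cells[self._resolve(addr)] = value


mem = RemappingMemory(16)
mem.cells[3] = 0xBAD                      # S102: address 3 holds corrupted data
mem.retire(first=3, second=15, corrected=0x1234)
value_before = mem.read(3)                # served from address 15
mem.write(3, 7)                           # also lands at address 15
```

After `retire`, the program keeps using address 3 and never touches the faulty cell again, which is the behavior the method aims for.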
  • determining that the first address of the memory has a hard error fault includes:
  • the storage function of the first address may be detected, and after step S12, the method may further include:
  • test result indicates that the first address is faulty
  • jumping from an access instruction to the first address to an access instruction to the second address comprises:
  • the breakpoint abnormality information represents that the memory information of the first address is instruction information, and jumps from an access instruction to the instruction information that reads the first address to an access instruction to the second address.
  • the breakpoint abnormality information characterizing the memory information of the first address is data information, jumping from an access instruction for reading data information of the first address to an access instruction to the second address;
  • the breakpoint exception information characterizes the memory information of the first address as data information, jumping from an access instruction to the data information writing the first address to an access instruction to the second address.
  • jumping from an access instruction to the first address to an access instruction to the second address comprises:
  • the jump to the instruction that follows the instruction that generated the breakpoint exception (i.e., the instruction that accessed the first address)
  • the jump operation can be implemented in software by different functions, and can be constructed in the form of assembly language or binary code; which language (e.g., C, Java, etc.) is used to construct the jump function is not limited in this embodiment:
  • A_ok (a specified fault-free special memory address) is used to save the error-corrected data D_ok.
  • A1_code1: Vector1 entry address
  • A1_code2: corrected data read/write instruction address
  • A1_stack: Vector1 stack frame address
  • the function content is: 1. Save the register values of the breakpoint exception context to A1_stack. 2. Analyze the instruction code C1_old that triggered the hardware breakpoint, and calculate the deviation between the specified memory address A_ok, which holds the correct data, and the memory address A_error where the ECC error occurred.
  • A2_code: Vector2 entry address
  • A2_stack: Vector2 stack frame address
  • the assembly code C2_new is the following instruction: a jump from the address of C2_new (A_ok plus the length of the assembly code represented by the data D_ok at A_ok) to the address of the assembly instruction that follows the instruction C_old that triggered the breakpoint (C_old + length of C_old). 2. Restore the register values of the breakpoint exception context from the address A2_stack. 3. Jump to A_ok.
  • the method according to the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk), which includes a plurality of instructions for causing a terminal device (which may be a cell phone, computer, server, network device, etc.) to perform the methods of the various embodiments of the present invention.
  • a memory hard error processing device is also provided, which is used to implement the foregoing embodiments and implementation manners; what has already been described is not repeated.
  • the term "module" may refer to a combination of software and/or hardware that implements a predetermined function.
  • the devices described in the following embodiments may be implemented in software; implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 2 is a structural block diagram of a memory hard error processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes:
  • the determining module 20 is configured to determine that a hard error fault occurs at the first address of the memory;
  • the error correction module 22 is configured to perform error correction on the memory information in the first address, and store the error-corrected memory information in a fault-free second address in the memory;
  • the jump module 24 is configured to insert a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from the access instruction to the first address to the access instruction to the second address.
  • FIG. 3 is a block diagram showing the structure of a memory hard error processing apparatus according to an embodiment of the present invention.
  • the determining module 20 includes:
  • the first receiving unit 30 is configured to receive an error detection correction ECC interrupt signal reported by the memory
  • the searching unit 32 is configured to look up the corresponding first address in the ECC error trapping address register according to the ECC interrupt signal.
  • the determining module 20 further includes:
  • the test unit 34 is configured to: after the search unit has looked up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal, perform a predetermined number of read and write tests on the memory information of the first address to determine whether the first address is faulty;
  • the determining unit 36 is configured to determine that a hard error fault occurs at the first address when the test result indicates that the first address is faulty.
  • the device includes all the modules shown in FIG. 2, and the jump module 24 further includes:
  • the second receiving unit 40 is configured to receive breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address;
  • the first jump unit 42 is configured to, when the breakpoint exception information indicates that the memory information of the first address is instruction information, jump from the access instruction that reads the instruction information of the first address to the access instruction to the second address;
  • the second jump unit 44 is configured to, when the breakpoint exception information indicates that the memory information of the first address is data information, perform at least one of the following: jumping from an access instruction that reads the data information of the first address to an access instruction to the second address; jumping from an access instruction that writes the data information of the first address to an access instruction to the second address.
  • each of the above modules may be implemented by software or hardware.
  • for the latter, this may be achieved in, but is not limited to, the following ways: the foregoing modules are all located in the same processor; or the above modules are located in different processors in any combination.
  • this embodiment relates to a method for keeping an embedded system operating normally, without restarting it, when a storage unit in a memory chip suffers irreparable hardware damage.
  • the embodiment provides a method and device: when an uncorrectable hard error occurs in the memory of the embedded system, the system can continue to operate normally without restarting, and the stability of data communication products in market applications can be significantly improved.
  • the memory controller supports ECC checking and error correction. As long as the memory in use also supports ECC checking, an ECC interrupt can be reported to the CPU when a memory error occurs. After the ECC error occurs, the CPU provides the error address in the ECC error capture address register. The operating system can then handle this ECC interrupt accordingly.
  • the CPU provides a hardware breakpoint function to monitor whether a specified memory address is being read or written. Such accesses include data reads/writes and instruction reads; once the specified memory address is accessed in either way, an exception is reported to the operating system. This application utilizes the above two functions provided by the CPU.
  • there are two kinds of hardware breakpoints. One is the instruction breakpoint, which triggers a breakpoint exception when the CPU fetches an instruction from memory; when the exception occurs, the address of the instruction itself equals the breakpoint address. The other is the data breakpoint, which triggers a breakpoint exception when the CPU performs a data read/write on the memory address; when the exception occurs, the operand address of the instruction equals the breakpoint address. Once a subsequent program accesses the memory address again, a special processing flow is entered to circumvent the faulty memory address.
  • FIG. 5 is a schematic diagram of a process of generating a device according to an embodiment of the present invention. As shown in FIG. 5, the implementation steps of this embodiment include:
  • step S501 the system starts.
  • step S502 A_ok (specified non-faulty special memory address) is used to save the error-corrected data D_ok.
  • Step S503 creating a special hardware data breakpoint exception handling function Vector1.
  • step S601 three special fault-free memory regions are specified: A1_code1 (Vector1 entry address), A1_code2 (corrected data read/write instruction address) and A1_stack (Vector1 stack frame address).
  • Step S602 placing a function in the form of assembly code at A1_code1. The function content is: 1. Save the register values of the breakpoint exception context to A1_stack. 2. Analyze the instruction code C1_old that triggered the hardware breakpoint, calculate the deviation between the specified memory address A_ok, which holds the correct data, and the memory address A_error where the ECC error occurred, and use this deviation to correct and re-create the instruction code as C1_new1 (the corrected data read/write instruction). 3. Place the newly created instruction C1_new1 at the specified memory address A1_code2. 4. Append a branch code C1_new2 that jumps to the address of the assembly instruction following the instruction C_old that triggered the breakpoint (C_old + length of C_old). 5. Restore the register values of the breakpoint exception context from the address A1_stack. 6. Jump to the newly created instruction C1_new1.
  • step S603 Vector1 is created.
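The address correction performed inside Vector1 can be sketched on a decoded instruction. This is an illustration only: the `Access` record (a load or store whose effective address is a base register value plus an immediate offset) and the `redirect` helper are hypothetical, and real instruction encodings differ between CPU architectures.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Access:
    """Hypothetical decoded load/store: effective address = base + offset."""
    op: str       # "load" or "store"
    base: int     # value of the base register when the breakpoint fired
    offset: int   # immediate displacement encoded in the instruction


def redirect(access, a_error, a_ok):
    """Re-create the instruction so that it targets a_ok instead of a_error."""
    if access.base + access.offset != a_error:
        raise ValueError("access does not target the faulty address")
    deviation = a_ok - a_error            # deviation between A_ok and A_error
    return replace(access, offset=access.offset + deviation)


c1_old = Access(op="load", base=0x1000, offset=0x20)   # targets 0x1020
c1_new1 = redirect(c1_old, a_error=0x1020, a_ok=0x8000)
```

Only the displacement changes, so the re-created instruction performs the same operation with the same registers, just on the fault-free address.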
  • Step S504 creating a special hardware instruction breakpoint exception handling function Vector2.
  • Step S701 after the system is started, two special memory regions are specified: A2_code (Vector2 entry address) and A2_stack (Vector2 stack frame address).
  • Step S702 placing a function in the form of binary code at A2_code. The function content is: 1. Append the assembly code C2_new after the corrected content of A_error stored at the specified correct-data address A_ok (the specified fault-free special memory address). C2_new is the following instruction: a jump from the address of C2_new (A_ok plus the length of the assembly code represented by the data D_ok at A_ok) to the address of the assembly instruction that follows the instruction C_old that triggered the breakpoint (C_old + length of C_old). 2. Restore the register values of the breakpoint exception context from the address A2_stack. 3. Jump to A_ok.
  • step S703 Vector2 is created.
  • Step S505 modifying the original hardware breakpoint exception processing function Vector of the system.
  • Step S801 determining whether the currently interrupted instruction is one in which the CPU accessed the faulty memory address A_error. If so, it is further determined whether the hardware breakpoint is an instruction breakpoint or a data breakpoint. If it is a data breakpoint, execution jumps directly to the special data breakpoint exception handling function Vector1 created in step S503. If it is an instruction breakpoint, execution jumps to the special hardware instruction breakpoint exception handling function Vector2 created in step S504. If the currently interrupted instruction is unrelated to the faulty memory address A_error, processing proceeds according to the normal hardware breakpoint exception handling function Vector.
  • step S802 the Vector is modified.
  • step S506 a memory ECC interrupt processing function vector_ecc is created.
  • Step S901 after the CPU reports an ECC check error, determining whether it is a true hard error, so as to exclude soft errors.
  • the method is: in the ECC interrupt handler, the memory address A_error where the error occurred is obtained from the ECC error capture address register of the memory controller, the data D_error is obtained from the ECC error capture data register of the memory controller, and the syndrome code D_syndrome is obtained from the ECC syndrome register of the memory controller and decoded to find which bit is faulty, from which the correct data D_ok is calculated. Then, a certain number of 0/1 read/write tests are performed on the memory address A_error to determine whether the address is still faulty.
  • if the fault does not recur, the error is judged to be a soft error: the previously calculated data D_ok is written back to the address A_error, the ECC interrupt processing exits, and the process ends. If the fault persists, a hard error is confirmed: the correct data D_ok is saved to the specified special address A_ok, and a hardware breakpoint is set at the faulty memory address A_error. Since a memory address may store either data or code, and an address that stores code may also be accessed as data, both an instruction breakpoint Bc and a data breakpoint Bd must be set at the same time.
  • step S902 vector_ecc is created.
  • step S507 the device is created.
  • the above steps S501 to S507 are the generation of the device or the creation process of the software.
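How a syndrome code identifies the faulty bit in step S901 can be illustrated with a textbook Hamming(7,4) code. This is a stand-in and not the scheme of any particular memory controller: real controllers typically use wider SECDED codes with controller-specific syndrome tables. The function names are hypothetical, and bit positions are numbered 1 to 7 from the least significant bit.

```python
def syndrome(word, nbits=7):
    """XOR of the 1-based positions of all set bits; 0 means no detectable error."""
    s = 0
    for pos in range(1, nbits + 1):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def correct_single_bit(word):
    """Return the corrected codeword, flipping the bit the syndrome points at."""
    s = syndrome(word)
    if s == 0:
        return word                       # nothing to fix
    return word ^ (1 << (s - 1))          # flip the faulty bit position


d_valid = 0b0000111                       # positions 1, 2, 3 set: 1 ^ 2 ^ 3 == 0
d_error = d_valid ^ (1 << 4)              # corrupt the bit at position 5
d_syndrome = syndrome(d_error)            # points at position 5
d_ok = correct_single_bit(d_error)
```

In this code the syndrome is directly the position of the flipped bit, which is the "decode to find which bit is faulty" step in its simplest form.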
  • FIG. 10 is a flowchart of a process in which an ECC check error occurs according to an embodiment of the present invention, including the following steps:
  • step S1001 an ECC check error occurs.
  • step S1002 the ECC interrupt handler vector_ecc is entered, and the A_error address is tested.
  • the data D_error is obtained from the ECC error capture data register of the memory controller, and the syndrome code D_syndrome is obtained from the ECC syndrome register of the memory controller and decoded to find which bit is faulty, from which the correct data D_ok is calculated.
  • Step S1003 performing a certain number of 0/1 read/write tests on the memory address A_error where the ECC error occurred to determine whether the memory address is still faulty. If the fault does not recur, the error is judged to be a soft error: the previously calculated data D_ok is written back to the address A_error, the ECC interrupt processing exits, and step S1006 is performed. If the fault persists, a hard error is confirmed.
  • step S1004 the obtained correct data D_ok is saved to the specified special address A_ok.
  • step S1005 a hardware breakpoint is made at the fault memory address A_error.
  • step S1006 the process ends.
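The soft/hard decision of step S1003 can be sketched with a simulated memory word. The `Cell` class, the `classify` function, the round count, and the 32-bit width are all assumptions made for this illustration; the source only specifies "a certain number of 0/1 read/write tests".

```python
class Cell:
    """Simulated memory word; bits in `stuck_mask` always read back as `stuck_value`."""

    def __init__(self, stuck_mask=0, stuck_value=0):
        self.stuck_mask = stuck_mask
        self.stuck_value = stuck_value & stuck_mask
        self.value = self.stuck_value

    def write(self, v):
        # Healthy bits take the new value; stuck bits keep their fixed value.
        self.value = (v & ~self.stuck_mask) | self.stuck_value

    def read(self):
        return self.value


def classify(cell, d_ok, rounds=4, width=32):
    """Write all-zeros and all-ones patterns repeatedly and compare the read-back."""
    all_ones = (1 << width) - 1
    for _ in range(rounds):
        for pattern in (0, all_ones):
            cell.write(pattern)
            if cell.read() != pattern:
                return "hard"             # fault persists: go on to S1004/S1005
    cell.write(d_ok)                      # soft error: restore the corrected data
    return "soft"


healthy = Cell()
damaged = Cell(stuck_mask=0b1000)         # bit 3 is stuck at 0
result_healthy = classify(healthy, d_ok=0x55)
result_damaged = classify(damaged, d_ok=0x55)
```

Writing both patterns matters: a bit stuck at 0 passes the all-zeros pattern and a bit stuck at 1 passes the all-ones pattern, so only the pair exposes either kind of defect.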
  • FIG. 11 is a flowchart of a process in which a hardware breakpoint exception occurs, including the following steps, according to an embodiment of the present invention:
  • step S1101 a hardware breakpoint exception occurs.
  • step S1102 the hardware breakpoint exception processing function Vector is entered to determine the cause of the interruption.
  • step S1103 it is determined whether the currently interrupted instruction is that the CPU has accessed the faulty memory address A_error, and if so, step S1105 is performed, and if no, step S1104 is performed.
  • step S1104 processing proceeds according to the normal hardware breakpoint exception handling function Vector, and then step S1110 is performed.
  • step S1105 it is determined whether the hardware breakpoint cause is an instruction breakpoint or a data breakpoint. If it is a data breakpoint, step S1108 is performed, and if it is an instruction breakpoint, step S1106 is performed.
  • step S1107 the instruction C2_new is created, execution jumps to A_ok, and step S1110 is performed.
  • step S1108 the special data breakpoint exception handling function Vector1 is entered.
  • step S1109 new data access instructions C1_new1 and C1_new2 are created, and execution jumps to C1_new1.
  • step S1110 the program returns to the original process and continues to execute.
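The routing decision of steps S1103 and S1105 can be sketched as a small function. It assumes, following the description of the two breakpoint kinds, that an instruction breakpoint is recognized by the instruction's own address matching A_error and a data breakpoint by the operand address matching A_error. The function name and parameters are hypothetical; a real CPU would usually report the breakpoint kind in a debug status register instead.

```python
def dispatch(pc, operand_addr, a_error):
    """Return the name of the handler that should process the breakpoint exception.

    pc           -- address of the interrupted instruction
    operand_addr -- data address the instruction accesses, or None if it has none
    a_error      -- faulty memory address guarded by the hardware breakpoints
    """
    if pc == a_error:
        return "Vector2"   # instruction breakpoint: code fetched from A_error
    if operand_addr == a_error:
        return "Vector1"   # data breakpoint: data read or written at A_error
    return "Vector"        # unrelated to A_error: normal breakpoint handling


A_ERROR = 0x2000
data_case = dispatch(pc=0x0100, operand_addr=0x2000, a_error=A_ERROR)
code_case = dispatch(pc=0x2000, operand_addr=None, a_error=A_ERROR)
other_case = dispatch(pc=0x0100, operand_addr=0x0300, a_error=A_ERROR)
```

Keeping the unrelated case on the original Vector path means ordinary debugger breakpoints continue to work while A_error is guarded.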
  • Embodiments of the present invention also provide a storage medium.
  • the above storage medium may be configured to store program code for performing the following steps:
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store program code.
  • the processor determines, according to the program code stored in the storage medium, that a hard error fault occurs at the first address of the memory;
  • the processor performs, according to the program code stored in the storage medium, error correction on the memory information in the first address, and stores the error-corrected memory information in a fault-free second address in the memory;
  • the processor inserts, according to the program code stored in the storage medium, a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from the access instruction to the first address to the access instruction to the second address.
  • the modules or steps of the above embodiments of the present invention may be implemented by general-purpose computing devices. They may be centralized on a single computing device or distributed over a network of multiple computing devices, and may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in a different order than described here, or they may each be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module.
  • embodiments of the invention are not limited to any specific combination of hardware and software.
  • the memory information in the first address where the hard error occurs can be accessed without interrupting the service or replacing the memory, and the uncorrectable memory unit is avoided, thereby preventing serious consequences such as program errors or system crashes and improving the stability of the system.

Abstract

A method and device for processing a hard memory error, the method comprising: determining that a hard error failure occurs at a first address of a memory (S102); performing error correction on memory information in the first address, and storing the memory information after error correction in a failure-free second address in the memory (S104); and inserting a hardware breakpoint at the first address, wherein the hardware breakpoint is used for monitoring whether the first address is accessed, and jumps from an access instruction for the first address to an access instruction for the second address (S106).

Description

Method and device for processing memory hard errors
Technical Field
The present application relates to, but is not limited to, the field of communications, and in particular to a method and device for processing memory hard errors.
Background
Memory plays a critical role in any computer system: the data used while the system runs, and even the program itself, are stored in memory. If a memory error occurs while a program is running, the result ranges from a program fault to a system crash, so keeping memory operation stable is very important. The physical memory of a system generally consists of several memory chips, each chip contains many storage cells, and each cell stores one bit of data. A memory error may affect a single bit or multiple bits. By cause, memory errors are generally classified as soft errors or hard errors. Soft errors occur randomly, for example because of sudden electronic interference near the memory; memory with an ECC (Error Checking and Correcting) function can detect and correct them. Hard errors are caused by hardware damage or defects, so the affected data is always incorrect and such errors cannot be corrected. Memory usually supports ECC checking and correction, which automatically corrects soft errors but can only detect, not correct, hard errors. In some important embedded applications, such as core carrier-class routers and switches that always carry a large volume of user traffic, a memory failure (here mainly a memory hard error) generally leaves no option but to power the system down and replace the memory. Doing so interrupts service for a period of time, which has serious consequences.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present invention provide a method and device for processing memory hard errors, so that hard errors in memory can be corrected without interrupting service.
According to one embodiment of the present invention, a method for processing a memory hard error is provided, including: determining that a hard error fault has occurred at a first address of a memory; performing error correction on the memory information at the first address, and storing the corrected memory information at a fault-free second address in the memory; and inserting a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
In one implementation, determining that a hard error fault has occurred at the first address of the memory includes: receiving an error checking and correcting (ECC) interrupt signal reported by the memory; and looking up the corresponding first address in an ECC error capture address register according to the ECC interrupt signal.
In one implementation, after the corresponding first address is looked up in the ECC error capture address register according to the ECC interrupt signal, the method further includes: performing a predetermined number of read/write tests on the memory information at the first address to judge whether the first address is faulty; and, when the test result indicates that the first address is faulty, determining that a hard error fault has occurred at the first address.
In one implementation, jumping from the access instruction for the first address to the access instruction for the second address includes: receiving breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address; and performing at least one of the following operations: when the breakpoint exception information indicates that the memory information at the first address is instruction information, jumping from the access instruction that reads the instruction information at the first address to the access instruction for the second address; when the breakpoint exception information indicates that the memory information at the first address is data information, jumping from the access instruction that reads the data information at the first address to the access instruction for the second address; when the breakpoint exception information indicates that the memory information at the first address is data information, jumping from the access instruction that writes the data information at the first address to the access instruction for the second address.
In one implementation, jumping from the access instruction for the first address to the access instruction for the second address includes: receiving an access instruction for the first address; computing the offset from the first address to the second address; and applying the offset to the first address to obtain the second address, then jumping to the access instruction for the second address.
In one implementation, after jumping to the second address, the method further includes: jumping to the instruction that follows the access instruction for the first address.
According to another embodiment of the present invention, a device for processing a memory hard error is provided, including: a determining module, configured to determine that a hard error fault has occurred at a first address of a memory; an error correction module, configured to perform error correction on the memory information at the first address and to store the corrected memory information at a fault-free second address in the memory; and a jump module, configured to insert a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
In one implementation, the determining module includes: a first receiving unit, configured to receive an ECC interrupt signal reported by the memory; and a lookup unit, configured to look up the corresponding first address in an ECC error capture address register according to the ECC interrupt signal.
In one implementation, the determining module further includes: a test unit, configured to perform, after the lookup unit has looked up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal, a predetermined number of read/write tests on the memory information at the first address to judge whether the first address is faulty; and a determining unit, configured to determine that a hard error fault has occurred at the first address when the test result indicates that the first address is faulty.
In one implementation, the jump module further includes: a second receiving unit, configured to receive breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address; a first jump unit, configured to jump, when the breakpoint exception information indicates that the memory information at the first address is instruction information, from the access instruction that reads the instruction information at the first address to the access instruction for the second address; and a second jump unit, configured to perform, when the breakpoint exception information indicates that the memory information at the first address is data information, at least one of the following operations: jumping from the access instruction that reads the data information at the first address to the access instruction for the second address; jumping from the access instruction that writes the data information at the first address to the access instruction for the second address.
According to yet another embodiment of the present invention, a storage medium is also provided. The storage medium is configured to store program code for performing the following steps:
determining that a hard error fault has occurred at a first address of a memory;
performing error correction on the memory information at the first address, and storing the corrected memory information at a fault-free second address in the memory;
inserting a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
With the embodiments of the present invention, it is first determined that a hard error fault has occurred at the first address of the memory; error correction is then performed on the memory information at the first address, and the corrected memory information is stored at a fault-free second address in the memory; finally, a hardware breakpoint is inserted at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address. Because the memory information at the first address, where the hard error fault occurred, is moved to the second address, and an access instruction for the original first address is redirected to the access instruction for the second address, the memory information that belonged to the first address can still be accessed without interrupting service or replacing the memory. The uncorrectable storage cell is bypassed, the resulting program errors or system crashes are avoided, and system stability is improved.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for processing a memory hard error according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a device for processing a memory hard error according to an embodiment of the present invention;
FIG. 3 is a first structural block diagram of a device for processing a memory hard error according to an embodiment of the present invention;
FIG. 4 is a second structural block diagram of a device for processing a memory hard error according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the generation process of the device according to an embodiment of the present invention;
FIG. 6 is a flowchart of creating the special hardware data breakpoint exception handler Vector1 in FIG. 5;
FIG. 7 is a flowchart of creating the special hardware instruction breakpoint exception handler Vector2 in FIG. 5;
FIG. 8 is a flowchart of modifying the system's original hardware breakpoint exception handler Vector in FIG. 5;
FIG. 9 is a flowchart of creating the memory ECC interrupt handler vector_ecc in FIG. 5;
FIG. 10 is a flowchart of the handling performed when an ECC check error occurs, according to an embodiment of the present invention;
FIG. 11 is a flowchart of the handling performed when a hardware breakpoint exception occurs, according to an embodiment of the present invention.
Detailed Description
The present application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Embodiment 1
This embodiment provides a method for processing a memory hard error. FIG. 1 is a flowchart of the method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S102: determine that a hard error fault has occurred at a first address of the memory;
Step S104: perform error correction on the memory information at the first address, and store the corrected memory information at a fault-free second address in the memory;
Step S106: insert a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
Through the above steps, it is first determined that a hard error fault has occurred at the first address of the memory; error correction is then performed on the memory information at the first address, and the corrected memory information is stored at a fault-free second address in the memory; finally, a hardware breakpoint is inserted at the first address to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address. Because the memory information at the first address, where the hard error fault occurred, is moved to the second address, and an access instruction for the original first address is redirected to the access instruction for the second address, the memory information that belonged to the first address can still be accessed without interrupting service or replacing the memory. The uncorrectable storage cell is bypassed, the resulting program errors or system crashes are avoided, and system stability is improved. The above steps may be executed by, but are not limited to, a processor, a CPU, a memory management unit, or the like.
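The flow of steps S104 and S106 can be illustrated with a toy model. The sketch below is a Python simulation, not firmware: the class name `RemappedMemory`, its methods, and the dictionary standing in for the hardware breakpoint are all assumptions made for illustration and do not appear in the application.

```python
class RemappedMemory:
    """Toy word-addressed memory that redirects accesses to a relocated cell."""

    def __init__(self, size):
        self.cells = [0] * size
        # Hypothetical stand-in for armed hardware breakpoints:
        # faulty first address -> fault-free second address.
        self.watch = {}

    def relocate(self, first_addr, second_addr, corrected_value):
        """S104 then S106: store the corrected value, then arm the watch."""
        self.cells[second_addr] = corrected_value
        self.watch[first_addr] = second_addr

    def _resolve(self, addr):
        # Stands in for the breakpoint exception: a watched address is
        # redirected, any other address is used unchanged.
        return self.watch.get(addr, addr)

    def read(self, addr):
        return self.cells[self._resolve(addr)]

    def write(self, addr, value):
        self.cells[self._resolve(addr)] = value


mem = RemappedMemory(16)
mem.cells[3] = 0xFF                      # corrupted content at the faulty address
mem.relocate(first_addr=3, second_addr=12, corrected_value=0x2A)
print(hex(mem.read(3)))                  # the read is served from address 12
```

In this model the faulty cell is never touched again after relocation; every later read or write aimed at it lands on the replacement cell, which is the behaviour the steps above describe.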
In one implementation, determining that a hard error fault has occurred at the first address of the memory includes:
S11: receive an ECC interrupt signal reported by the memory;
S12: look up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal.
After the corresponding first address has been looked up in the ECC error capture address register according to the ECC interrupt signal, the storage function of the first address may additionally be tested to make sure that the fault really is a hard error fault. After step S12, the method may therefore further include:
S13: perform a predetermined number of read/write tests on the memory information at the first address to judge whether the first address is faulty;
S14: when the test result indicates that the first address is faulty, determine that a hard error fault has occurred at the first address.
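Steps S11 to S14 can be sketched as follows. This is a simulation under stated assumptions: `StuckBitCell` models a cell with bits stuck at 0, the test patterns and the default of three rounds are arbitrary illustrative choices, and `on_ecc_interrupt` receives the captured address as a plain argument rather than reading a real register.

```python
class StuckBitCell:
    """Toy 8-bit memory cell whose masked bits are permanently stuck at 0."""

    def __init__(self, stuck_mask=0):
        self.stuck_mask = stuck_mask
        self.value = 0

    def write(self, value):
        # Bits covered by the mask cannot be set: this models a hard fault.
        self.value = value & ~self.stuck_mask & 0xFF

    def read(self):
        return self.value


def is_hard_error(cell, rounds=3, patterns=(0x55, 0xAA, 0xFF, 0x00)):
    """S13: write and read back test patterns `rounds` times.

    Returns True on the first mismatch, which indicates a persistent fault.
    """
    saved = cell.read()
    try:
        for _ in range(rounds):
            for pattern in patterns:
                cell.write(pattern)
                if cell.read() != pattern:
                    return True
        return False
    finally:
        cell.write(saved)  # best-effort restore of the previous content


def on_ecc_interrupt(error_capture_addr, cells):
    """S11, S12, S14: return the captured address if it is a hard error, else None."""
    if is_hard_error(cells[error_capture_addr]):
        return error_capture_addr
    return None
```

A cell that passes every round is treated here as having suffered a soft error, so no relocation would be needed for it.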
In one implementation, jumping from the access instruction for the first address to the access instruction for the second address includes:
S21: receive breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address;
S22: when the breakpoint exception information indicates that the memory information at the first address is instruction information, jump from the access instruction that reads the instruction information at the first address to the access instruction for the second address; when the breakpoint exception information indicates that the memory information at the first address is data information, jump from the access instruction that reads the data information at the first address to the access instruction for the second address; when the breakpoint exception information indicates that the memory information at the first address is data information, jump from the access instruction that writes the data information at the first address to the access instruction for the second address.
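The three cases of step S22 amount to a dispatch on the kind of access that raised the exception. The sketch below models that dispatch; the `Access` enum, the `handle_breakpoint` function, its return tuples, and the dictionary used as memory are hypothetical names introduced only for this illustration.

```python
from enum import Enum


class Access(Enum):
    INSTRUCTION_FETCH = "instruction fetch"
    DATA_READ = "data read"
    DATA_WRITE = "data write"


def handle_breakpoint(access, second_addr, memory, value=None):
    """Redirect one trapped access on the first address to second_addr.

    `memory` is a dict mapping addresses to contents (a toy model).
    Returns a (what_happened, payload) tuple for inspection.
    """
    if access is Access.INSTRUCTION_FETCH:
        # Execution continues with the corrected instruction at second_addr.
        return ("execute", memory[second_addr])
    if access is Access.DATA_READ:
        return ("loaded", memory[second_addr])
    if access is Access.DATA_WRITE:
        memory[second_addr] = value
        return ("stored", value)
    raise ValueError(f"unknown access kind: {access!r}")
```

For example, `handle_breakpoint(Access.DATA_WRITE, 0x300, memory, value=9)` stores 9 at the replacement address and leaves the faulty address untouched.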
In one implementation, jumping from the access instruction for the first address to the access instruction for the second address includes:
S31: receive an access instruction for the first address;
S32: compute the offset from the first address to the second address;
S33: apply the offset to the first address to obtain the second address, and jump to the access instruction for the second address.
In one implementation, after the operation instruction on the second address has been executed, execution jumps to the instruction that follows the instruction that raised the breakpoint exception (that is, the operation instruction that accessed the first address).
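The address arithmetic in steps S31 to S33, together with the return to the following instruction, is small enough to write out. The function names below are illustrative assumptions, and the offset is treated as a signed integer so that the second address may lie above or below the first.

```python
def address_offset(first_addr, second_addr):
    """S32: signed distance from the faulty address to its replacement."""
    return second_addr - first_addr


def redirect(accessed_addr, offset):
    """S33: apply the offset to the trapped address to get the second address."""
    return accessed_addr + offset


def resume_address(trapping_pc, instruction_length):
    """Address of the instruction after the one that raised the exception."""
    return trapping_pc + instruction_length
```

With a first address of 0x2000 and a second address of 0x1000, the offset is negative (-0x1000), and applying it to 0x2000 yields 0x1000 as expected.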
In one implementation, the jump operation can be implemented in software by different functions, which may be written in assembly language or as binary code; this embodiment does not limit the language (for example, C or Java) used to construct the jump function:
After the system starts, A_ok (a designated fault-free special memory address) is used to store the corrected data D_ok.
For memory addresses that hold data information, create a special hardware data breakpoint exception handler, Vector1. Designate three special fault-free memory regions: A1_code1 (the entry address of Vector1), A1_code2 (the address of the corrected data read/write instruction), and A1_stack (the stack frame address of Vector1). Place at A1_code1 a function written in assembly code that does the following:
1. Save the register values of the breakpoint exception context to A1_stack.
2. Analyze the instruction code C1_old that triggered the hardware breakpoint, compute the offset between the designated address A_ok, which holds the correct data, and the address A_error, where the ECC error occurred, and use this offset to create a corrected instruction code C1_new1 (the corrected data read/write instruction).
3. Place the newly created instruction C1_new1 at the designated memory address A1_code2.
4. After C1_new1, append an assembly instruction C1_new2 that jumps to the address of the instruction following the one that triggered the breakpoint, C_old (called C1_old in step 2), that is, to C_old + length of C_old.
5. Restore the register values of the breakpoint exception context from A1_stack.
6. Jump to the newly created instruction C1_new1.
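Steps 2 to 4 of Vector1 build a two-instruction trampoline. The sketch below models only that construction, under several assumptions that are not in the application: instructions are Python dataclasses rather than machine code, every instruction is 4 bytes long, and the memory operand is an absolute address. Saving and restoring registers (steps 1 and 5) is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemInsn:
    op: str          # "load" or "store"
    reg: str         # data register used by the access
    addr: int        # effective address the instruction touches
    length: int = 4  # encoded size in bytes (assumed fixed-width encoding)


@dataclass(frozen=True)
class Jump:
    target: int
    length: int = 4


def build_data_trampoline(c1_old, c1_old_pc, a_error, a_ok, a1_code2):
    """Return {address: instruction} describing the scratch area at a1_code2."""
    offset = a_ok - a_error                               # step 2: address offset
    c1_new1 = replace(c1_old, addr=c1_old.addr + offset)  # step 2: corrected access
    c1_new2 = Jump(target=c1_old_pc + c1_old.length)      # step 4: jump back
    return {
        a1_code2: c1_new1,                                # step 3: place C1_new1
        a1_code2 + c1_new1.length: c1_new2,               # step 4: place C1_new2
    }
```

Jumping to `a1_code2` (step 6) then performs the access against A_ok and falls through into the jump that resumes the interrupted program at the instruction after C1_old.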
For memory addresses that hold instruction information, create a special hardware instruction breakpoint exception handler, Vector2. After the system starts, designate two special memory regions: A2_code (the entry address of Vector2) and A2_stack (the stack frame address of Vector2). Place at A2_code a function in binary code form that does the following:
1. After the corrected content stored at the designated fault-free address A_ok (the corrected original program instruction from A_error), append an assembly instruction C2_new. C2_new is located at A_ok + the length of the instruction represented by the data D_ok at A_ok, and it jumps to the address of the instruction following the one that triggered the breakpoint, C_old, that is, to C_old + length of C_old.
2. Restore the register values of the breakpoint exception context from A2_stack.
3. Jump to A_ok.
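The address bookkeeping in Vector2 can be checked with a few lines of arithmetic. This sketch assumes that C_old is the instruction located at the faulty address and that instruction lengths are known in bytes; the function and parameter names are hypothetical.

```python
def instruction_trampoline(a_ok, d_ok_length, c_old_pc, c_old_length):
    """Return (address of C2_new, jump target of C2_new).

    a_ok         -- fault-free address holding the corrected instruction D_ok
    d_ok_length  -- encoded length of that corrected instruction
    c_old_pc     -- address of the instruction that triggered the breakpoint
    c_old_length -- encoded length of that instruction
    """
    c2_new_addr = a_ok + d_ok_length        # C2_new sits right after D_ok
    jump_target = c_old_pc + c_old_length   # next instruction of the program
    return c2_new_addr, jump_target


def execution_path(a_ok, d_ok_length, c_old_pc, c_old_length):
    """Addresses visited after Vector2 jumps to A_ok, in order."""
    c2_new_addr, jump_target = instruction_trampoline(
        a_ok, d_ok_length, c_old_pc, c_old_length
    )
    return [a_ok, c2_new_addr, jump_target]
```

The path shows the intent: the corrected instruction runs from A_ok, the appended jump runs next, and control returns to the original program just past the faulty location.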
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present invention.
Embodiment 2
This embodiment also provides a device for processing a memory hard error. The device is used to implement the above embodiments and implementations, and what has already been described is not repeated. As used below, the term "module" may denote a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments may be implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 2 is a structural block diagram of a device for processing a memory hard error according to an embodiment of the present invention. As shown in FIG. 2, the device includes:
a determining module 20, configured to determine that a hard error fault has occurred at a first address of the memory;
an error correction module 22, configured to perform error correction on the memory information at the first address and to store the corrected memory information at a fault-free second address in the memory;
a jump module 24, configured to insert a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
FIG. 3 is a first structural block diagram of a device for processing a memory hard error according to an embodiment of the present invention. As shown in FIG. 3, the device includes all the modules shown in FIG. 2, and the determining module 20 includes:
a first receiving unit 30, configured to receive an ECC interrupt signal reported by the memory;
a lookup unit 32, configured to look up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal.
In another embodiment, the determining module 20 further includes:
a test unit 34, configured to perform, after the lookup unit has looked up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal, a predetermined number of read/write tests on the memory information at the first address to judge whether the first address is faulty;
a determining unit 36, configured to determine that a hard error fault has occurred at the first address when the test result indicates that the first address is faulty.
FIG. 4 is a second structural block diagram of a device for processing a memory hard error according to an embodiment of the present invention. As shown in FIG. 4, the device includes all the modules shown in FIG. 2, and the jump module 24 further includes:
a second receiving unit 40, configured to receive breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address;
a first jump unit 42, configured to jump, when the breakpoint exception information indicates that the memory information at the first address is instruction information, from the access instruction that reads the instruction information at the first address to the access instruction for the second address;
a second jump unit 44, configured to perform, when the breakpoint exception information indicates that the memory information at the first address is data information, at least one of the following operations: jumping from the access instruction that reads the data information at the first address to the access instruction for the second address; jumping from the access instruction that writes the data information at the first address to the access instruction for the second address.
It should be noted that each of the above modules may be implemented in software or hardware. In the latter case, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are distributed, in any combination, across different processors.
Embodiment 3
This embodiment relates to a method for keeping an embedded system running normally, without restarting it, when a storage cell in a memory chip suffers unrecoverable hardware damage.
This embodiment provides a method and device that allow an embedded system to keep running normally, without a restart, when an uncorrectable hard error occurs in its memory, which can significantly improve the stability of data communication products in commercial deployment.
In embedded CPU architectures, the memory controller supports ECC checking and correction. As long as the memory in use also supports ECC checking, an ECC interrupt can be reported to the CPU when a memory error occurs, and after the ECC error the CPU provides the faulty address in the ECC error capture address register. The operating system can then handle the ECC interrupt. In addition, the CPU provides a hardware breakpoint function for monitoring whether it is reading or writing a specified memory address. Such accesses fall into two classes, data reads/writes and instruction reads; in either case, once the specified memory address is read or written, an exception is raised and reported to the operating system. The present application uses these two CPU functions: after an ECC interrupt occurs, the correct data is first computed and placed at a specific memory address where no ECC error has occurred, and a hardware breakpoint is then inserted at the memory address where the ECC error occurred. Hardware breakpoints are of two kinds. An instruction breakpoint triggers a breakpoint exception when the CPU reads the memory at that address in order to fetch an instruction; the exception is raised because the address of the instruction itself equals the breakpoint address. A data breakpoint triggers a breakpoint exception when the CPU reads or writes data at that address; the exception is raised because the target address of the instruction's operation equals the breakpoint address. Once a later part of the program touches this memory address again, a special handling flow is entered to avoid the faulty memory address: if a data breakpoint was triggered, the program is made to read or write the address that holds the corrected data; if an instruction breakpoint was triggered, execution jumps directly to the address that holds the corrected data. In this way the system can ignore the memory hard error. With this device, equipment that is very sensitive to service stability, such as carrier-class data communication products, can keep running with service unaffected even after a fatal hardware failure such as a memory hard error.
FIG. 5 is a schematic diagram of the process of generating the device according to an embodiment of the present invention. As shown in FIG. 5, this embodiment is implemented in the following steps:
Step S501: the system starts.
Step S502: designate A_ok, a special fault-free memory address, to hold the error-corrected data D_ok.
Step S503: create the special hardware data breakpoint exception handler Vector1.
Referring to FIG. 6, this includes the following steps:
Step S601: designate three special fault-free memory regions: A1_code1 (the entry address of Vector1), A1_code2 (the address of the corrected data read/write instruction), and A1_stack (the stack frame address of Vector1).
Step S602: place a function, in assembly-code form, at A1_code1. The function does the following: 1. Save the register values of the interrupted context to A1_stack. 2. Analyze the instruction C1_old that triggered the hardware breakpoint, compute the offset between A_ok (the designated address holding the correct data) and A_error (the address where the ECC error occurred), and use this offset to build the corrected data read/write instruction C1_new1. 3. Place the newly created instruction C1_new1 at the designated address A1_code2. 4. After C1_new1, add an instruction C1_new2 that jumps to the address of the instruction following C1_old (the address of C1_old plus the length of C1_old). 5. Restore the register values of the interrupted context from A1_stack. 6. Jump to the newly created instruction C1_new1.
Step S603: Vector1 is complete.
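The address fix-up in item 2 of step S602 can be sketched in C as follows. This is a minimal model under stated assumptions, not the assembly routine the embodiment describes: the `mem_insn` type, its base-plus-displacement addressing form, and the function `retarget_insn` are hypothetical names introduced for illustration, and the sketch assumes the trapped access addresses A_error exactly. A real handler would have to decode and re-encode the CPU's actual instruction format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of a load/store instruction:
 * effective address = base register value + displacement. */
typedef struct {
    int       is_store;      /* 0 = load, 1 = store */
    uintptr_t base_value;    /* value of the base register at the trap */
    intptr_t  displacement;  /* immediate offset encoded in the instruction */
} mem_insn;

/* Address the instruction would access. */
static uintptr_t effective_address(const mem_insn *insn)
{
    return insn->base_value + (uintptr_t)insn->displacement;
}

/* Build C1_new1 from C1_old: the same operation, shifted by the offset
 * (A_ok - A_error) so that it touches the fault-free copy instead. */
static mem_insn retarget_insn(mem_insn c1_old, uintptr_t a_error, uintptr_t a_ok)
{
    /* Assumption: the breakpoint fired because C1_old addressed A_error. */
    assert(effective_address(&c1_old) == a_error);

    uintptr_t offset = a_ok - a_error;   /* wraps correctly if a_ok < a_error */
    mem_insn c1_new1 = c1_old;
    c1_new1.displacement =
        (intptr_t)((uintptr_t)c1_old.displacement + offset);
    return c1_new1;
}
```

Keeping the operation type and base register unchanged and adjusting only the displacement means the rewritten instruction behaves like the original in every respect except the address it reaches.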
Step S504: create the special hardware instruction breakpoint exception handler Vector2.
Referring to FIG. 7, this includes the following steps:
Step S701: after system start-up, designate two special memory regions: A2_code (the entry address of Vector2) and A2_stack (the stack frame address of Vector2).
Step S702: place a function, in binary-code form, at A2_code. The function does the following: 1. After A_ok, the designated fault-free address that holds the correct data (the corrected original program instruction from A_error), add an instruction C2_new. C2_new is located at A_ok plus the length of the instruction represented by the data D_ok at A_ok, and it jumps to the address of the instruction following C_old, the instruction that triggered the breakpoint (the address of C_old plus the length of C_old). 2. Restore the register values of the interrupted context from A2_stack. 3. Jump to A_ok.
Step S703: Vector2 is complete.
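The two addresses involved in item 1 of step S702 can be worked out as in the following C sketch. The `trampoline_plan` type and `plan_trampoline` function are hypothetical names for illustration. The sketch assumes that the faulty instruction C_old starts at A_error and that D_ok holds the whole corrected instruction; emitting the actual jump encoding is architecture-specific and is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where the jump C2_new is placed, and where it jumps to. */
typedef struct {
    uintptr_t c2_new_addr;   /* A_ok + length of the instruction in D_ok */
    uintptr_t resume_addr;   /* address of C_old + length of C_old */
} trampoline_plan;

/* Assumption: C_old begins at a_error and is insn_len bytes long, and the
 * corrected copy at a_ok has the same length. */
static trampoline_plan plan_trampoline(uintptr_t a_error, uintptr_t a_ok,
                                       size_t insn_len)
{
    assert(insn_len > 0);
    trampoline_plan plan;
    plan.c2_new_addr = a_ok + insn_len;     /* right after the corrected copy */
    plan.resume_addr = a_error + insn_len;  /* next instruction in the program */
    return plan;
}
```

For a 4-byte instruction with A_error at 0x4000 and A_ok at 0x9000, the jump would sit at 0x9004 and return control to 0x4004.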
Step S505: modify the system's existing hardware breakpoint exception handler Vector.
Referring to FIG. 8, this includes the following steps:
Step S801: determine whether the interrupted instruction was a CPU access to the faulty memory address A_error. If so, determine whether the hardware breakpoint is an instruction breakpoint or a data breakpoint. For a data breakpoint, jump directly to the special data breakpoint exception handler Vector1 created in step S503. For an instruction breakpoint, jump to the special hardware instruction breakpoint handler Vector2 created in step S504. If the interrupted instruction is unrelated to A_error, proceed with the normal hardware breakpoint exception handler Vector.
Step S802: the modification of Vector is complete.
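The decision made in step S801 amounts to a three-way dispatch, sketched below in C. The enum and function names are hypothetical; on real hardware the trapped address and the breakpoint kind would be read from the CPU's debug or exception registers, which differ between architectures.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { BP_INSTRUCTION, BP_DATA } bp_kind;
typedef enum { USE_VECTOR, USE_VECTOR1, USE_VECTOR2 } handler_choice;

/* Choose the handler for a hardware breakpoint exception (step S801). */
static handler_choice select_handler(uintptr_t trapped_addr, bp_kind kind,
                                     uintptr_t a_error)
{
    assert(kind == BP_INSTRUCTION || kind == BP_DATA);

    if (trapped_addr != a_error)
        return USE_VECTOR;    /* unrelated breakpoint: normal handling */

    /* Data access to A_error goes to Vector1, instruction fetch to Vector2. */
    return (kind == BP_DATA) ? USE_VECTOR1 : USE_VECTOR2;
}
```

Checking the address first keeps ordinary debugger breakpoints on the unmodified path, so the redirection logic only runs for accesses to the faulty location.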
Step S506: create the memory ECC interrupt handler vector_ecc.
Referring to FIG. 9, this includes the following steps:
Step S901: when the CPU reports an ECC check error, determine whether it is a genuine hard error, ruling out soft errors. In the ECC interrupt handler, obtain the failing memory address A_error from the memory controller's ECC error capture address register, obtain the data D_error from the ECC error capture data register, and obtain the syndrome code D_syndrome from the ECC syndrome register; decode the syndrome to find which bit failed and compute the correct data D_ok. Then run a set number of 0/1 read/write tests on A_error to determine whether the address is still faulty. If the fault does not recur, classify it as a soft error, write the previously computed D_ok back to A_error, and exit ECC interrupt handling; the flow ends. If the fault persists, it is confirmed as a hard error: save the correct data D_ok to the designated special address A_ok and set hardware breakpoints at A_error. Because a memory address may hold either data or code, and an address holding code may also be accessed as data, both an instruction breakpoint Bc and a data breakpoint Bd must be set at A_error.
Step S902: vector_ecc is complete.
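How a syndrome identifies the failed bit depends on the code the memory controller uses, which the text does not specify. As an illustration only, the C sketch below uses the classic Hamming(7,4) code, in which the syndrome of a codeword with a single flipped bit equals the 1-based position of that bit. All function names are hypothetical, and real controllers typically use a wider code with their own documented syndrome table.

```c
#include <assert.h>
#include <stdint.h>

/* Syndrome of a 7-bit Hamming codeword stored in bits 1..7 of cw:
 * the XOR of the positions of all set bits. Zero means no error. */
static unsigned hamming_syndrome(uint8_t cw)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if (cw & (1u << pos))
            s ^= pos;
    return s;
}

/* Encode 4 data bits: data at positions 3, 5, 6, 7; parity at 1, 2, 4. */
static uint8_t hamming_encode(uint8_t data4)
{
    static const unsigned data_pos[4] = { 3, 5, 6, 7 };
    assert(data4 < 16);

    uint8_t cw = 0;
    for (unsigned i = 0; i < 4; i++)
        if (data4 & (1u << i))
            cw |= (uint8_t)(1u << data_pos[i]);

    /* Setting the parity bit at position p cancels bit p of the syndrome. */
    unsigned s = hamming_syndrome(cw);
    for (unsigned p = 1; p <= 4; p <<= 1)
        if (s & p)
            cw |= (uint8_t)(1u << p);
    return cw;
}

/* Compute D_ok from D_error and D_syndrome by flipping the indicated bit. */
static uint8_t correct_single_bit(uint8_t d_error, unsigned d_syndrome)
{
    assert(d_syndrome <= 7);
    if (d_syndrome == 0)
        return d_error;                      /* nothing to correct */
    return (uint8_t)(d_error ^ (1u << d_syndrome));
}
```

In the flow of step S901, `d_error` and `d_syndrome` would come from the controller's capture registers rather than being recomputed in software.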
Step S507: the device is complete.
Steps S501 to S507 above constitute the generation of the device, that is, the creation of the software.
FIG. 10 is a flowchart of the handling performed when an ECC check error occurs according to an embodiment of the present invention, including the following steps:
Step S1001: an ECC check error occurs.
Step S1002: enter the ECC interrupt handler vector_ecc and test the address A_error.
In this step, the data D_error is obtained from the memory controller's ECC error capture data register, and the syndrome code D_syndrome is obtained from the ECC syndrome register and decoded to find which bit failed, from which the correct data D_ok is computed.
Step S1003: run a set number of 0/1 read/write tests on the address A_error, where the ECC error occurred, to determine whether the address is still faulty. If the fault does not recur, classify it as a soft error, write the previously computed data D_ok back to A_error, exit ECC interrupt handling, and go to step S1006. If the fault persists, it is confirmed as a hard error.
Step S1004: save the correct data D_ok to the designated special address A_ok.
Step S1005: set hardware breakpoints at the faulty memory address A_error.
Step S1006: the flow ends.
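Steps S1003 to S1005 can be sketched in C as follows. The sketch runs against a simulated memory cell so that it is self-contained; `cell_ops`, `recovery_state`, `sim_cell`, and the function names are hypothetical, and the all-zeros/all-ones patterns are one plausible reading of the "0/1 read/write test". On real hardware the test accesses would also have to reach the memory device itself rather than a cache.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessors for the word at A_error. */
typedef struct {
    void     (*write)(void *ctx, uint32_t value);
    uint32_t (*read)(void *ctx);
    void      *ctx;
} cell_ops;

/* State left behind for the breakpoint handlers. */
typedef struct {
    uint32_t  a_ok_data;   /* contents of the fault-free address A_ok */
    uintptr_t bp_addr;     /* address watched by Bc and Bd, 0 if none */
} recovery_state;

typedef enum { ERR_SOFT, ERR_HARD } err_class;

/* S1003: write all-zeros and all-ones patterns and read them back. */
static err_class classify_error(const cell_ops *cell, unsigned rounds)
{
    static const uint32_t patterns[2] = { 0x00000000u, 0xFFFFFFFFu };
    assert(rounds > 0);
    for (unsigned r = 0; r < rounds; r++) {
        for (unsigned i = 0; i < 2; i++) {
            cell->write(cell->ctx, patterns[i]);
            if (cell->read(cell->ctx) != patterns[i])
                return ERR_HARD;        /* fault persists */
        }
    }
    return ERR_SOFT;                    /* fault did not recur */
}

/* S1003 to S1005: repair a soft error in place, or redirect a hard one. */
static err_class handle_ecc_error(const cell_ops *cell, uintptr_t a_error,
                                  uint32_t d_ok, unsigned rounds,
                                  recovery_state *st)
{
    err_class cls = classify_error(cell, rounds);
    if (cls == ERR_SOFT) {
        cell->write(cell->ctx, d_ok);   /* write D_ok back to A_error */
    } else {
        st->a_ok_data = d_ok;           /* S1004: keep D_ok at A_ok */
        st->bp_addr   = a_error;        /* S1005: arm Bc and Bd at A_error */
    }
    return cls;
}

/* Simulated cell for demonstration: bits set in stuck_low always read 0. */
typedef struct { uint32_t value; uint32_t stuck_low; } sim_cell;

static void sim_write(void *ctx, uint32_t v)
{
    ((sim_cell *)ctx)->value = v;
}

static uint32_t sim_read(void *ctx)
{
    const sim_cell *c = ctx;
    return c->value & ~c->stuck_low;
}
```

Because the pattern test overwrites the cell, D_ok has to be computed before the test starts, which matches the order of steps S1002 and S1003.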
FIG. 11 is a flowchart of the handling performed when a hardware breakpoint exception occurs according to an embodiment of the present invention, including the following steps:
Step S1101: a hardware breakpoint exception occurs.
Step S1102: enter the hardware breakpoint exception handler Vector and determine the cause of the exception.
Step S1103: determine whether the interrupted instruction was a CPU access to the faulty memory address A_error. If so, go to step S1105; otherwise, go to step S1104.
Step S1104: proceed with the normal hardware breakpoint exception handler Vector, then go to step S1110.
Step S1105: determine whether the hardware breakpoint is an instruction breakpoint or a data breakpoint. For a data breakpoint, go to step S1108; for an instruction breakpoint, go to step S1106.
Step S1106: enter the special hardware instruction breakpoint handler Vector2.
Step S1107: create the instruction C2_new and jump to A_ok, then go to step S1110.
Step S1108: enter the special data breakpoint exception handler Vector1.
Step S1109: create the new data access instructions C1_new1 and C1_new2, and jump to C1_new1.
Step S1110: return to the program's original flow and continue execution.
As shown in FIGS. 10 and 11, with the above device, once a hard memory error occurs while the system is running, the faulty memory address is redirected to a fault-free memory address. When the CPU accesses the faulty address, it accesses the redirected address instead, bypassing the uncorrectable memory cell. This avoids serious consequences such as program errors or system crashes and improves system stability.
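The net effect of the redirection can be modelled in a few lines of C. This is a software simulation only, with word indices standing in for A_error and A_ok; the `redirect` type and the `resolve`, `load_word`, and `store_word` functions are hypothetical. In the described device the check is performed by the hardware breakpoint and its exception handlers, not by the accessing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t error_index;  /* word index standing in for A_error */
    size_t ok_index;     /* word index standing in for A_ok */
    int    armed;        /* nonzero once the breakpoints are set */
} redirect;

/* Map an accessed index to the index that is really used. */
static size_t resolve(const redirect *r, size_t index)
{
    return (r->armed && index == r->error_index) ? r->ok_index : index;
}

static uint32_t load_word(const uint32_t *mem, size_t len,
                          const redirect *r, size_t index)
{
    size_t real = resolve(r, index);
    assert(real < len);
    return mem[real];
}

static void store_word(uint32_t *mem, size_t len,
                       const redirect *r, size_t index, uint32_t value)
{
    size_t real = resolve(r, index);
    assert(real < len);
    mem[real] = value;
}
```

Both reads and writes go through the same mapping, so the program keeps seeing one consistent value for the faulty location even though the physical cell is never touched again.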
Embodiment 4
An embodiment of the present invention further provides a storage medium. In this embodiment, the storage medium may be configured to store program code for performing the following steps:
S1: determine that a hard error fault has occurred at a first address of the memory;
S2: perform error correction on the memory information at the first address, and store the error-corrected memory information at a fault-free second address in the memory;
S3: insert a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
In this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In this embodiment, the processor, according to the program code stored in the storage medium, determines that a hard error fault has occurred at the first address of the memory;
In this embodiment, the processor, according to the program code stored in the storage medium, performs error correction on the memory information at the first address and stores the error-corrected memory information at a fault-free second address in the memory;
In this embodiment, the processor, according to the program code stored in the storage medium, inserts a hardware breakpoint at the first address, where the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
For examples of this embodiment, refer to the examples described in the foregoing embodiments and implementations; they are not repeated here.
The modules or steps of the embodiments of the present invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. They may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in a different order than presented here. Alternatively, they may be fabricated as separate integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
The above are merely embodiments of the present invention and are not intended to limit the present application. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.
Industrial Applicability
With the embodiments of the present invention, the memory information at the first address, where the hard error occurred, remains accessible without interrupting service or replacing the memory. The uncorrectable memory cell is bypassed, which avoids serious consequences such as program errors or system crashes and improves system stability.

Claims (11)

  1. A method for handling a hard memory error, comprising:
    determining that a hard error fault has occurred at a first address of a memory;
    performing error correction on memory information at the first address, and storing the error-corrected memory information at a fault-free second address in the memory;
    inserting a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
  2. The method according to claim 1, wherein determining that a hard error fault has occurred at the first address of the memory comprises:
    receiving an error checking and correction (ECC) interrupt signal reported by the memory;
    looking up the corresponding first address in an ECC error capture address register according to the ECC interrupt signal.
  3. The method according to claim 2, wherein after the corresponding first address is looked up in the ECC error capture address register according to the ECC interrupt signal, the method further comprises:
    performing a predetermined number of read/write tests on the memory information at the first address to determine whether the first address is faulty;
    determining that a hard error fault has occurred at the first address when the test result indicates that the first address is faulty.
  4. The method according to claim 1, wherein jumping from the access instruction for the first address to the access instruction for the second address comprises:
    receiving breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address;
    performing at least one of the following:
    when the breakpoint exception information indicates that the memory information at the first address is instruction information, jumping from an access instruction that reads the instruction information at the first address to the access instruction for the second address;
    when the breakpoint exception information indicates that the memory information at the first address is data information, jumping from an access instruction that reads the data information at the first address to the access instruction for the second address;
    when the breakpoint exception information indicates that the memory information at the first address is data information, jumping from an access instruction that writes the data information at the first address to the access instruction for the second address.
  5. The method according to claim 1, wherein jumping from the access instruction for the first address to the access instruction for the second address comprises:
    receiving an access instruction for the first address;
    calculating the offset from the first address to the second address;
    applying the offset to the first address to obtain the second address, and jumping to the access instruction for the second address.
  6. The method according to claim 5, wherein after jumping to the second address, the method further comprises:
    jumping to the instruction that follows the access instruction for the first address.
  7. A device for handling a hard memory error, comprising:
    a determining module, configured to determine that a hard error fault has occurred at a first address of a memory;
    an error correction module, configured to perform error correction on memory information at the first address and to store the error-corrected memory information at a fault-free second address in the memory;
    a jump module, configured to insert a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
  8. The device according to claim 7, wherein the determining module comprises:
    a first receiving unit, configured to receive an error checking and correction (ECC) interrupt signal reported by the memory;
    a lookup unit, configured to look up the corresponding first address in an ECC error capture address register according to the ECC interrupt signal.
  9. The device according to claim 8, wherein the determining module further comprises:
    a test unit, configured to perform, after the lookup unit has looked up the corresponding first address in the ECC error capture address register according to the ECC interrupt signal, a predetermined number of read/write tests on the memory information at the first address to determine whether the first address is faulty;
    a determining unit, configured to determine that a hard error fault has occurred at the first address when the test result indicates that the first address is faulty.
  10. The device according to claim 7, wherein the jump module further comprises:
    a second receiving unit, configured to receive breakpoint exception information triggered by the hardware breakpoint in response to an access operation on the first address;
    a first jump unit, configured to jump, when the breakpoint exception information indicates that the memory information at the first address is instruction information, from an access instruction that reads the instruction information at the first address to the access instruction for the second address;
    a second jump unit, configured to perform, when the breakpoint exception information indicates that the memory information at the first address is data information, at least one of the following: jumping from an access instruction that reads the data information at the first address to the access instruction for the second address; jumping from an access instruction that writes the data information at the first address to the access instruction for the second address.
  11. A storage medium, configured to store program code for performing the following steps:
    determining that a hard error fault has occurred at a first address of a memory;
    performing error correction on memory information at the first address, and storing the error-corrected memory information at a fault-free second address in the memory;
    inserting a hardware breakpoint at the first address, wherein the hardware breakpoint is used to monitor whether the first address is accessed and to jump from an access instruction for the first address to an access instruction for the second address.
PCT/CN2017/083815 2016-06-16 2017-05-10 Method and device for processing hard memory error WO2017215377A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610436762.5A CN107516547A (en) 2016-06-16 2016-06-16 Method and device for processing hard memory error
CN201610436762.5 2016-06-16

Publications (1)

Publication Number Publication Date
WO2017215377A1 true WO2017215377A1 (en) 2017-12-21

Family

ID=60663956

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/083815 WO2017215377A1 (en) 2016-06-16 2017-05-10 Method and device for processing hard memory error

Country Status (2)

Country Link
CN (1) CN107516547A (en)
WO (1) WO2017215377A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256502A (en) * 2020-09-25 2021-01-22 新华三半导体技术有限公司 Memory performance test method, device and chip
CN114780283A (en) * 2022-06-20 2022-07-22 新华三信息技术有限公司 Fault processing method and device
CN117440166A (en) * 2023-09-19 2024-01-23 北京麟卓信息科技有限公司 Video coding and decoding mode detection method based on memory access characteristic analysis
CN117440166B (en) * 2023-09-19 2024-04-26 北京麟卓信息科技有限公司 Video coding and decoding mode detection method based on memory access characteristic analysis

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108307246B (en) * 2018-01-09 2020-02-07 武汉斗鱼网络科技有限公司 Method, storage medium, equipment and system for calculating popularity of live broadcast room
EP3557422A4 (en) * 2018-03-09 2020-01-08 Shenzhen Goodix Technology Co., Ltd. Method for accessing code sram, and electronic device
CN110858345A (en) * 2018-08-23 2020-03-03 阿里巴巴集团控股有限公司 Material detection method and device
CN109388511B (en) * 2018-09-14 2021-05-18 联想(北京)有限公司 Information processing method, electronic equipment and computer storage medium
KR102589402B1 (en) 2018-10-04 2023-10-13 삼성전자주식회사 Storage device and method for operating storage device
CN111625387B (en) * 2020-05-27 2024-03-29 北京金山云网络技术有限公司 Memory error processing method, device and server
CN111966521B (en) * 2020-08-17 2023-10-13 成都海光集成电路设计有限公司 Hardware error processing method, processor, controller, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117236A (en) * 2008-12-30 2011-07-06 英特尔公司 Enabling an integrated memory controller to transparently work with defective memory devices
CN103207830A (en) * 2012-01-13 2013-07-17 上海华虹集成电路有限责任公司 Simulator with software breakpoint
CN103631721A (en) * 2012-08-23 2014-03-12 华为技术有限公司 Method and system for isolating bad blocks in internal storage
CN103942119A (en) * 2013-12-26 2014-07-23 杭州华为数字技术有限公司 Method and device for processing memory errors

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7487397B2 (en) * 2005-10-27 2009-02-03 International Business Machines Corporation Method for cache correction using functional tests translated to fuse repair
CN101154183B (en) * 2006-09-29 2011-12-28 上海海尔集成电路有限公司 Microcontroller built-in type on-line simulation debugging system
US9009574B2 (en) * 2011-06-07 2015-04-14 Marvell World Trade Ltd. Identification and mitigation of hard errors in memory systems
US9250990B2 (en) * 2013-09-24 2016-02-02 Intel Corporation Use of error correction pointers to handle errors in memory

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256502A (en) * 2020-09-25 2021-01-22 新华三半导体技术有限公司 Memory performance test method, device and chip
CN112256502B (en) * 2020-09-25 2023-11-21 新华三半导体技术有限公司 Memory performance testing method, device and chip
CN114780283A (en) * 2022-06-20 2022-07-22 新华三信息技术有限公司 Fault processing method and device
CN114780283B (en) * 2022-06-20 2022-11-01 新华三信息技术有限公司 Fault processing method and device
CN117440166A (en) * 2023-09-19 2024-01-23 北京麟卓信息科技有限公司 Video coding and decoding mode detection method based on memory access characteristic analysis
CN117440166B (en) * 2023-09-19 2024-04-26 北京麟卓信息科技有限公司 Video coding and decoding mode detection method based on memory access characteristic analysis

Also Published As

Publication number Publication date
CN107516547A (en) 2017-12-26

Similar Documents

Publication Publication Date Title
WO2017215377A1 (en) Method and device for processing hard memory error
US10025649B2 (en) Data error detection in computing systems
KR101374455B1 (en) Memory errors and redundancy
JP4617405B2 (en) Electronic device for detecting defective memory, defective memory detecting method, and program therefor
EP2787440B1 (en) Information processing device, program, and method
CN107315616B (en) Firmware loading method and device and electronic equipment
WO2021135280A1 (en) Data check method for distributed storage system, and related apparatus
JP4387968B2 (en) Fault detection apparatus and fault detection method
CN112053737B (en) Online parallel processing soft error real-time error detection and recovery method and system
US8176388B1 (en) System and method for soft error scrubbing
US9037948B2 (en) Error correction for memory systems
CN114860487A (en) Memory fault identification method and memory fault isolation method
CN114385418A (en) Protection method, device, equipment and storage medium for communication equipment
JP2009295252A (en) Semiconductor memory device and its error correction method
CN115729477A (en) Distributed storage IO path data writing and reading method, device and equipment
CN115421960A (en) UE memory fault recovery method, device, electronic equipment and medium
CN111625199A (en) Method and device for improving reliability of data path of solid state disk, computer equipment and storage medium
CN111352754A (en) Data storage error detection and correction method and data storage device
CN113448760B (en) Method, system, equipment and medium for recovering abnormal state of hard disk
CN117453146B (en) Data reading method, system, eFlash controller and storage medium
CN115509800B (en) Metadata verification method, system, computer equipment and storage medium
US9921906B2 (en) Performing a repair operation in arrays
TWI594262B (en) Method for Testing Memory Module
CN113626246A (en) Single-bit overturning fast repairing method and device, computer equipment and storage medium
CN117234789A (en) Verification and error correction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17812488; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17812488; Country of ref document: EP; Kind code of ref document: A1)