CN103218275A - Data error repairing method, device and equipment - Google Patents

Publication number
CN103218275A
Authority
CN
China
Prior art keywords
memory
data
physical address
failure
error
Prior art date
Application number
CN2013101053162A
Other languages
Chinese (zh)
Other versions
CN103218275B (en
Inventor
傅汝丹
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201310105316.2A
Publication of CN103218275A
Application granted
Publication of CN103218275B

Abstract

The invention discloses a data error repair method, apparatus and device, and belongs to the field of terminal equipment. The method comprises: determining whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory; and, if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out. By distinguishing the failure type of the memory according to the stored physical addresses and repairing according to that type, the invention avoids system hangs, boot failures and similar conditions caused by accumulated data errors, and keeps services running normally.

Description

Data error repair method, apparatus and device

Technical Field

[0001] The present invention relates to the field of computer technology, and in particular to a data error repair method, apparatus and device.

Background

[0002] Memory is an essential component of a computer system and is usually present, in the form of memory modules, in systems of different architectures. During system operation, memory may suffer hard failures or soft failures. A hard failure is an unrecoverable data error caused by a hardware problem; a soft failure is a data error caused by a data bit flip that can be recovered by power cycling or restarting. To keep the system running normally, the data errors introduced by hard and soft failures need to be repaired.

[0003] Repair methods in the prior art generally add an ECC (Error Checking and Correction) check chip to the memory module. When a data error occurs in memory and the ECC detects it, the correct data is output to the user.

[0004] In the course of implementing the present invention, the inventor found that the prior art has at least the following problems:

[0005] ECC only outputs the correct data to the user when a data error occurs; it takes no action to repair the erroneous data in memory. ECC cannot effectively distinguish hard failures from soft failures and therefore cannot repair the erroneous data, so erroneous data accumulates and can easily cause the system to hang or fail to start, disrupting normal services.

Summary of the Invention

[0006] To solve the problem of distinguishing and handling soft and hard failures, embodiments of the present invention provide a data error repair method, apparatus and device. The technical solutions are as follows:

[0007] In a first aspect, a data error repair method is provided, the method comprising:

[0008] determining whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;

[0009] if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out.

[0010] With reference to the first aspect, in a first possible implementation of the embodiments of the present invention, the step of determining the failure type of the memory, if the preset counter has overflowed, according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can subsequently be carried out, comprises:

[0011] if identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, determining that the failure type of the data error corresponding to the identical physical addresses is a hard failure.

[0012] With reference to the first aspect, in a second possible implementation of the embodiments of the present invention, the step of determining the failure type of the memory, if the preset counter has overflowed, according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can subsequently be carried out, comprises:

[0013] if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, performing a memory patrol;

[0014] after the patrol ends, determining whether the data error in the memory has been repaired;

[0015] if the data error has not been repaired, determining that the failure type of the data error is a hard failure;

[0016] if the data error has been repaired, determining that the failure type of the data error is a soft failure.

[0017] With reference to the first aspect, in a third possible implementation of the embodiments of the present invention, the step of performing a memory patrol if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred comprises:

[0018] if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, converting the physical addresses in the preset counter into patrol addresses;

[0019] patrolling the data in the memory corresponding to the patrol addresses.

[0020] With reference to the first aspect, in a fourth possible implementation of the embodiments of the present invention, after the step of determining the failure type of the memory, if the preset counter has overflowed, according to the physical addresses stored in the memory at which data errors occurred, the method further comprises:

[0021] when the failure type of the memory is determined to be a hard failure, obtaining the physical address of the data error whose failure type is a hard failure;

[0022] triggering an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is a hard failure.

[0023] With reference to the first aspect, in a fifth possible implementation of the embodiments of the present invention, before determining whether the preset counter in memory has overflowed, the method further comprises:

[0024] when a data error occurs in the memory, obtaining the physical address at which the data error occurred;

[0025] storing the physical address at which the data error occurred in the memory, and performing a data write-back to that physical address.

[0026] In a second aspect, a data error repair apparatus is provided, the apparatus comprising:

[0027] a judging module, configured to determine whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;

[0028] a determining module, configured to determine, if the preset counter has overflowed, the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out.

[0029] With reference to the second aspect, in a first possible implementation of the embodiments of the present invention, the determining module is configured to determine, if identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, that the failure type of the data error corresponding to the identical physical addresses is a hard failure.

[0030] With reference to the second aspect, in a second possible implementation of the embodiments of the present invention, the determining module comprises:

[0031] a patrol unit, configured to perform a memory patrol if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred;

[0032] a judging unit, configured to determine, after the patrol ends, whether the data error in the memory has been repaired;

[0033] a determining unit, configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired;

[0034] the determining unit being configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.

[0035] With reference to the second aspect, in a third possible implementation of the embodiments of the present invention, the patrol unit comprises:

[0036] a patrol address conversion subunit, configured to convert the physical addresses in the preset counter into patrol addresses if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred;

[0037] a patrol subunit, configured to patrol the data in the memory corresponding to the patrol addresses.

[0038] With reference to the second aspect, in a fourth possible implementation of the embodiments of the present invention, the apparatus further comprises:

[0039] a hard-failure physical address obtaining module, configured to obtain, when the failure type of the memory is determined to be a hard failure, the physical address of the data error whose failure type is a hard failure;

[0040] a triggering module, configured to trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is a hard failure.

[0041] With reference to the second aspect, in a fifth possible implementation of the embodiments of the present invention, the apparatus further comprises:

[0042] a data-error physical address obtaining module, configured to obtain, when a data error occurs in the memory, the physical address at which the data error occurred;

[0043] a storage module, configured to store the physical address at which the data error occurred in the memory;

[0044] a write-back module, configured to perform a data write-back to the physical address at which the data error occurred.

[0045] In a third aspect, a data error repair device is provided, the device comprising:

[0046] a memory, configured to store data and the physical addresses at which data errors occurred;

[0047] a processor, configured to determine whether a preset counter in the memory has overflowed, the preset counter being used to count data errors occurring in the memory;

[0048] the processor being further configured to determine, if the preset counter has overflowed, the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out.

[0049] The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:

[0050] The data error repair method, apparatus and device provided by the embodiments of the present invention determine whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory, and, if the preset counter has overflowed, determine the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out. With these technical solutions, the failure type of the memory can be distinguished effectively and repaired accordingly, which avoids system hangs, boot failures and similar conditions caused by accumulated data errors and keeps services running normally.

Brief Description of the Drawings

[0051] To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

[0052] FIG. 1 is a flowchart of a data error repair method provided in an embodiment of the present invention;

[0053] FIG. 2 is a flowchart of a data error repair method provided in an embodiment of the present invention;

[0054] FIG. 3 is a schematic structural diagram of a data error repair apparatus provided in an embodiment of the present invention;

[0055] FIG. 4 is a schematic structural diagram of a data error repair device provided in an embodiment of the present invention.

Detailed Description

[0056] To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0057] In the embodiments of the present invention, a terminal device is a device that provides a user with data processing functions and voice and/or data connectivity, and includes wireless terminals and wired terminals. A wireless terminal may be a handheld device with a wireless connection function, or another processing device connected to a wireless modem, that is, a mobile terminal that communicates with one or more core networks via a radio access network. For example, a wireless terminal may be a mobile phone (also called a "cellular" phone) or a computer with a mobile terminal. As another example, a wireless terminal may be a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile apparatus. As a further example, a wireless terminal may be a mobile station, an access point, or user equipment (UE).

[0058] FIG. 1 is a flowchart of a data error repair method provided in an embodiment of the present invention. The method is executed by a terminal device. Referring to FIG. 1, the method includes:

[0059] 101: Determine whether a preset counter in memory has overflowed, the preset counter being used to store the physical addresses at which data errors occur in the memory;

[0060] The preset counter is a space set aside in memory in advance. Its size is set by the technician during design and is not specifically limited by the embodiments of the present invention.

[0061] Preferably, the preset counter reads the ECC register at regular intervals, and when the flag bit read from the ECC register indicates that data in memory is erroneous, the value of the preset counter is incremented by 1. Further, at every preset duration the value of the preset counter is decremented by 1, the preset duration being longer than the read interval. When the value of the preset counter exceeds the overflow threshold, the preset counter overflows. Parameters of the preset counter such as the read interval, the preset duration and the overflow threshold may be set by the technician and are not specifically limited by the embodiments of the present invention.
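
The counting behaviour in paragraph [0061] resembles a leaky bucket. The following is a minimal sketch under stated assumptions: time is modelled as integer ticks, the names `ErrorCounter`, `read_interval`, `decay_interval` and `overflow_threshold` are hypothetical, and the floor at zero when decrementing is an assumption that the source does not state.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounter:
    """Leaky-bucket style error counter (illustrative, not the patent's code)."""
    overflow_threshold: int  # hypothetical value chosen by the designer
    read_interval: int       # ticks between polls of the ECC flag
    decay_interval: int      # ticks between decrements; must exceed read_interval
    value: int = 0

    def __post_init__(self):
        # [0061]: the preset duration is longer than the read interval.
        if self.decay_interval <= self.read_interval:
            raise ValueError("decay_interval must be longer than read_interval")

    def tick(self, now: int, ecc_error_flag: bool) -> bool:
        """Process tick `now`; return True if the counter has overflowed."""
        if now % self.read_interval == 0 and ecc_error_flag:
            self.value += 1  # an error was flagged at this poll
        if now % self.decay_interval == 0 and self.value > 0:
            self.value -= 1  # periodic decay (floor at 0 is an assumption)
        return self.value > self.overflow_threshold


burst = ErrorCounter(overflow_threshold=3, read_interval=1, decay_interval=10)
burst_results = [burst.tick(now, True) for now in range(1, 5)]

sparse = ErrorCounter(overflow_threshold=3, read_interval=1, decay_interval=10)
sparse_results = [sparse.tick(now, now in (5, 15)) for now in range(1, 21)]
```

With these hypothetical parameters, a burst of errors on consecutive polls pushes the counter past the threshold, while isolated errors decay away before they can accumulate.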

[0062] When the terminal device determines whether the preset counter in memory has overflowed, a corresponding instruction may be triggered when the value of the preset counter exceeds the overflow threshold. On receiving the instruction triggered by the overflow, the terminal device determines that the preset counter has overflowed; otherwise, it determines that the preset counter has not overflowed.

[0063] Preferably, the data error is a single-bit error. When the preset counter overflows, the failure type of the data at the stored physical addresses where single-bit data errors occurred must be determined and handled, to prevent multi-bit data errors from occurring. When the preset counter has not overflowed, few such physical addresses are stored in memory, and the data at those addresses need not be processed.

[0064] 102: If the preset counter has overflowed, determine the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out.

[0065] Memory failure types are divided into soft failures and hard failures. A data error from a soft failure can be written back, that is, the correct data is written to the physical address of the soft failure. A data error from a hard failure cannot be written back and can only be resolved by manually replacing the corresponding memory.

[0066] If the preset counter has overflowed, the terminal device needs to determine the failure type of the memory corresponding to the physical addresses stored in memory. The terminal device may repeatedly read the data at a stored physical address and determine whether that data has been repaired. If the data has been repaired, the failure type of the memory at that physical address is a soft failure; if the data has not been repaired, the failure type is a hard failure. For example, if the data at a physical address has been read several times and detection shows that it is still erroneous, the failure type of the memory at that physical address is a hard failure.

[0067] Preferably, whether the data at a stored physical address has been repaired can be detected by means of the ECC register.

[0068] When the failure type for a stored physical address is determined to be a soft failure, the correct data is written back to that physical address. When the failure type is determined to be a hard failure, the correct data cannot be written back to that physical address; accordingly, the user is notified that the memory error is a hard failure and that the memory corresponding to the physical address must be replaced manually, to prevent accumulated multi-bit errors from causing problems such as a system hang.
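
Paragraphs [0066] to [0068] can be sketched against a simulated memory. Everything here is an assumption made for illustration: `SimulatedMemory` stands in for real DRAM and the ECC register, a "stuck" cell models a hardware defect, and the default of three re-reads is arbitrary.

```python
class SimulatedMemory:
    """Toy model: stuck cells never recover, flipped cells recover on write-back."""

    def __init__(self, stuck_addresses):
        self.stuck = set(stuck_addresses)  # cells with a hardware defect
        self.flipped = set()               # cells holding a transient bit flip

    def inject_bit_flip(self, address):
        self.flipped.add(address)

    def write_back(self, address):
        # Rewriting correct data clears a transient flip but not a stuck cell.
        if address not in self.stuck:
            self.flipped.discard(address)

    def read_has_error(self, address):
        # Stands in for checking the ECC flag after reading `address`.
        return address in self.stuck or address in self.flipped


def classify_failure(memory, address, reads=3):
    """Write back once, then re-read; an error that persists means "hard"."""
    memory.write_back(address)
    still_bad = all(memory.read_has_error(address) for _ in range(reads))
    return "hard" if still_bad else "soft"


def handle_address(memory, address, alerts):
    """Classify one address and record an alert when replacement is needed."""
    kind = classify_failure(memory, address)
    if kind == "hard":
        alerts.append(address)  # prompt the user to replace this memory
    return kind


memory = SimulatedMemory(stuck_addresses={0x2000})
memory.inject_bit_flip(0x1000)
alerts = []
soft_result = handle_address(memory, 0x1000, alerts)
hard_result = handle_address(memory, 0x2000, alerts)
```

In this model the soft failure is repaired as a side effect of classification, while the hard failure ends up in `alerts` for manual replacement.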

[0069] The data error repair method provided by the embodiments of the present invention determines whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory, and, if the preset counter has overflowed, determines the failure type of the memory according to the physical addresses, stored in the memory, at which data errors occurred, so that the corresponding repair can subsequently be carried out. With this technical solution, the failure type of the memory can be distinguished effectively and repaired accordingly, which avoids system hangs, boot failures and similar conditions caused by accumulated data errors and keeps services running normally.

[0070] FIG. 2 is a flowchart of a data error repair method provided in an embodiment of the present invention. The method is executed by a terminal device, and the description takes single-bit data errors as an example. Referring to FIG. 2, the method includes:

[0071] 201: When a data error occurs in memory, obtain the physical address at which the data error occurred;

[0072] When a single-bit data error occurs in memory, the terminal device obtains the physical address corresponding to the single-bit data error detected by the ECC.

[0073] 202: Store the physical address at which the data error occurred in memory, and perform a data write-back to that physical address;

[0074] Specifically, after obtaining the physical address at which the data error occurred, the terminal device stores that physical address in memory and, at the same time, starts the Demand Scrubbing function, which writes the correct data back to the physical address where the single-bit data error occurred in order to repair the error.

[0075] When the failure type of the single-bit data error is a soft failure, Demand Scrubbing can, ideally, write the correct data back. When the failure type is a hard failure, Demand Scrubbing cannot write the correct data back. The terminal device therefore needs to determine the failure type of the memory in which the single-bit data error occurred: if it is a soft failure, the single-bit data error has already been repaired; if it is a hard failure, the single-bit data error has not been repaired and further handling is required.

[0076] 203: Determine whether the preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory. If yes, perform step 204; if no, continue to perform step 203;

[0077] When the value of the preset counter is greater than the overflow threshold, the preset counter overflows.

[0078] It should be noted that when the preset counter has not overflowed, the subsequent steps are not performed, and the terminal device continues to determine whether the preset counter has overflowed.

[0079] Steps 201-202 are optional. Storing the physical address at which a data error occurred in memory may serve as the trigger for step 203, so that step 203 is performed whenever such an address is stored. In another embodiment provided by the present invention, step 203 may instead be performed at every preset duration, without using the occurrence of a single-bit error and the storing of its physical address as the trigger.

[0080] 204: Determine whether identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred. If yes, perform step 205; if no, perform step 206;

[0081] If the preset counter has overflowed, the terminal device reads each physical address stored in the memory at which a data error occurred and determines whether any of these physical addresses are identical.

[0082] If two or more identical physical addresses are found, it is determined that identical physical addresses exist in the memory; if not, it is determined that no identical physical addresses exist in the memory.
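
The check in step 204 amounts to looking for repeated entries in the stored address list. A minimal sketch, assuming the stored addresses are available as a Python list (the name `error_log` and the function `repeated_addresses` are hypothetical):

```python
from collections import Counter


def repeated_addresses(error_log):
    """Return the set of addresses that appear two or more times in the log."""
    counts = Counter(error_log)
    return {address for address, n in counts.items() if n >= 2}


with_repeat = repeated_addresses([0x1000, 0x2000, 0x1000, 0x3000])
without_repeat = repeated_addresses([0x1000, 0x2000, 0x3000])
```

A non-empty result corresponds to the "yes" branch (step 205), an empty result to the "no" branch (step 206).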

[0083] 205: If identical physical addresses exist in the memory, determine that the failure type of the single-bit data error corresponding to the identical physical addresses is a hard failure, and perform step 211;

[0084] As step 202 shows, for a soft failure, after the physical address of the single-bit data error is obtained, that address is stored in memory and, at the same time, the Demand Scrubbing function is started, which writes the correct data to the physical address. Therefore, if the failure type of the single-bit data error is a soft failure, the correct data is written to the physical address; when the address is checked again its data is correct, so the address is not written to memory again. In other words, the physical address of a soft failure is stored in memory only once. If the failure type is a hard failure, the write-back cannot write the correct data to the physical address, so the erroneous data there is not repaired; when the address is checked again, it is written to memory again. A physical address with a hard failure may therefore be stored in memory multiple times.

[0085] Because the physical address of a single-bit data error may be stored in memory multiple times, if two or more identical physical addresses exist in memory, it is determined that the failure type of the single-bit data error corresponding to the identical physical addresses is a hard failure.
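
The reasoning in [0084] can be illustrated with a small simulation. This is a sketch under assumptions: detection is modelled as discrete rounds, `stuck` addresses represent hard failures that a write-back cannot clear, and `transient_flips` represent soft failures; the function name and parameters are hypothetical.

```python
def run_detections(stuck, transient_flips, rounds):
    """Simulate `rounds` detection passes; return the logged error addresses."""
    flipped = set(transient_flips)
    log = []
    for _ in range(rounds):
        # Every address currently in error is detected in this pass.
        for address in sorted(set(stuck) | flipped):
            log.append(address)       # step 202: store the physical address
            flipped.discard(address)  # write-back clears only transient flips
    return log


log = run_detections(stuck={0x20}, transient_flips={0x10}, rounds=3)
```

Here the soft failure at `0x10` is logged once and then disappears, whereas the hard failure at `0x20` is logged in every round, which is why a repeated address points to a hard failure.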

[0086] 206: If no identical physical addresses exist in the memory, convert the physical addresses in the memory at which data errors occurred into patrol addresses;

[0087] When the terminal device starts the Patrol Scrubbing function, it patrols the memory data corresponding to the converted patrol addresses. The terminal device converts the physical addresses in memory at which data errors occurred into patrol addresses so that the patrol can be performed according to them.

[0088] Specifically, converting the physical addresses in the memory at which data errors occurred into patrol addresses includes: the terminal device determines whether a physical address at which a data error occurred is a memory address; if it is, the terminal device reads the DRAM_RULE register to determine the socket where the memory is located; queries the TADO-TAD11 registers to determine the Channel ID; determines the faulty DIMM, the Rank ID and the address within the rank from the RIRWAYNESS register and riri IvXoffset; and obtains the patrol address from the socket ID, Channel ID, DIMM, Rank ID and rank address so obtained. The process by which a terminal device obtains a patrol address from a physical address is well known to those skilled in the art and is not described further in the embodiments of the present invention.
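
The shape of the result of this conversion (socket, channel, DIMM, rank and address within the rank) can be shown with a toy decoder. The fixed bit-field layout below is purely an assumption for illustration; on real hardware the mapping depends on the interleaving configuration held in the registers named in [0088], and this sketch does not read any of them.

```python
from typing import NamedTuple


class PatrolAddress(NamedTuple):
    socket: int
    channel: int
    dimm: int
    rank: int
    rank_address: int


# Hypothetical field widths, lowest bits first. Real platforms derive the
# mapping from register contents, not from a fixed layout like this one.
FIELD_WIDTHS = (28, 1, 1, 2, 1)  # rank_address, rank, dimm, channel, socket


def to_patrol_address(physical: int) -> PatrolAddress:
    """Split a physical address into fields using the assumed layout."""
    fields = []
    for width in FIELD_WIDTHS:
        fields.append(physical & ((1 << width) - 1))  # take the low `width` bits
        physical >>= width                            # then drop them
    rank_address, rank, dimm, channel, socket = fields
    return PatrolAddress(socket, channel, dimm, rank, rank_address)


example = to_patrol_address((1 << 32) | (2 << 30) | (1 << 29) | 0x1234)
```

Under this assumed layout the example address decodes to socket 1, channel 2, DIMM 1, rank 0 and rank address `0x1234`.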

[0089] 如果内存中不存在相同的物理地址,则该内存中存储的物理地址对应的单比特数据错误的类型可能是软失效也可能是硬失效。 [0089] If the same physical memory address does not exist, the single-bit physical address of the data stored in memory corresponding to the type of error may be hard to soft errors may also fail.

[0090] 207:根据所述巡检地址对应的所述内存中的数据进行巡检; [0090] 207: for inspection according to the inspection data to the memory address corresponding to;

[0091] 具体地,终端设备停止系统自动的巡检清除Patrol Scrubbing,将转换后的巡检地址写入SCRUBADDRESSLO寄存器和SCRUBADDRESSHI寄存器,使能巡检,根据SCRUBADDRESSLO寄存器和SCRUBADDRESSHI寄存器对转换后的巡检地址对应的内存中的数据的巡检。 [0091] Specifically, the terminal device stops automatically clearing patrol Patrol Scrubbing, the inspection converted address register and the write SCRUBADDRESSLO SCRUBADDRESSHI register to enable inspection, according SCRUBADDRESSLO SCRUBADDRESSHI registers and registers the converted inspection of inspection data corresponding to the address in memory.

[0092] 在巡检过程中,如果巡检地址对应的内存数据存在数据错误,对该巡检地址对应的内存中的数据进行回写;如果巡检地址对应的内存中数据正确,则不对该数据进行任何处理。 [0092] In the inspection process, if the memory inspection data corresponding to the address data errors, the inspection data corresponding to the address in write-back memory; a data memory corresponding to the address if the inspection is correct, this will not any data processing.

[0093] Steps 206-207 are the process of performing a memory patrol if no identical physical addresses exist in the memory.

[0094] 208: After the patrol ends, determine whether the single-bit data error in the memory has been repaired; if so, perform step 209; if not, perform step 210;

[0095] After the terminal device finishes patrolling the data at the patrol address, it reads the flag bit in the ECC register that indicates whether a single-bit data error exists. If the flag bit indicates that the data at the patrol address contains an error, the single-bit data error in the memory has not been repaired; if the flag bit indicates that the data contains no error, the single-bit data error has been repaired.

[0096] 209: If the single-bit data error has been repaired, determine that its failure type is a soft failure, and end;

[0097] If the flag bit in the ECC register shows that the single-bit data error in the memory has been repaired, the error was corrected while the terminal device ran the Demand Scrubbing function, and the failure type of the single-bit data error is determined to be a soft failure.

[0098] 210: If the single-bit data error has not been repaired, determine that its failure type is a hard failure;

[0099] If the flag bit in the ECC register shows that the single-bit data error in the memory has not been repaired, the error was not corrected while the terminal device ran the Demand Scrubbing function, and the failure type of the single-bit data error is determined to be a hard failure.
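The soft/hard distinction in steps 208-210 rests on whether a write-back actually sticks. The sketch below simulates that idea under stated assumptions: `SimulatedMemory`, its `stuck_bits` fault model, and `classify_after_patrol` are all hypothetical names, the `correct_value` argument stands in for the ECC-corrected data, and a re-read mismatch stands in for the ECC register flag bit. Real hardware exposes none of this as Python objects.

```python
class SimulatedMemory:
    """Toy memory model; a stuck-bit mask imitates a hard (physical) fault."""

    def __init__(self):
        self.cells = {}
        self.stuck_bits = {}  # address -> (mask, forced bit values)

    def write(self, address, value):
        # Bits covered by the mask ignore the written value (stuck-at fault).
        mask, forced = self.stuck_bits.get(address, (0, 0))
        self.cells[address] = (value & ~mask) | (forced & mask)

    def read(self, address):
        return self.cells[address]


def classify_after_patrol(memory, address, correct_value):
    """Patrol one address: write back on error, re-read, then classify."""
    if memory.read(address) == correct_value:
        return "no error"
    memory.write(address, correct_value)  # write-back of the corrected data
    if memory.read(address) == correct_value:
        return "soft"  # error repaired by the write-back
    return "hard"      # error persists after the write-back


mem = SimulatedMemory()

# Soft failure: a transient bit flip that a write-back repairs.
mem.write(0x10, 0b1010)
mem.cells[0x10] ^= 0b0001
soft_result = classify_after_patrol(mem, 0x10, 0b1010)

# Hard failure: bit 0 is stuck at 1, so the write-back cannot repair it.
mem.stuck_bits[0x20] = (0b0001, 0b0001)
mem.write(0x20, 0b1010)
hard_result = classify_after_patrol(mem, 0x20, 0b1010)
```

The model is deliberately small: it only shows why a repaired error is read as transient and a persisting one as a physical defect.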

[0100] 211: When the failure type of the memory is determined to be a hard failure, obtain the physical address of the single-bit data error whose failure type is hard failure;

[0101] When the failure type of the memory is determined to be a hard failure, the terminal device may obtain the physical address of the hard-failure single-bit data error in either of the following ways:

[0102] (1) When the terminal device determines that the failure type of the memory is a hard failure by detecting identical physical addresses in the memory, the terminal device reads that identical physical address directly;

[0103] (2) When the terminal device patrols the data at the patrol address and, after the patrol, determines from the flag bit in the ECC register that the failure type of the memory is a hard failure, the terminal device obtains the physical address of the data corresponding to the hard failure from the mcelog file of the operating system (OS).

[0104] 212: Trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the single-bit data error whose failure type is hard failure.

[0105] Preferably, after obtaining the physical address of the hard-failure single-bit data error, the terminal device displays that physical address on the display screen and triggers an alarm, so that the user, on learning this information, replaces the memory corresponding to that physical address. This prevents hard-failure single-bit data errors from accumulating and hanging the system, and prevents memory problems from erupting in large numbers when boards are reset or upgraded in bulk.

[0106] Steps 204-212 are the process of, if the preset counter overflows, determining the failure type of the memory according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently.
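The branch structure of the classification and alarm steps can be summarized as a short sketch. It is an illustration under assumptions, not the patented implementation: `diagnose`, `addresses_to_alarm` and the `patrol_repairs` callable are hypothetical names, and the callable abstracts away the patrol and the ECC flag check by simply returning whether the error at an address was repaired.

```python
from collections import Counter


def diagnose(logged_addresses, patrol_repairs):
    """Map logged physical addresses to a failure type, 'hard' or 'soft'.

    patrol_repairs(address) -> bool is a hypothetical hook that patrols
    the address and reports whether the data error was repaired.
    """
    counts = Counter(logged_addresses)
    repeated = {addr for addr, n in counts.items() if n >= 2}
    if repeated:
        # Identical addresses exist: those addresses are hard failures.
        return {addr: "hard" for addr in repeated}
    # No identical addresses: patrol each one and classify by the outcome.
    return {addr: ("soft" if patrol_repairs(addr) else "hard") for addr in counts}


def addresses_to_alarm(diagnosis):
    """Return the hard-failure addresses that should trigger an alarm."""
    return sorted(addr for addr, kind in diagnosis.items() if kind == "hard")


# No address repeats here, so both are patrolled; only 0x100 gets repaired.
result = diagnose([0x100, 0x200], patrol_repairs=lambda addr: addr == 0x100)
alarm_list = addresses_to_alarm(result)
```

Passing the patrol as a callable keeps the decision logic separate from any platform-specific register access, which the text leaves to the memory controller.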

[0107] It should be noted that the method in this embodiment may also be executed by the memory controller in the terminal device.

[0108] The data error repairing method provided by this embodiment determines whether a preset counter in the memory overflows, the preset counter being used to count data errors occurring in the memory; if the preset counter overflows, the failure type of the memory is determined according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently. With the technical solution provided by this embodiment, the failure type of the memory can be effectively distinguished and repaired accordingly, which avoids system hangs or boot failures caused by accumulated data errors and ensures that services run normally.

[0109] FIG. 3 is a schematic structural diagram of a data error repairing apparatus provided in an embodiment of the present invention. Referring to FIG. 3, the apparatus includes:

[0110] a judging module 301, configured to determine whether a preset counter in the memory overflows, the preset counter being used to count data errors occurring in the memory;

[0111] a determining module 302, configured to, if the preset counter overflows, determine the failure type of the memory according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently.

[0112] The determining module 302 is configured to, if identical physical addresses exist among the physical addresses of data errors stored in the memory, determine that the failure type of the data error corresponding to the identical physical addresses is a hard failure.

[0113] The determining module 302 includes:

[0114] a patrol unit, configured to perform a memory patrol if no identical physical addresses exist among the physical addresses of data errors stored in the memory;

[0115] a judging unit, configured to determine, after the patrol ends, whether the data error in the memory has been repaired;

[0116] a determining unit, configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired;

[0117] the determining unit being further configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.

[0118] The patrol unit includes:

[0119] a patrol address conversion subunit, configured to convert the physical address in the preset counter into a patrol address if no identical physical addresses exist among the physical addresses of data errors stored in the memory;

[0120] a patrol subunit, configured to patrol the data in the memory corresponding to the patrol address.

[0121] The apparatus further includes:

[0122] a hard-failure physical address obtaining module 303, configured to obtain, when the failure type of the memory is determined to be a hard failure, the physical address of the data error whose failure type is hard failure;

[0123] a triggering module 304, configured to trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is hard failure.

[0124] The apparatus further includes:

[0125] a data error physical address obtaining module 305, configured to obtain, when a data error occurs in the memory, the physical address at which the data error occurred;

[0126] a storage module 306, configured to store the physical address at which the data error occurred in the memory;

[0127] a write-back module 307, configured to perform data write-back to the physical address at which the data error occurred.
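The interplay of the address obtaining module 305, the storage module 306 and the judging module 301 can be illustrated with a small recorder class. This is a sketch under assumptions: the class name `ErrorRecorder` and its `threshold` parameter are hypothetical, "overflow" is modelled as the count exceeding that threshold (the text does not specify the counter width), and the data write-back of module 307 is not modelled.

```python
class ErrorRecorder:
    """Stores error addresses and counts errors against a preset limit."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed capacity of the preset counter
        self.count = 0
        self.addresses = []

    def record(self, physical_address):
        """Log one data error; return True once the counter has overflowed."""
        self.addresses.append(physical_address)  # role of storage module 306
        self.count += 1
        return self.overflowed()

    def overflowed(self):
        """Role of judging module 301: has the preset counter overflowed?"""
        return self.count > self.threshold


recorder = ErrorRecorder(threshold=2)
flags = [recorder.record(addr) for addr in (0x1000, 0x2000, 0x1000)]
```

Once `record` returns True, the stored addresses are what the determining module 302 would examine to decide the failure type.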

[0128] The data error repairing apparatus provided by this embodiment determines whether a preset counter in the memory overflows, the preset counter being used to count data errors occurring in the memory; if the preset counter overflows, the failure type of the memory is determined according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently. With the technical solution provided by this embodiment, the failure type of the memory can be effectively distinguished and repaired accordingly, which avoids system hangs or boot failures caused by accumulated data errors and ensures that services run normally.

[0129] It should be noted that, when the data error repairing apparatus provided by the above embodiment repairs data errors, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or some of the functions described above. In addition, the data error repairing apparatus provided by the above embodiment is based on the same concept as the data error repairing method embodiment; for its specific implementation, see the method embodiment, which is not repeated here.

[0130] FIG. 4 is a schematic structural diagram of a data error repairing device provided in an embodiment of the present invention. Referring to FIG. 4, the data error repairing device includes a processor 401 and a memory 402.

[0131] The processor 401 is configured to determine whether a preset counter in the memory overflows, the preset counter being used to count data errors occurring in the memory 402;

[0132] the processor 401 is configured to, if the preset counter overflows, determine the failure type of the memory 402 according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently.

[0133] The memory 402 is configured to store data and the physical addresses at which data errors occur.

[0134] The processor 401 is configured to, if identical physical addresses exist among the physical addresses of data errors stored in the memory 402, determine that the failure type of the data error corresponding to the identical physical addresses is a hard failure.

[0135] The processor 401 is configured to perform a memory patrol if no identical physical addresses exist among the physical addresses of data errors stored in the memory 402;

[0136] the processor 401 is configured to determine, after the patrol ends, whether the data error in the memory 402 has been repaired;

[0137] the processor 401 is configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired;

[0138] the processor 401 is configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.

[0139] The processor 401 is configured to convert the physical address in the preset counter into a patrol address if no identical physical addresses exist among the physical addresses of data errors stored in the memory 402;

[0140] the processor 401 is configured to patrol the data in the memory 402 corresponding to the patrol address.

[0141] The processor 401 is configured to obtain, when the failure type of the memory 402 is determined to be a hard failure, the physical address of the data error whose failure type is hard failure;

[0142] the processor 401 is configured to trigger an alarm to prompt the user to replace the memory 402 corresponding to the physical address of the data error whose failure type is hard failure.

[0143] The processor 401 is configured to obtain, when a data error occurs in the memory 402, the physical address at which the data error occurred;

[0144] the processor 401 is configured to store the physical address at which the data error occurred in the memory 402 and to perform data write-back to that physical address.

[0145] The data error repairing device provided by this embodiment determines whether a preset counter in the memory overflows, the preset counter being used to count data errors occurring in the memory; if the preset counter overflows, the failure type of the memory is determined according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently. With the technical solution provided by this embodiment, the failure type of the memory can be effectively distinguished and repaired accordingly, which avoids system hangs or boot failures caused by accumulated data errors and ensures that services run normally.

[0146] A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

[0147] The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (13)

1. A data error repairing method, characterized in that the method comprises: determining whether a preset counter in a memory overflows, the preset counter being used to count data errors occurring in the memory; and if the preset counter overflows, determining a failure type of the memory according to physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently.
2. The method according to claim 1, characterized in that, if the preset counter overflows, determining the failure type of the memory according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently, comprises: if identical physical addresses exist among the physical addresses of data errors stored in the memory, determining that the failure type of the data error corresponding to the identical physical addresses is a hard failure.
3. The method according to claim 1, characterized in that, if the preset counter overflows, determining the failure type of the memory according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently, comprises: if no identical physical addresses exist among the physical addresses of data errors stored in the memory, performing a memory patrol; after the patrol ends, determining whether the data error in the memory has been repaired; if the data error has not been repaired, determining that the failure type of the data error is a hard failure; and if the data error has been repaired, determining that the failure type of the data error is a soft failure.
4. The method according to claim 3, characterized in that performing a memory patrol if no identical physical addresses exist among the physical addresses of data errors stored in the memory comprises: if no identical physical addresses exist among the physical addresses of data errors stored in the memory, converting the physical address in the preset counter into a patrol address; and patrolling the data in the memory corresponding to the patrol address.
5. The method according to claim 1, characterized in that, after determining, if the preset counter overflows, the failure type of the memory according to the physical addresses of data errors stored in the memory so that corresponding repair can be performed subsequently, the method further comprises: when the failure type of the memory is determined to be a hard failure, obtaining the physical address of the data error whose failure type is hard failure; and triggering an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is hard failure.
6. The method according to claim 1, characterized in that, before determining whether the preset counter in the memory overflows, the method further comprises: when a data error occurs in the memory, obtaining the physical address at which the data error occurred; and storing the physical address at which the data error occurred in the memory, and performing data write-back to that physical address.
7. A data error repairing apparatus, characterized in that the apparatus comprises: a judging module, configured to determine whether a preset counter in a memory overflows, the preset counter being used to count data errors occurring in the memory; and a determining module, configured to, if the preset counter overflows, determine a failure type of the memory according to physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently.
8. The apparatus according to claim 7, characterized in that the determining module is configured to, if identical physical addresses exist among the physical addresses of data errors stored in the memory, determine that the failure type of the data error corresponding to the identical physical addresses is a hard failure.
9. The apparatus according to claim 7, characterized in that the determining module comprises: a patrol unit, configured to perform a memory patrol if no identical physical addresses exist among the physical addresses of data errors stored in the memory; a judging unit, configured to determine, after the patrol ends, whether the data error in the memory has been repaired; and a determining unit, configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired; the determining unit being further configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.
10. The apparatus according to claim 9, characterized in that the patrol unit comprises: a patrol address conversion subunit, configured to convert the physical address in the preset counter into a patrol address if no identical physical addresses exist among the physical addresses of data errors stored in the memory; and a patrol subunit, configured to patrol the data in the memory corresponding to the patrol address.
11. The apparatus according to claim 7, characterized in that the apparatus further comprises: a hard-failure physical address obtaining module, configured to obtain, when the failure type of the memory is determined to be a hard failure, the physical address of the data error whose failure type is hard failure; and a triggering module, configured to trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is hard failure.
12. The apparatus according to claim 7, characterized in that the apparatus further comprises: a data error physical address obtaining module, configured to obtain, when a data error occurs in the memory, the physical address at which the data error occurred; a storage module, configured to store the physical address at which the data error occurred in the memory; and a write-back module, configured to perform data write-back to the physical address at which the data error occurred.
13. A data error repairing device, characterized in that the device comprises: a memory, configured to store data and the physical addresses at which data errors occur; and a processor, configured to determine whether a preset counter in the memory overflows, the preset counter being used to count data errors occurring in the memory; the processor being further configured to, if the preset counter overflows, determine a failure type of the memory according to the physical addresses of data errors stored in the memory, so that corresponding repair can be performed subsequently.
CN201310105316.2A 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment CN103218275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310105316.2A CN103218275B (en) 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment

Publications (2)

Publication Number Publication Date
CN103218275A true CN103218275A (en) 2013-07-24
CN103218275B CN103218275B (en) 2015-11-25

Family

ID=48816095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310105316.2A CN103218275B (en) 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment

Country Status (1)

Country Link
CN (1) CN103218275B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020016942A1 (en) * 2000-01-26 2002-02-07 Maclaren John M. Hard/soft error detection
US20110047408A1 (en) * 2009-08-20 2011-02-24 Arm Limited Handling of hard errors in a cache of a data processing apparatus
CN102968353A (en) * 2012-10-26 2013-03-13 华为技术有限公司 Fail address processing method and fail address processing device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750591A (en) * 2013-12-30 2015-07-01 上海威亿实业有限公司 Evidence-taking device and method for computer
CN104077375A (en) * 2014-06-24 2014-10-01 华为技术有限公司 Method for handling error catalogs of nodes in CC-NUMA system and nodes
US9652407B2 (en) 2014-06-24 2017-05-16 Huawei Technologies Co., Ltd. Method for processing error directory of node in CC-NUMA system, and node
CN106569734A (en) * 2015-10-12 2017-04-19 北京国双科技有限公司 Method and device for repairing memory overflow during data shuffling
CN106569734B (en) * 2015-10-12 2019-04-09 北京国双科技有限公司 The restorative procedure and device that memory overflows when data are shuffled
CN105426288A (en) * 2015-11-10 2016-03-23 浪潮电子信息产业股份有限公司 Optimization method of memory alarm
CN106445720A (en) * 2016-10-11 2017-02-22 郑州云海信息技术有限公司 Memory error recovery method and device
WO2019061517A1 (en) * 2017-09-30 2019-04-04 华为技术有限公司 Memory fault detection method and device, and server

Also Published As

Publication number Publication date
CN103218275B (en) 2015-11-25

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model