CN103092739A - Memory error checking and correcting (ECC) error reporting and alarm mechanism - Google Patents

Memory error checking and correcting (ECC) error reporting and alarm mechanism

Info

Publication number
CN103092739A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100188001A
Other languages
Chinese (zh)
Inventor
张燕群
李博乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd

Landscapes

  • Debugging And Monitoring (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention provides a memory error checking and correcting (ECC) error reporting and alarm mechanism and belongs to the field of computer technology. The mechanism is implemented on an Intel Boxboro-EX platform server and works as follows: when the server is running under high load and a memory error triggers the ECC error correction mechanism, the basic input/output system (BIOS) sets a counter that records the number of errors reported within a given period, and uses that count to assess the risk level of a system failure. When the risk level is low, the error information is recorded and no alarm is triggered; when the risk level is high, the error information is recorded, an alarm is triggered, and the user is reminded to maintain the system promptly. Compared with the prior art, the mechanism helps eliminate faults promptly and keeps the system in a healthy state.

Description

A memory ECC error reporting and alarm mechanism
Technical field
The present invention relates to the field of computer technology, and specifically to a memory ECC error reporting and alarm mechanism that assesses the risk level of memory errors and thereby facilitates system maintenance.
Background art
The existing alarm mechanism for memory ECC errors does not distinguish between risk levels: as soon as an ECC error occurs, the baseboard management controller (BMC) triggers an alarm immediately. This leaves customers with a bad impression and increases the maintenance burden on the server. Sporadic errors can be corrected by the memory itself and have a negligible effect on the system as a whole; for the system, the risk level of such errors is extremely low, and no alarm needs to be triggered. When ECC errors occur in large numbers within a period of time, however, some component of the system may already be operating in a high-risk state, and continued operation may seriously affect system stability. In this state a timely alarm is necessary: it helps eliminate the fault promptly and keeps the system healthy.
Summary of the invention
The technical task of the present invention is to remedy the deficiencies of the prior art by providing a memory ECC error reporting and alarm mechanism that assesses the risk level of memory errors.
The technical solution of the present invention is realized as follows. The memory ECC error reporting and alarm mechanism comprises an Intel Boxboro-EX platform server, and its specific implementation steps are: when the server is running under high load and a memory error triggers the ECC error correction mechanism, the BIOS sets a counter that records the number of errors reported within a given period and assesses the risk level of a system failure at the time of the error. At a low risk level, the error information is recorded and no alarm is triggered; at a high risk level, the error information is recorded, an alarm is triggered, and the user is reminded to maintain the system promptly.
The detailed steps by which the BIOS assesses the risk level of a system failure are as follows. The BIOS sets an error counter and, at the same time, a threshold N for the number of errors, and records the number n of ECC errors reported within a fixed period T. If the number of errors n within period T does not reach the threshold N, i.e. n < N, the BIOS notifies the BMC to record the error information only, without triggering an alarm. If the number of errors n within period T reaches or exceeds the threshold N, i.e. n ≥ N, the BIOS sends the error information to the BMC and notifies the BMC to record it and, at the same time, to trigger an alarm reminding the user that the system has a fault, so that the user can carry out maintenance promptly.
When the number of errors n within period T does not reach the threshold N, the counter is reset to zero and counting restarts after the BIOS has notified the BMC to record the error information.
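The counting scheme described above can be sketched in Python as follows. This is a minimal illustration rather than firmware code: the names `EccErrorCounter`, `Risk`, `threshold_n` and `period_t` are hypothetical, and timestamps are passed in explicitly. Two points the text leaves open are filled with assumptions: the high-risk result is returned as soon as n reaches N within the current period, and the counter is also cleared after a high-risk result.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"    # record the error information only
    HIGH = "high"  # record the error information and trigger an alarm


@dataclass
class EccErrorCounter:
    """Fixed-period counter comparing n errors in period T against threshold N."""
    threshold_n: int           # N: error count that marks high risk
    period_t: float            # T: length of the counting period, in seconds
    window_start: float = 0.0  # start time of the current period
    count: int = 0             # n: errors seen in the current period

    def record_error(self, now: float) -> Risk:
        # The previous period ended with n < N: clear the counter and restart.
        if now - self.window_start >= self.period_t:
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count >= self.threshold_n:
            # Assumption: counting also restarts after a high-risk result.
            self.window_start = now
            self.count = 0
            return Risk.HIGH
        return Risk.LOW


counter = EccErrorCounter(threshold_n=3, period_t=60.0)
# Sporadic errors, more than T apart: each one is low risk.
sparse = [counter.record_error(t) for t in (0.0, 100.0, 200.0)]
# Three errors within one period: the third reaches N.
burst = [counter.record_error(t) for t in (300.0, 301.0, 302.0)]
```

With N = 3 and T = 60 s, the three widely spaced errors each fall into a fresh period and stay low risk, while the burst of three errors within a few seconds reaches the threshold on the third error.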
The beneficial effects of the present invention compared with the prior art are:
By assessing the risk level of memory errors, the memory ECC error reporting and alarm mechanism of the present invention only monitors low-risk errors without raising an alarm, and monitors high-risk errors while also triggering an alarm. This reduces the number of maintenance interventions, extends the system's operating cycle, helps eliminate faults promptly, and keeps the system healthy.
Description of drawings
Figure 1 is a block diagram of the implementation of the ECC alarm mechanism of the present invention.
Embodiment
The memory ECC error reporting and alarm mechanism of the present invention is described in detail below with reference to the accompanying drawing.
As shown in Figure 1, a memory ECC error reporting and alarm mechanism is provided. It comprises an Intel Boxboro-EX platform server, and its specific implementation steps are: when the server is running under high load and a memory error triggers the ECC error correction mechanism, the BIOS sets a counter that records the number of errors reported within a given period and assesses the risk level of a system failure at the time of the error. At a low risk level, the error information is recorded and no alarm is triggered; at a high risk level, the error information is recorded, an alarm is triggered, and the user is reminded to maintain the system promptly.
The detailed steps by which the BIOS assesses the risk level of a system failure are as follows. The BIOS sets an error counter and, at the same time, a threshold N for the number of errors, and records the number n of ECC errors reported within a fixed period T. If the number of errors n within period T does not reach the threshold N, i.e. n < N, the ECC errors are only occasional and the memory is fully capable of correcting them. In this situation there is essentially no impact on system performance or stability, and the risk level is extremely low. The BIOS merely passes the ECC error information to the BMC and notifies the BMC to record it without triggering an alarm; the BIOS then resets the counter to zero and restarts counting. If the number of errors n within period T reaches or exceeds the threshold N, i.e. n ≥ N, the memory has been reporting errors frequently within that period. The memory may still be able to complete error correction, but system performance is affected; SDDC (single device data correction) or DDDC (double device data correction) may even have been triggered, or a memory module may have failed. In this situation the system is already operating in an abnormal state or with reduced performance, and continued operation may lead to a crash or other unpredictable consequences. The BIOS therefore sends the error information to the BMC and notifies the BMC to record it and, at the same time, to trigger an alarm reminding the user that the system has a fault, so that the user can carry out maintenance promptly.
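The division of labour between BIOS and BMC in this embodiment can be illustrated with a small Python sketch. It is only an assumption-laden model: the text does not specify the message format or transport between BIOS and BMC, so `EccReport`, `BmcStub`, the `dimm` field and the alarm text are all hypothetical. What the sketch preserves is the behaviour described above: the BMC records every report, and raises an alarm only when n ≥ N.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EccReport:
    """Hypothetical message the BIOS sends to the BMC for one counting period."""
    dimm: str         # illustrative location of the affected memory module
    error_count: int  # n: errors counted in period T
    threshold: int    # N: configured threshold

    @property
    def high_risk(self) -> bool:
        # High risk when the count reaches or exceeds the threshold (n >= N).
        return self.error_count >= self.threshold


@dataclass
class BmcStub:
    """Stand-in for the BMC: always records, alarms only on high-risk reports."""
    event_log: list = field(default_factory=list)
    alarms: list = field(default_factory=list)

    def receive(self, report: EccReport) -> None:
        # Both branches record the error information.
        self.event_log.append(report)
        if report.high_risk:
            self.alarms.append(
                f"ECC alarm: {report.dimm} reported {report.error_count} errors"
            )


bmc = BmcStub()
bmc.receive(EccReport(dimm="DIMM_A1", error_count=2, threshold=10))   # n < N
bmc.receive(EccReport(dimm="DIMM_A1", error_count=14, threshold=10))  # n >= N
```

After the two calls the stub's event log holds both reports, but only the second one has produced an alarm entry.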

Claims (3)

1. A memory ECC error reporting and alarm mechanism, characterized in that it comprises an Intel Boxboro-EX platform server, and its specific implementation steps are: when the server is running under high load and a memory error triggers the ECC error correction mechanism, the BIOS sets a counter that records the number of errors reported within a given period and assesses the risk level of a system failure at the time of the error; at a low risk level, the error information is recorded and no alarm is triggered; at a high risk level, the error information is recorded, an alarm is triggered, and the user is reminded to maintain the system promptly.
2. The memory ECC error reporting and alarm mechanism according to claim 1, characterized in that the detailed steps by which the BIOS assesses the risk level of a system failure are: the BIOS sets an error counter and, at the same time, a threshold N for the number of errors, and records the number n of ECC errors reported within a fixed period T; if the number of errors n within period T does not reach the threshold N, i.e. n < N, the BIOS notifies the BMC to record the error information only, without triggering an alarm; if the number of errors n within period T reaches or exceeds the threshold N, i.e. n ≥ N, the BIOS sends the error information to the BMC and notifies the BMC to record it and, at the same time, to trigger an alarm reminding the user that the system has a fault, so that the user can carry out maintenance promptly.
3. The memory ECC error reporting and alarm mechanism according to claim 2, characterized in that: when the number of errors n within period T does not reach the threshold N, the counter is reset to zero and counting restarts after the BIOS has notified the BMC to record the error information.
CN2013100188001A 2013-01-18 2013-01-18 Memory error checking and correcting (ECC) error reporting and alarm mechanism Pending CN103092739A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100188001A CN103092739A (en) 2013-01-18 2013-01-18 Memory error checking and correcting (ECC) error reporting and alarm mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013100188001A CN103092739A (en) 2013-01-18 2013-01-18 Memory error checking and correcting (ECC) error reporting and alarm mechanism

Publications (1)

Publication Number Publication Date
CN103092739A true CN103092739A (en) 2013-05-08

Family

ID=48205341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100188001A Pending CN103092739A (en) 2013-01-18 2013-01-18 Memory error checking and correcting (ECC) error reporting and alarm mechanism

Country Status (1)

Country Link
CN (1) CN103092739A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4873687A (en) * 1987-10-05 1989-10-10 Ibm Corporation Failing resource manager in a multiplex communication system
CN1570859A (en) * 2003-07-16 2005-01-26 联想(北京)有限公司 Design method for avoiding misuse of non-ECC memory
CN101539881A (en) * 2008-03-18 2009-09-23 环达电脑(上海)有限公司 Device and method for detecting memory errors
CN102135925A (en) * 2010-12-27 2011-07-27 西安锐信科技有限公司 Method and device for detecting error check and correcting memory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田昕: "计算机硬件设备故障管理机制研究", 《中国优秀硕士学位论文全文库》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605597A (en) * 2013-11-20 2014-02-26 中国科学院数据与通信保护研究教育中心 Configurable computer protection system and method
CN103605597B (en) * 2013-11-20 2017-01-18 中国科学院数据与通信保护研究教育中心 Configurable computer protection system and method
CN104268052A (en) * 2014-10-21 2015-01-07 浪潮电子信息产业股份有限公司 Memory Rank Spare testing method based on ITP tool
CN104317690A (en) * 2014-10-21 2015-01-28 浪潮电子信息产业股份有限公司 Memory Demand Scrub testing method based on ITP (integration test platform) tool
CN104317690B (en) * 2014-10-21 2016-01-27 浪潮电子信息产业股份有限公司 A kind of Memory Demand Scrub method of testing based on ITP instrument
CN104268052B (en) * 2014-10-21 2016-02-03 浪潮电子信息产业股份有限公司 A kind of Memory Rank Spare method of testing based on ITP instrument
CN105117301B (en) * 2015-08-14 2018-08-14 杭州华为数字技术有限公司 A kind of method and device of memory early warning
CN105117301A (en) * 2015-08-14 2015-12-02 杭州华为数字技术有限公司 Memory warning method and apparatus
CN105426288A (en) * 2015-11-10 2016-03-23 浪潮电子信息产业股份有限公司 Optimization method of memory alarm
CN105589789A (en) * 2015-12-25 2016-05-18 浪潮电子信息产业股份有限公司 Method for dynamically adjusting memory monitoring threshold value
US11354183B2 (en) 2017-09-18 2022-06-07 Huawei Technologies Co., Ltd. Memory evaluation method and apparatus
WO2019052208A1 (en) * 2017-09-18 2019-03-21 华为技术有限公司 Method and apparatus for memory evaluation
CN109522175A (en) * 2017-09-18 2019-03-26 华为技术有限公司 A kind of method and device of memory assessment
US11868201B2 (en) 2017-09-18 2024-01-09 Huawei Technologies Co., Ltd. Memory evaluation method and apparatus
CN107608634A (en) * 2017-09-25 2018-01-19 四川长虹电器股份有限公司 Android system spatial processing method
CN109101377A (en) * 2018-07-18 2018-12-28 郑州云海信息技术有限公司 A kind of test method of memory SDDC
CN109213038A (en) * 2018-08-30 2019-01-15 武汉携康智能健康设备有限公司 A kind of archives intelligent and safe management system
CN110008056A (en) * 2019-03-28 2019-07-12 联想(北京)有限公司 EMS memory management process, device, electronic equipment and computer readable storage medium
CN110008090A (en) * 2019-04-15 2019-07-12 苏州浪潮智能科技有限公司 A kind of method, apparatus and computer readable storage medium monitoring EMS memory error
CN110308362A (en) * 2019-04-16 2019-10-08 惠科股份有限公司 Detection circuit and display panel
CN110032869A (en) * 2019-04-19 2019-07-19 湖南科技学院 A kind of cloud computing protection early warning system based on big data
CN110032869B (en) * 2019-04-19 2022-08-09 湖南科技学院 Cloud computing protection early warning system based on big data
CN110780646A (en) * 2019-09-21 2020-02-11 苏州浪潮智能科技有限公司 Memory quality early warning method based on MES system
CN110781027A (en) * 2019-10-29 2020-02-11 苏州浪潮智能科技有限公司 Method, device and equipment for determining error reporting threshold of memory ECC (error correction code)
CN110781027B (en) * 2019-10-29 2023-01-10 苏州浪潮智能科技有限公司 Method, device and equipment for determining error reporting threshold of memory ECC (error correction code)
CN111209129A (en) * 2019-12-27 2020-05-29 曙光信息产业股份有限公司 Memory optimization method and device based on AMD platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130508
