CN109408273A - Method and device for eliminating the influence of faulty memory on a system - Google Patents

Info

Publication number: CN109408273A
Authority: CN (China)
Prior art keywords: failure, memory, physical page, recovery, operating system
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201811348057.5A
Other languages: Chinese (zh)
Inventor: 刘浩君
Current Assignee: Zhengzhou Yunhai Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2018-11-13 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2018-11-13
Publication date: 2019-03-01
Application filed by: Zhengzhou Yunhai Information Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions
    • G06F11/0751: Error or fault detection not based on redundancy

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and device for eliminating the influence of faulty memory on a system. When a memory fault occurs, the MCA Recovery process is started, a fault log is generated, and the affected physical page is captured; the MCA Recovery process then restores normal execution of the operating system according to the fault log and the captured physical page. When a memory fault is detected, the MCA Recovery process runs automatically. It gives the server fault tolerance by allowing the system to keep running after an uncorrected error is detected, so the system does not go down and the server's services keep running normally. The fault information is also recorded, so that the faulty part can be replaced and serviced while the server is idle, which improves product quality and stability.

Description

Method and device for eliminating the influence of faulty memory on a system
Technical field
The invention relates to the field of memory faults, and in particular to a method and device for eliminating the influence of faulty memory on a system.
Background
Memory is extremely important to server stability, yet it frequently develops faults in use that bring the system down. Usually the server must be restarted or the memory replaced before the problem is solved and the system runs normally again. This seriously disrupts server operation and the services that depend on it.
Summary of the invention
To solve the above problem, the invention provides a method and device for eliminating the influence of faulty memory on a system, so that the system does not go down and the server's services keep running normally.
The technical solution of the invention is a method for eliminating the influence of faulty memory on a system, comprising:
initializing the operating system;
detecting whether a memory fault has occurred;
when a memory fault occurs, starting the MCA Recovery process, generating a fault log, and capturing the affected physical page;
restoring, by the MCA Recovery process, normal execution of the operating system according to the fault log and the captured affected physical page.
Further, restoring normal execution of the operating system according to the fault log and the captured mapping of the affected physical page comprises:
the MCA Recovery process notifying the operating system of the fault information;
the operating system analyzing the fault log and verifying whether recovery is feasible;
if recovery is feasible, taking the affected physical page offline, loading that physical page into a new physical page, and restoring normal execution of the operating system.
Further, the operating system analyzing the fault log and verifying whether recovery is feasible comprises:
checking the cause of the memory fault and judging whether the cause meets the recovery condition; if it does, recovery is feasible.
Further, initializing the operating system comprises:
enabling the operating system's Poison mode;
starting the MCELOG daemon;
initializing EIGN.
The invention also discloses a device for eliminating the influence of faulty memory on a system, comprising:
Initialization module: for initializing the operating system;
Fault detection module: for detecting whether a memory fault has occurred;
Fault response module: for starting the MCA Recovery process, generating a fault log, and capturing the affected physical page when a memory fault occurs;
Operating system recovery module: for the MCA Recovery process to restore normal execution of the operating system according to the fault log and the captured affected physical page.
Further, the operating system recovery module comprises:
Fault information notification submodule: for the MCA Recovery process to notify the operating system of the fault information;
Fault log analysis submodule: for the operating system to analyze the fault log and verify whether recovery is feasible;
Recovery execution submodule: for, when recovery is feasible, taking the affected physical page offline, loading that physical page into a new physical page, and restoring normal execution of the system.
Further, the fault log analysis submodule is specifically used to check the cause of the memory fault and judge whether the cause meets the recovery condition; if it does, recovery is feasible.
Further, the initialization module comprises:
Poison mode enabling submodule: for enabling the operating system's Poison mode;
MCELOG daemon starting submodule: for starting the MCELOG daemon;
EIGN initialization submodule: for initializing EIGN.
With the method and device provided by the invention, the MCA Recovery process runs automatically when a memory fault is detected. The process gives the server fault tolerance by allowing the system to keep running after an uncorrected error is detected, so the system does not go down and the server's services keep running normally. The fault information is also recorded, so that the faulty part can be replaced and serviced while the server is idle, which improves product quality and stability.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of Embodiment 1 of the invention.
Detailed description
The invention is described in detail below with reference to the drawing and specific embodiments. The following embodiments explain the invention, and the invention is not limited to them.
The "MCA Recovery" (Machine Check Architecture Recovery) feature gives the server "fault tolerance": it allows the system to keep running when an uncorrected error is detected. This application uses MCA Recovery to eliminate the influence of faulty memory on the system and so avoid system downtime when memory fails.
Embodiment 1
As shown in Fig. 1, this embodiment provides a method for eliminating the influence of faulty memory on a system, which comprises the following steps:
S1-1: initialize the operating system.
This step comprises the following initialization operations:
S1-11: enable the operating system's Poison mode;
S1-12: start the MCELOG daemon;
S1-13: initialize EIGN.
The commands executed in step S1-13 to initialize EIGN (they load the kernel's EINJ error-injection module) are as follows:
Load EINJ (# modprobe einj param_extension=1)
Mount debugfs (# mount -t debugfs none /sys/kernel/debug)
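The initialization steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the helper names `run` and `init_system` and the `DRY_RUN` switch are hypothetical, `mcelog --daemon` is assumed to be the way the MCELOG daemon is started, and step S1-11 is left out because the text does not say how Poison mode is enabled. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of steps S1-12 and S1-13. DRY_RUN defaults to 1, so commands are
# printed rather than executed; set DRY_RUN=0 (as root) to run them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# init_system: start the mcelog daemon, load EINJ, and mount debugfs.
init_system() {
    # S1-11 (Poison mode) is not spelled out in the text and is omitted here.
    run mcelog --daemon || return 1                     # S1-12
    run modprobe einj param_extension=1 || return 1     # S1-13: load EINJ
    # Mount debugfs only if /proc/mounts does not already list it.
    if [ "$DRY_RUN" = "1" ] || ! grep -q ' /sys/kernel/debug debugfs ' /proc/mounts; then
        run mount -t debugfs none /sys/kernel/debug || return 1
    fi
}
```

Calling `init_system` with the default settings prints the three commands in order. On a kernel with EINJ support, the injection interface is expected to appear under /sys/kernel/debug/apei/einj once the module is loaded and debugfs is mounted.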
S1-2: detect whether a memory fault has occurred.
The memory state is monitored in real time; as soon as an abnormality is found, the next step is carried out immediately.
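The text does not say how the memory state is monitored. As one possible illustration, the sketch below assumes the Linux EDAC sysfs counters (`ue_count` per memory controller), which the patent does not name; the function names are hypothetical, and the base directory is a parameter so the logic can be tried on a scratch directory.

```shell
#!/bin/sh
# count_uncorrected [BASE]: sum the ue_count files of all memory controllers
# under BASE (assumed default: the EDAC sysfs directory).
count_uncorrected() {
    base="${1:-/sys/devices/system/edac/mc}"
    total=0
    for f in "$base"/mc*/ue_count; do
        [ -r "$f" ] || continue      # glob did not match, or file unreadable
        n="$(cat "$f")"
        total=$((total + n))
    done
    echo "$total"
}

# memory_fault_detected [BASE]: exit status 0 when at least one uncorrected
# error has been counted, non-zero otherwise.
memory_fault_detected() {
    [ "$(count_uncorrected "$1")" -gt 0 ]
}
```

A monitoring loop could call `memory_fault_detected` periodically and move on to step S1-3 as soon as it succeeds.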
S1-3: when a memory fault occurs, start the MCA Recovery process, generate a fault log, and capture the affected physical page.
This step starts the MCA Recovery process and collects the relevant information, which the system recovery in the next step relies on.
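To make "capturing the affected physical page" concrete, the sketch below derives the page that contains a faulting physical address and appends a record to a fault log. It is a minimal sketch under assumptions: a 4 KiB page size, a hypothetical log line format, and hypothetical function names; the patent does not specify any of these.

```shell
#!/bin/sh
PAGE_SIZE=4096   # assumption: 4 KiB pages

# page_base ADDR: print, in hex, the base address of the page containing ADDR.
page_base() {
    printf '0x%x\n' $(( $1 & ~(PAGE_SIZE - 1) ))
}

# log_fault LOGFILE ADDR CAUSE: append one record naming the affected page.
log_fault() {
    printf '%s addr=%s page=%s cause=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$(page_base "$2")" "$3" >> "$1"
}
```

For example, `page_base 0x12345678` prints `0x12345000`, because the low 12 bits of the address are the offset within a 4 KiB page.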
S1-4: the MCA Recovery process restores normal execution of the operating system according to the fault log and the captured affected physical page.
This step restores normal execution of the operating system through the following operations:
S1-41: the MCA Recovery process notifies the operating system of the fault information;
S1-42: the operating system analyzes the fault log and verifies whether recovery is feasible;
S1-43: if recovery is feasible, the affected physical page is taken offline, that physical page is loaded into a new physical page, and normal execution of the operating system is restored.
In step S1-42, analyzing the fault log and verifying whether recovery is feasible specifically means checking the cause of the memory fault and judging whether the cause meets the recovery condition; if it does, recovery is feasible. In other words, whether the system can be recovered is decided from the cause of the memory fault. For example, if the cause is memory incompatibility or poor contact of a memory module, recovery is feasible and the condition for restoring normal execution of the operating system is met.
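Steps S1-42 and S1-43 can be sketched as a feasibility check followed by a page offline request. The cause labels `incompatible` and `poor_contact` are hypothetical names for the two examples given in the text, and the use of the Linux sysfs file `soft_offline_page` (which takes a physical address and asks the kernel to migrate the page contents and retire the page) is an assumption; the patent does not state which kernel interface is used.

```shell
#!/bin/sh
# recovery_feasible CAUSE: succeed only for causes treated as recoverable.
# The labels are hypothetical; the text names only these two examples.
recovery_feasible() {
    case "$1" in
        incompatible|poor_contact) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed kernel interface for soft-offlining a page; overridable for testing.
OFFLINE_FILE="${OFFLINE_FILE:-/sys/devices/system/memory/soft_offline_page}"

# recover_page CAUSE PHYS_ADDR: request the offline only if recovery is feasible.
recover_page() {
    cause="$1"
    addr="$2"
    if ! recovery_feasible "$cause"; then
        echo "recovery not feasible for cause: $cause" >&2
        return 1
    fi
    # Writing the physical address requests the offline (needs root on a real system).
    echo "$addr" > "$OFFLINE_FILE"
}
```

Pointing `OFFLINE_FILE` at an ordinary file makes it possible to try the decision logic without touching the running system.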
In this embodiment the fault log is also saved. When the server is idle, the log is reviewed and the faulty part is replaced and serviced.
Embodiment 2
This embodiment provides a device for eliminating the influence of faulty memory on a system, which can implement the above method. The device comprises the following modules:
Initialization module: for initializing the operating system;
Fault detection module: for detecting whether a memory fault has occurred;
Fault response module: for starting the MCA Recovery process, generating a fault log, and capturing the affected physical page when a memory fault occurs;
Operating system recovery module: for the MCA Recovery process to restore normal execution of the operating system according to the fault log and the captured affected physical page.
The initialization module initializes the operating system through the following submodules:
Poison mode enabling submodule: for enabling the operating system's Poison mode;
MCELOG daemon starting submodule: for starting the MCELOG daemon;
EIGN initialization submodule: for initializing EIGN.
The operating system recovery module restores normal execution of the system through the following submodules:
Fault information notification submodule: for the MCA Recovery process to notify the operating system of the fault information;
Fault log analysis submodule: for the operating system to analyze the fault log and verify whether recovery is feasible;
Recovery execution submodule: for, when recovery is feasible, taking the affected physical page offline, loading that physical page into a new physical page, and restoring normal execution of the system.
The fault log analysis submodule checks the cause of the memory fault and judges whether the cause meets the recovery condition; if it does, recovery is feasible. In other words, whether the system can be recovered is decided from the cause of the memory fault. For example, if the cause is memory incompatibility or poor contact of a memory module, recovery is feasible and the condition for restoring normal execution of the operating system is met.
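The way the four modules cooperate can be illustrated with a short control-flow sketch. The function names mirror the module names but are hypothetical, and the bodies are placeholders that only print a message; a real device would put the actual initialization, detection, response, and recovery logic in their place.

```shell
#!/bin/sh
# Each function stands in for one module; the bodies are placeholders.
initialization_module()  { echo "init"; }
fault_detection_module() { return 0; }   # exit status 0 = memory fault detected
fault_response_module()  { echo "fault logged, page captured"; }
recovery_module()        { echo "system restored"; }

# run_device: call the modules in the order described in this embodiment.
run_device() {
    initialization_module || return 1
    if fault_detection_module; then
        fault_response_module || return 1
        recovery_module || return 1
    else
        echo "no fault"
    fi
}
```

With the placeholder bodies above, `run_device` prints the three messages in order; redefining `fault_detection_module` to return a non-zero status makes it print "init" followed by "no fault".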
With this method and device, the MCA Recovery process runs automatically when a memory fault is detected. The process gives the server fault tolerance by allowing the system to keep running after an uncorrected error is detected, so the system does not go down and the server's services keep running normally. The fault information is also recorded, so that the faulty part can be replaced and serviced while the server is idle, which improves product quality and stability.
The above discloses only preferred embodiments of the invention, and the invention is not limited to them. Any non-inventive variation that a person skilled in the art can conceive, and any improvements and refinements made without departing from the principles of the invention, shall fall within the scope of protection of the invention.

Claims (8)

1. A method for eliminating the influence of faulty memory on a system, characterized by comprising:
initializing the operating system;
detecting whether a memory fault has occurred;
when a memory fault occurs, starting the MCA Recovery process, generating a fault log, and capturing the affected physical page;
restoring, by the MCA Recovery process, normal execution of the operating system according to the fault log and the captured affected physical page.
2. The method for eliminating the influence of faulty memory on a system according to claim 1, characterized in that restoring, by the MCA Recovery process, normal execution of the operating system according to the fault log and the captured mapping of the affected physical page comprises:
the MCA Recovery process notifying the operating system of the fault information;
the operating system analyzing the fault log and verifying whether recovery is feasible;
if recovery is feasible, taking the affected physical page offline, loading that physical page into a new physical page, and restoring normal execution of the operating system.
3. The method for eliminating the influence of faulty memory on a system according to claim 2, characterized in that the operating system analyzing the fault log and verifying whether recovery is feasible comprises:
checking the cause of the memory fault and judging whether the cause meets the recovery condition; if it does, recovery is feasible.
4. The method for eliminating the influence of faulty memory on a system according to claim 1, characterized in that initializing the operating system comprises:
enabling the operating system's Poison mode;
starting the MCELOG daemon;
initializing EIGN.
5. A device for eliminating the influence of faulty memory on a system, characterized by comprising:
Initialization module: for initializing the operating system;
Fault detection module: for detecting whether a memory fault has occurred;
Fault response module: for starting the MCA Recovery process, generating a fault log, and capturing the affected physical page when a memory fault occurs;
Operating system recovery module: for the MCA Recovery process to restore normal execution of the operating system according to the fault log and the captured affected physical page.
6. The device for eliminating the influence of faulty memory on a system according to claim 5, characterized in that the operating system recovery module comprises:
Fault information notification submodule: for the MCA Recovery process to notify the operating system of the fault information;
Fault log analysis submodule: for the operating system to analyze the fault log and verify whether recovery is feasible;
Recovery execution submodule: for, when recovery is feasible, taking the affected physical page offline, loading that physical page into a new physical page, and restoring normal execution of the system.
7. The device for eliminating the influence of faulty memory on a system according to claim 6, characterized in that the fault log analysis submodule is specifically used to check the cause of the memory fault and judge whether the cause meets the recovery condition; if it does, recovery is feasible.
8. The device for eliminating the influence of faulty memory on a system according to claim 5, characterized in that the initialization module comprises:
Poison mode enabling submodule: for enabling the operating system's Poison mode;
MCELOG daemon starting submodule: for starting the MCELOG daemon;
EIGN initialization submodule: for initializing EIGN.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811348057.5A | 2018-11-13 | 2018-11-13 | Method and device for eliminating the influence of faulty memory on a system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811348057.5A | 2018-11-13 | 2018-11-13 | Method and device for eliminating the influence of faulty memory on a system

Publications (1)

Publication Number | Publication Date
CN109408273A | 2019-03-01

Family

ID=65473031

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811348057.5A | Method and device for eliminating the influence of faulty memory on a system (Pending) | 2018-11-13 | 2018-11-13

Country Status (1)

Country | Link
CN (1) | CN109408273A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1560746A (en) * 2004-02-27 2005-01-05 中国人民解放军国防科学技术大学 Page transport and copy method based on operation system reverse page table
CN102222025A (en) * 2011-06-17 2011-10-19 华为数字技术有限公司 Method and device for eliminating memory failure
CN103198000A (en) * 2013-04-02 2013-07-10 浪潮电子信息产业股份有限公司 Method for positioning faulted memory in linux system
CN103197999A (en) * 2013-03-22 2013-07-10 北京百度网讯科技有限公司 Method and device for automatically positioning internal memory fault
CN105204968A (en) * 2015-11-10 2015-12-30 浪潮(北京)电子信息产业有限公司 Method and device for detecting fault memory
CN105589762A (en) * 2014-08-19 2016-05-18 三星电子株式会社 Memory Devices, Memory Modules And Method For Correction
US20180129553A1 (en) * 2014-08-19 2018-05-10 Samsung Electronics Co., Ltd. Memory devices and modules

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190301