CN109408273A - Method and device for eliminating the impact of failed memory on a system - Google Patents
- Publication number
- CN109408273A (application number CN201811348057.5A)
- Authority
- CN
- China
- Prior art keywords
- failure
- memory
- physical page
- recovery
- operating system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention discloses a method and device for eliminating the impact of failed memory on a system. When a memory failure occurs, the MCA Recovery process is started, a fault log is generated, and the affected physical page is captured; the MCA Recovery process then restores normal execution of the operating system according to the fault log and the captured affected physical page. When a memory failure is detected, the MCA Recovery process runs automatically. This process gives the server fault tolerance: the system is allowed to keep running when an uncorrected error is detected, so the system does not go down and the server's services continue to run normally. In addition, the fault information is recorded so that the faulty part can be replaced and serviced while the server is idle, which improves the quality and stability of the product.
Description
Technical field
The present invention relates to the field of memory failures, and in particular to a method and device for eliminating the impact of failed memory on a system.
Background
Memory is extremely important to the stability of a server, but memory frequently develops various failures in use that bring the system down. Usually the problem can only be solved, and normal execution restored, by restarting the server or replacing the memory. This seriously affects server operation and disrupts the services running on it.
Summary of the invention
To solve the above problem, the present invention provides a method and device for eliminating the impact of failed memory on a system, so that the system does not go down and the server's services continue to run normally.
The technical solution of the present invention is a method for eliminating the impact of failed memory on a system, comprising:
Initializing the operating system;
Detecting whether a memory failure has occurred;
When a memory failure occurs, starting the MCA Recovery process, generating a fault log, and capturing the affected physical page;
The MCA Recovery process restoring normal execution of the operating system according to the fault log and the captured affected physical page.
Further, the MCA Recovery process restoring normal execution of the operating system according to the fault log and the captured affected physical page mapping comprises:
The MCA Recovery process notifying the operating system of the fault information;
The operating system analyzing the fault log and verifying whether recovery is feasible;
If recovery is feasible, taking the affected physical page offline, loading its contents into a new physical page, and restoring normal execution of the operating system.
Further, the operating system analyzing the fault log and verifying whether recovery is feasible comprises:
Checking the cause of the memory failure and judging whether the cause satisfies the recovery condition; if it does, recovery is feasible.
Further, initializing the operating system comprises:
Enabling the operating system's Poison mode;
Starting the MCELOG daemon;
Initializing EINJ.
The invention further discloses a device for eliminating the impact of failed memory on a system, comprising:
Initialization module: used to initialize the operating system;
Fault detection module: used to detect whether a memory failure has occurred;
Fault response module: used to start the MCA Recovery process, generate a fault log, and capture the affected physical page when a memory failure occurs;
Operating system recovery module: used by the MCA Recovery process to restore normal execution of the operating system according to the fault log and the captured affected physical page.
Further, the operating system recovery module comprises:
Fault information notification submodule: used by the MCA Recovery process to notify the operating system of the fault information;
Fault log analysis submodule: used by the operating system to analyze the fault log and verify whether recovery is feasible;
Recovery execution submodule: used, when recovery is feasible, to take the affected physical page offline, load its contents into a new physical page, and restore normal execution of the system.
Further, the fault log analysis submodule is specifically used to check the cause of the memory failure and judge whether the cause satisfies the recovery condition; if it does, recovery is feasible.
Further, the initialization module comprises:
Poison mode enabling submodule: used to enable the operating system's Poison mode;
MCELOG daemon starting submodule: used to start the MCELOG daemon;
EINJ initialization submodule: used to initialize EINJ.
With the method and device provided by the present invention for eliminating the impact of failed memory on a system, the MCA Recovery process runs automatically when a memory failure is detected. This process gives the server fault tolerance: the system is allowed to keep running when an uncorrected error is detected, so the system does not go down and the server's services continue to run normally. In addition, the fault information is recorded so that the faulty part can be replaced and serviced while the server is idle, which improves the quality and stability of the product.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method of embodiment one of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawing and specific embodiments. The following embodiments explain the invention, and the invention is not limited to them.
The "MCA Recovery" feature gives a server "fault tolerance": the system is allowed to keep running when an uncorrected error is detected. This application uses MCA Recovery to eliminate the impact of failed memory on the system and so avoid a system crash when memory fails.
Embodiment one
As shown in Fig. 1, this embodiment provides a method for eliminating the impact of failed memory on a system, which specifically comprises the following steps:
S1-1: Initialize the operating system.
This step specifically comprises the following initialization operations:
S1-11: Enable the operating system's Poison mode;
S1-12: Start the MCELOG daemon;
S1-13: Initialize EINJ.
The commands executed in step S1-13 to initialize EINJ are as follows:
Load EINJ (# modprobe einj param_extension=1)
Mount debugfs (# mount -t debugfs none /sys/kernel/debug)
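As a rough illustration of step S1-13, the sketch below checks whether the EINJ interface is present once the module has been loaded and debugfs mounted. It assumes the usual Linux layout in which the `einj` module exposes its files under `apei/einj` inside debugfs; the function name `einj_ready` and the `debugfs_root` parameter are hypothetical names chosen for this example and do not come from the patent.

```python
from pathlib import Path

# Files the Linux EINJ module is expected to expose under debugfs.
# Assumption: standard kernel layout; the patent does not list these files.
EINJ_FILES = ("available_error_type", "error_type", "error_inject")


def einj_ready(debugfs_root="/sys/kernel/debug"):
    """Return True if every expected EINJ file exists under debugfs_root."""
    einj_dir = Path(debugfs_root) / "apei" / "einj"
    return all((einj_dir / name).exists() for name in EINJ_FILES)


if __name__ == "__main__":
    print("EINJ ready" if einj_ready() else "EINJ not available")
```

Passing the debugfs root as a parameter keeps the check usable on a test directory, without root privileges or real error-injection hardware.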
S1-2: Detect whether a memory failure has occurred.
The memory state is monitored in real time, and as soon as an abnormality is found the next step is carried out immediately.
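The patent does not say how the memory state is monitored. One possible approach on Linux, shown here only as a sketch, is to poll the uncorrected-error counters that the EDAC subsystem publishes in sysfs. The path `/sys/devices/system/edac/mc` and the `ue_count` file name are assumptions about the host system, and both function names are hypothetical.

```python
from pathlib import Path


def uncorrected_error_total(edac_root="/sys/devices/system/edac/mc"):
    """Sum the ue_count counters of all memory controllers under edac_root."""
    total = 0
    for counter in sorted(Path(edac_root).glob("mc*/ue_count")):
        text = counter.read_text().strip()
        if text.isdigit():
            total += int(text)
    return total


def memory_fault_detected(previous_total, edac_root="/sys/devices/system/edac/mc"):
    """Return (fault_seen, new_total); a fault is seen when the total grew."""
    current = uncorrected_error_total(edac_root)
    return current > previous_total, current
```

A monitoring loop would call `memory_fault_detected` periodically, carry `new_total` into the next call, and move on to step S1-3 as soon as `fault_seen` is true.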
S1-3: When a memory failure occurs, start the MCA Recovery process, generate a fault log, and capture the affected physical page.
This step starts the MCA Recovery process and collects the relevant information, which the system recovery in the next step relies on.
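To make "capturing the affected physical page" concrete, the following sketch derives page frame numbers from physical addresses found in a fault log. The log format is an assumption: it expects `ADDR <hex>` fields similar to those printed by mcelog, and it assumes 4 KiB pages. The function name `affected_page_frames` is hypothetical.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, so frame = address >> 12

# Matches fields such as "ADDR 6fd99e00" (hexadecimal physical address).
_ADDR_RE = re.compile(r"\bADDR\s+([0-9a-fA-F]+)\b")


def affected_page_frames(fault_log_text):
    """Return the sorted, de-duplicated page frame numbers named in the log."""
    frames = {
        int(match.group(1), 16) >> PAGE_SHIFT
        for match in _ADDR_RE.finditer(fault_log_text)
    }
    return sorted(frames)
```

Several faulty addresses that fall inside the same page collapse into one frame number, which matches the page-level granularity of the recovery in step S1-4.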
S1-4: The MCA Recovery process restores normal execution of the operating system according to the fault log and the captured affected physical page.
This step restores normal execution of the operating system through the following operations:
S1-41: The MCA Recovery process notifies the operating system of the fault information;
S1-42: The operating system analyzes the fault log and verifies whether recovery is feasible;
S1-43: If recovery is feasible, the affected physical page is taken offline, its contents are loaded into a new physical page, and normal execution of the operating system is restored.
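Step S1-43 can be pictured with a toy model of a page table. The sketch below is a simulation only, not kernel code: the class `PageTable`, its fields, and the way page contents are carried over are all assumptions made for illustration.

```python
class PageTable:
    """Toy model: virtual page number -> physical frame number."""

    def __init__(self, mapping, free_frames, memory):
        self.mapping = dict(mapping)          # virtual page -> frame
        self.free_frames = list(free_frames)  # frames available for remapping
        self.memory = dict(memory)            # frame -> page contents
        self.offline = set()                  # frames retired from use

    def retire_frame(self, bad_frame):
        """Take bad_frame offline and move its users to a fresh frame.

        Returns the replacement frame, or None if nothing was mapped to it.
        Raises RuntimeError when a replacement is needed but none is free.
        """
        users = [vp for vp, fr in self.mapping.items() if fr == bad_frame]
        if users and not self.free_frames:
            raise RuntimeError("no free frame available for recovery")
        self.offline.add(bad_frame)
        contents = self.memory.pop(bad_frame, b"")
        if not users:
            return None
        new_frame = self.free_frames.pop(0)
        # A real system would reload the data from a clean copy;
        # this model simply carries the contents over.
        self.memory[new_frame] = contents
        for vp in users:
            self.mapping[vp] = new_frame
        return new_frame
```

After `retire_frame` returns, no virtual page refers to the retired frame any more, which is the property that lets the operating system continue executing.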
In step S1-42, the operating system analyzing the fault log and verifying whether recovery is feasible specifically comprises: checking the cause of the memory failure and judging whether the cause satisfies the recovery condition; if it does, recovery is feasible. In other words, whether system recovery is feasible is decided from the cause of the memory failure. For example, when the cause is incompatible memory or poor contact of a memory module, system recovery is feasible and the condition for restoring normal execution of the operating system is satisfied.
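The feasibility check described above amounts to a lookup of the failure cause in a set of recoverable causes. A minimal sketch follows; the cause labels and the `cause` field are hypothetical, and the set contains only the two examples the text gives.

```python
# Hypothetical cause labels; the text names only these two examples.
RECOVERABLE_CAUSES = {"incompatible_memory", "poor_module_contact"}


def recovery_feasible(fault_record):
    """Return True when the recorded cause is in the recoverable set."""
    return fault_record.get("cause") in RECOVERABLE_CAUSES
```

A record without a known cause is treated as not recoverable, which is the conservative choice when the log is incomplete.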
In this embodiment the fault log is also saved. When the server is idle, the fault log is reviewed and the faulty part is replaced and serviced.
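Saving the fault log for later maintenance could look like the sketch below, which appends each record to a JSON Lines file and counts records per memory module. The file format, the `module` field, and both function names are assumptions for illustration; the patent does not specify how the log is stored.

```python
import json
from collections import Counter


def append_fault_record(log_path, record):
    """Append one fault record to a JSON Lines file."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def faults_per_module(log_path):
    """Count saved fault records per memory module for maintenance planning."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                counts[json.loads(line).get("module", "unknown")] += 1
    return dict(counts)
```

During an idle window, the per-module counts point maintenance staff to the module that most needs replacing.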
Embodiment two
This embodiment provides a device for eliminating the impact of failed memory on a system, which can implement the above method. The device specifically comprises the following modules:
Initialization module: used to initialize the operating system;
Fault detection module: used to detect whether a memory failure has occurred;
Fault response module: used to start the MCA Recovery process, generate a fault log, and capture the affected physical page when a memory failure occurs;
Operating system recovery module: used by the MCA Recovery process to restore normal execution of the operating system according to the fault log and the captured affected physical page.
The initialization module initializes the operating system through the following submodules:
Poison mode enabling submodule: used to enable the operating system's Poison mode;
MCELOG daemon starting submodule: used to start the MCELOG daemon;
EINJ initialization submodule: used to initialize EINJ.
In addition, the operating system recovery module comprises the following submodules, which together restore normal execution of the system:
Fault information notification submodule: used by the MCA Recovery process to notify the operating system of the fault information;
Fault log analysis submodule: used by the operating system to analyze the fault log and verify whether recovery is feasible;
Recovery execution submodule: used, when recovery is feasible, to take the affected physical page offline, load its contents into a new physical page, and restore normal execution of the system.
It should be noted that the fault log analysis submodule is used to check the cause of the memory failure and judge whether the cause satisfies the recovery condition; if it does, recovery is feasible. In other words, whether system recovery is feasible is decided from the cause of the memory failure. For example, when the cause is incompatible memory or poor contact of a memory module, system recovery is feasible and the condition for restoring normal execution of the operating system is satisfied.
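The module structure of the device can be sketched as four callables wired into one object. This is an illustrative arrangement only: the class name `FaultEliminationDevice`, the field names, and the shape of the fault dictionary are assumptions, not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FaultEliminationDevice:
    """Wires the four modules together; each module is a plain callable."""

    initialize: Callable[[], None]     # initialization module
    detect_fault: Callable[[], bool]   # fault detection module
    respond: Callable[[], dict]        # fault response module: log + page
    recover: Callable[[dict], bool]    # recovery module: True if OS resumed

    def run_once(self) -> Optional[bool]:
        """Run one detection cycle; None means no fault was detected."""
        if not self.detect_fault():
            return None
        fault = self.respond()
        return self.recover(fault)
```

Keeping each module behind a callable makes it possible to swap a real detector or recovery routine for a stub when exercising the control flow.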
With this method and device, the MCA Recovery process runs automatically when a memory failure is detected. This process gives the server fault tolerance: the system is allowed to keep running when an uncorrected error is detected, so the system does not go down and the server's services continue to run normally. In addition, the fault information is recorded so that the faulty part can be replaced and serviced while the server is idle, which improves the quality and stability of the product.
The above discloses only preferred embodiments of the present invention, but the invention is not limited to them. Any non-inventive variation that a person skilled in the art can conceive, and any improvements and refinements made without departing from the principles of the invention, shall fall within the scope of protection of the present invention.
Claims (8)
1. A method for eliminating the impact of failed memory on a system, characterized by comprising:
Initializing the operating system;
Detecting whether a memory failure has occurred;
When a memory failure occurs, starting the MCA Recovery process, generating a fault log, and capturing the affected physical page;
The MCA Recovery process restoring normal execution of the operating system according to the fault log and the captured affected physical page.
2. The method for eliminating the impact of failed memory on a system according to claim 1, characterized in that the MCA Recovery process restoring normal execution of the operating system according to the fault log and the captured affected physical page mapping comprises:
The MCA Recovery process notifying the operating system of the fault information;
The operating system analyzing the fault log and verifying whether recovery is feasible;
If recovery is feasible, taking the affected physical page offline, loading its contents into a new physical page, and restoring normal execution of the operating system.
3. The method for eliminating the impact of failed memory on a system according to claim 2, characterized in that the operating system analyzing the fault log and verifying whether recovery is feasible comprises:
Checking the cause of the memory failure and judging whether the cause satisfies the recovery condition; if it does, recovery is feasible.
4. The method for eliminating the impact of failed memory on a system according to claim 1, characterized in that initializing the operating system comprises:
Enabling the operating system's Poison mode;
Starting the MCELOG daemon;
Initializing EINJ.
5. A device for eliminating the impact of failed memory on a system, characterized by comprising:
Initialization module: used to initialize the operating system;
Fault detection module: used to detect whether a memory failure has occurred;
Fault response module: used to start the MCA Recovery process, generate a fault log, and capture the affected physical page when a memory failure occurs;
Operating system recovery module: used by the MCA Recovery process to restore normal execution of the operating system according to the fault log and the captured affected physical page.
6. The device for eliminating the impact of failed memory on a system according to claim 5, characterized in that the operating system recovery module comprises:
Fault information notification submodule: used by the MCA Recovery process to notify the operating system of the fault information;
Fault log analysis submodule: used by the operating system to analyze the fault log and verify whether recovery is feasible;
Recovery execution submodule: used, when recovery is feasible, to take the affected physical page offline, load its contents into a new physical page, and restore normal execution of the system.
7. The device for eliminating the impact of failed memory on a system according to claim 6, characterized in that the fault log analysis submodule is specifically used to check the cause of the memory failure and judge whether the cause satisfies the recovery condition; if it does, recovery is feasible.
8. The device for eliminating the impact of failed memory on a system according to claim 5, characterized in that the initialization module comprises:
Poison mode enabling submodule: used to enable the operating system's Poison mode;
MCELOG daemon starting submodule: used to start the MCELOG daemon;
EINJ initialization submodule: used to initialize EINJ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811348057.5A CN109408273A (en) | 2018-11-13 | 2018-11-13 | Method and device for eliminating the impact of failed memory on a system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811348057.5A CN109408273A (en) | 2018-11-13 | 2018-11-13 | Method and device for eliminating the impact of failed memory on a system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109408273A (en) | 2019-03-01 |
Family ID: 65473031
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811348057.5A Pending CN109408273A (en) | Method and device for eliminating the impact of failed memory on a system | 2018-11-13 | 2018-11-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408273A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1560746A (en) * | 2004-02-27 | 2005-01-05 | 中国人民解放军国防科学技术大学 | Page transport and copy method based on operation system reverse page table |
CN102222025A (en) * | 2011-06-17 | 2011-10-19 | 华为数字技术有限公司 | Method and device for eliminating memory failure |
CN103198000A (en) * | 2013-04-02 | 2013-07-10 | 浪潮电子信息产业股份有限公司 | Method for positioning faulted memory in linux system |
CN103197999A (en) * | 2013-03-22 | 2013-07-10 | 北京百度网讯科技有限公司 | Method and device for automatically positioning internal memory fault |
CN105204968A (en) * | 2015-11-10 | 2015-12-30 | 浪潮(北京)电子信息产业有限公司 | Method and device for detecting fault memory |
CN105589762A (en) * | 2014-08-19 | 2016-05-18 | 三星电子株式会社 | Memory Devices, Memory Modules And Method For Correction |
US20180129553A1 (en) * | 2014-08-19 | 2018-05-10 | Samsung Electronics Co., Ltd. | Memory devices and modules |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5948112A (en) | Method and apparatus for recovering from software faults | |
US7991961B1 (en) | Low-overhead run-time memory leak detection and recovery | |
US7409594B2 (en) | System and method to detect errors and predict potential failures | |
WO2017063505A1 (en) | Method for detecting hardware fault of server, apparatus thereof, and server | |
CN106789306B (en) | Method and system for detecting, collecting and recovering software fault of communication equipment | |
US20050229042A1 (en) | Computer boot operation utilizing targeted boot diagnostics | |
CN105959802A (en) | Intelligent television fault information collection method and device | |
CN105607973B (en) | Method, device and system for processing equipment fault in virtual machine system | |
US11853150B2 (en) | Method and device for detecting memory downgrade error | |
JP2006259869A (en) | Multiprocessor system | |
CN110609778A (en) | Method and system for storing server downtime log | |
CN107368384A (en) | A kind of Linux server abnormal information dump system and method | |
CN102369513A (en) | Method for improving stability of computer system and computer system | |
Lee et al. | Measurement-based evaluation of operating system fault tolerance | |
CN101145983B (en) | A self-diagnosis and self-discovery subsystem and method of network management system | |
CN103049345A (en) | Magnetic disk state transition detection method and device based on asynchronous communication mechanism | |
CN108762886B (en) | Fault detection recovery method and system for virtual machine | |
CN113010341A (en) | Method and equipment for positioning fault memory | |
CN111159051B (en) | Deadlock detection method, deadlock detection device, electronic equipment and readable storage medium | |
CN109408273A (en) | Method and device for eliminating the impact of failed memory on a system | |
CN116501705A (en) | RAS-based memory information collecting and analyzing method, system, equipment and medium | |
CN114116330B (en) | Server performance testing method, system, terminal and storage medium | |
CN115098291A (en) | Method, system, storage medium and equipment for recording system restart reason | |
JP6164283B2 (en) | Software safe stop system, software safe stop method, and program | |
CN114003416A (en) | Dynamic memory error processing method, system, terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190301 |