CN108958993B - Linux-based online memory detector MEMDOG - Google Patents
- Publication number
- CN108958993B
- Authority
- CN
- China
- Prior art keywords
- memory
- user space
- timer
- linux
- reliable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The invention discloses MEMDOG, a Linux-based online memory detector that mainly comprises four parts: a detection algorithm framework, a reliable memory pool, application program memory migration, and a timer. The detection algorithm framework requests memory from the Linux memory manager and tests it with the memory detection algorithm selected by the user. The reliable memory pool stores memory that the detection algorithm has found to be error-free; application programs must obtain their memory from this pool, so all memory they use has been tested. Because memory errors arise over time, the application program memory migration part periodically migrates the data and code of application programs from expired memory to recently tested memory, which requires a timer. The timer also periodically refreshes the reliable memory pool by releasing expired memory in the pool back to the Linux system. The MEMDOG online memory detector thus addresses the problem of application programs being affected by memory errors.
Description
Technical Field
The invention belongs to the field of computer software, specifically operating systems, and relates to MEMDOG, an online memory detector based on Linux.
Background
As memory chips grow in capacity while their feature size keeps shrinking, the cell that stores a single bit becomes ever smaller, and memory chips become more susceptible to errors caused by external factors (high temperature, dust, cosmic rays, etc.). The most common of these errors is a flip of one or more bits in a memory cell, which typically causes operating system and application crashes. More harmful still is a silent error: the faulty memory is used by a program without the error being discovered, and the program keeps running. The program's results then become unpredictable, and the cause of the error is difficult to find.
Bianca Schroeder and colleagues tracked a large number of machines in Google's data centers over the 2.5 years from January 2006 to June 2008 to measure the incidence of memory errors, and the results show that more than 8% of memory chips are affected by hardware errors every year; results of a study by the engineer Edmund b. Research on memory reliability is therefore of great significance. Because existing memory detectors are few and inefficient, this Linux-based online memory detector was developed.
Disclosure of Invention
The invention provides an online memory detector MEMDOG based on Linux, which solves the problem that an application program is influenced by memory errors.
To solve this technical problem, the invention adopts the following technical scheme. An online memory detector MEMDOG based on Linux comprises a Linux memory manager, a detection algorithm framework, a detection algorithm, a user space interface a, an error memory collector, a user space memory error reporting program, a reliable memory pool, a reliable memory pool timer, a user space interface b, all processes in the system, an application program memory migration timer, a user space interface c, a main switch and a user space interface d, and is characterized in that: the Linux memory manager is the memory management subsystem of the Linux operating system; the detection algorithm framework is a container that collects the various detection algorithms; the detection algorithm is collected in the detection algorithm framework and is used to detect whether memory contains errors; the user space interface a is a configuration file through which the user selects a detection algorithm; the error memory collector is a linked list that collects memory containing errors; the user space memory error reporting program is a program that reports memory errors found during detection to the user; the reliable memory pool is a linked list that collects memory found error-free by the detection algorithm and allocates it to application programs; the reliable memory pool timer is a timer that periodically cleans expired memory out of the reliable memory pool; the user space interface b is a configuration file through which the user sets the reliable memory pool timer period and the expiration period of memory in the reliable memory pool; all processes in the system refers to all application programs in the system; the application program memory migration timer is a timer that periodically migrates data and code from memory that an application program has used for longer than a set period to recently tested memory; the user space interface c is a user space interface for setting the period of the memory migration timer and the expiration period of memory used by application programs; the main switch is a switch that turns MEMDOG on and off; and the user space interface d is a user space interface for operating the MEMDOG main switch.
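The relationship between the detection algorithm framework and user space interface a can be sketched as a small registry. This is a user-space Python illustration only, not the patented kernel implementation: the class name `DetectionFramework`, its methods, and the toy `write_read_pattern` algorithm are all hypothetical, and memory is modelled as a `bytearray`.

```python
from typing import Callable, Dict, Optional

# A detection algorithm takes a block of memory (modelled as a bytearray)
# and returns True when it finds no error.
DetectionAlgorithm = Callable[[bytearray], bool]


class DetectionFramework:
    """Hypothetical container that collects detection algorithms by name."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, DetectionAlgorithm] = {}
        self._selected: Optional[str] = None

    def register(self, name: str, algorithm: DetectionAlgorithm) -> None:
        """Add an algorithm to the framework under a unique name."""
        if name in self._algorithms:
            raise ValueError(f"algorithm {name!r} is already registered")
        self._algorithms[name] = algorithm

    def select(self, name: str) -> None:
        """Stands in for the user writing a choice to user space interface a."""
        if name not in self._algorithms:
            raise KeyError(f"unknown detection algorithm: {name!r}")
        self._selected = name

    def check(self, block: bytearray) -> bool:
        """Run the currently selected algorithm on one block."""
        if self._selected is None:
            raise RuntimeError("no detection algorithm selected")
        return self._algorithms[self._selected](block)


def write_read_pattern(block: bytearray) -> bool:
    """Toy algorithm: write 0x55, then 0xAA, to every byte and read it back."""
    for pattern in (0x55, 0xAA):
        for i in range(len(block)):
            block[i] = pattern
        if any(b != pattern for b in block):
            return False
    return True
```

Keeping selection separate from registration mirrors the description, in which algorithms are collected in the framework and the user only names the one to run.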
The working principle of the technical scheme is as follows. The user sets the memory detection algorithm used by MEMDOG through user space interface a; the detection algorithm framework requests memory from the Linux memory manager and tests the requested memory with the selected algorithm. If the algorithm finds that the memory contains errors, the memory is placed in the error memory collector so that programs can no longer use it, and the user space memory error reporting program reports the error. If the tested memory contains no errors, it is placed in the reliable memory pool for application programs to use, which ensures that the memory used by application programs is reliable. Because memory can develop errors as time passes, the reliable memory pool timer periodically releases memory that has stayed in the reliable memory pool beyond a set expiration period; the user can set both the timer period and the expiration period through user space interface b. The memory migration timer periodically migrates the data and code held in memory that an application program has used beyond a set expiration period to recently tested memory, so that memory is not used for so long that errors develop; the user can set both the migration timer period and the expiration period of memory used by application programs through user space interface c. MEMDOG can be turned on and off through user space interface d.
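The routing described above (tested memory goes either to the error memory collector or to the reliable memory pool, and applications allocate only from the pool) can be sketched as follows. This is a simplified user-space model under stated assumptions: `Block`, `Memdog`, `check_block` and `allocate` are hypothetical names, the `faulty` flag stands in for a real hardware defect, and plain Python containers replace the kernel linked lists.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional


@dataclass
class Block:
    """One unit of memory under test; `faulty` simulates a hardware defect."""
    block_id: int
    faulty: bool = False
    checked_at: Optional[float] = None  # time of the last successful test


@dataclass
class Memdog:
    detect: Callable[[Block], bool]  # returns True when no error is found
    reliable_pool: Deque[Block] = field(default_factory=deque)
    error_collector: List[Block] = field(default_factory=list)
    reports: List[str] = field(default_factory=list)

    def check_block(self, block: Block, now: float) -> None:
        """Test one block and route it to the pool or the error collector."""
        if self.detect(block):
            block.checked_at = now
            self.reliable_pool.append(block)
        else:
            # Faulty memory is held back so no program can use it again,
            # and a report is queued for the user.
            self.error_collector.append(block)
            self.reports.append(f"memory error in block {block.block_id}")

    def allocate(self) -> Block:
        """Applications only ever receive blocks from the reliable pool."""
        if not self.reliable_pool:
            raise MemoryError("reliable memory pool is empty")
        return self.reliable_pool.popleft()
```

For example, feeding three blocks of which the second is faulty leaves two blocks in the pool, one in the error collector, and one queued report.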
The beneficial effects of the invention are:
1. Application programs running in the Linux operating system are protected from memory hardware errors.
2. A new and efficient memory reliability mechanism is provided, adding to the diversity of memory reliability mechanisms.
Drawings
FIG. 1 is a schematic diagram of the overall design of the present invention. In the figure, 1 is the Linux memory manager, 2 is user space interface a, 3 is the detection algorithm, 4 is the detection algorithm framework, 5 is the error memory collector, 6 is the user space memory error reporting program, 7 is the reliable memory pool, 8 is the reliable memory pool timer, 9 is user space interface b, 10 is all processes in the system, 11 is the application program memory migration timer, 12 is user space interface c, 13 is the main switch, and 14 is user space interface d.
Detailed Description
A user selects the required detection algorithm through user space interface a; sets the reliable memory pool timer period and the expiration period of memory in the reliable memory pool through user space interface b; sets the memory migration timer period and the expiration period of memory used by application programs through user space interface c; and enables the online memory detector through user space interface d. The online memory detector then enters the enabled state and protects application programs from memory errors.
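As an illustration of what a selectable detection algorithm might look like, the sketch below implements March C-, a well-known member of the March family of memory tests. The patent does not say which March variant MEMDOG uses, so the choice of March C- is an assumption, and `SimulatedMemory` with its stuck-at fault model is a hypothetical stand-in for real memory.

```python
from typing import Dict, List, Optional


class SimulatedMemory:
    """Bit-addressable memory with optional stuck-at faults (illustration only)."""

    def __init__(self, size: int, stuck_at: Optional[Dict[int, int]] = None) -> None:
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})  # address -> value the cell is stuck at

    def __len__(self) -> int:
        return len(self.cells)

    def write(self, addr: int, bit: int) -> None:
        # A stuck-at cell ignores the write and keeps its faulty value.
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr: int) -> int:
        return self.stuck_at.get(addr, self.cells[addr])


def march_c_minus(mem: SimulatedMemory) -> List[int]:
    """Return the sorted addresses at which a read returned an unexpected bit."""
    bad = set()
    n = len(mem)
    up = range(n)                # ascending address order
    down = range(n - 1, -1, -1)  # descending address order

    def element(order, expect: Optional[int], write: Optional[int]) -> None:
        # One March element: optionally read-and-compare, then optionally write.
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                bad.add(addr)
            if write is not None:
                mem.write(addr, write)

    element(up, None, 0)    # any order: w0
    element(up, 0, 1)       # ascending: r0, w1
    element(up, 1, 0)       # ascending: r1, w0
    element(down, 0, 1)     # descending: r0, w1
    element(down, 1, 0)     # descending: r1, w0
    element(up, 0, None)    # any order: r0
    return sorted(bad)
```

A call such as `march_c_minus(SimulatedMemory(16, {5: 1}))` flags address 5, because the first read expecting 0 sees the stuck value 1.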
Example 1
Using the command-line terminal provided by the Linux operating system, the user enters the directory containing user space interface a and writes the March detection algorithm into it. The user then enters the directory containing user space interface b and writes the reliable memory pool timer period and the memory expiration period, 3600 seconds and 7200 seconds respectively. Next, the user enters the directory containing user space interface c and writes the memory migration timer period and the memory expiration period, 7200 seconds and 10800 seconds respectively. Finally, the user enters the directory containing user space interface d and writes 'yes' into it, which turns on the online memory detector MEMDOG. Application programs in the system are then under MEMDOG protection: the reliable memory pool timer fires every 3600 seconds and checks whether any memory in the reliable memory pool has exceeded the 7200-second expiration period; the memory migration timer fires every 7200 seconds and checks whether any memory used by application programs has exceeded the 10800-second expiration period. If MEMDOG detects a memory error, the error is reported to the user through the user space memory error reporting program.
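The timing in Example 1 can be traced with a small simulation. The four constants come from the example; everything else is an assumption made for illustration: the names `Block`, `expired` and `simulate`, the reading of "exceeds" as strictly greater than, and the simplification that a migrated block counts as freshly tested at the moment of migration.

```python
from dataclasses import dataclass
from typing import List, Tuple

POOL_TIMER_PERIOD = 3600       # seconds, from Example 1
POOL_EXPIRY = 7200             # seconds, from Example 1
MIGRATION_TIMER_PERIOD = 7200  # seconds, from Example 1
MIGRATION_EXPIRY = 10800       # seconds, from Example 1


@dataclass
class Block:
    block_id: int
    checked_at: int  # time of the last successful test, in seconds


def expired(blocks: List[Block], now: int, limit: int) -> List[Block]:
    """Blocks whose last test is more than `limit` seconds old."""
    return [b for b in blocks if now - b.checked_at > limit]


def simulate(pool: List[Block], in_use: List[Block],
             horizon: int) -> List[Tuple[int, str, List[int]]]:
    """Return (time, action, block ids) for every timer firing that acts."""
    events = []
    ticks = sorted(
        set(range(POOL_TIMER_PERIOD, horizon + 1, POOL_TIMER_PERIOD))
        | set(range(MIGRATION_TIMER_PERIOD, horizon + 1, MIGRATION_TIMER_PERIOD))
    )
    for now in ticks:
        if now % POOL_TIMER_PERIOD == 0:
            stale = expired(pool, now, POOL_EXPIRY)
            for block in stale:
                pool.remove(block)  # released back to the system
            if stale:
                events.append((now, "release", [b.block_id for b in stale]))
        if now % MIGRATION_TIMER_PERIOD == 0:
            old = expired(in_use, now, MIGRATION_EXPIRY)
            for block in old:
                block.checked_at = now  # assumption: data now sits in fresh memory
            if old:
                events.append((now, "migrate", [b.block_id for b in old]))
    return events
```

With one pooled block and one in-use block both tested at time 0, the pooled block is released at the 10800-second tick (the first pool tick at which its age exceeds 7200 seconds) and the in-use block is migrated at the 14400-second tick.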
Claims (8)
1. A Linux-based online memory detector (also called MEMDOG), comprising a Linux memory manager, a detection algorithm framework, a detection algorithm, a user space interface a, an error memory collector, a user space memory error reporting program, a reliable memory pool, a reliable memory pool timer, a user space interface b, all processes in the system, an application program memory migration timer, a user space interface c, a main switch and a user space interface d, characterized in that: the Linux memory manager is the memory management subsystem of the Linux operating system; the detection algorithm framework is a container that collects the various detection algorithms; the detection algorithm is collected in the detection algorithm framework and is used to detect whether memory contains errors; the user space interface a is a configuration file through which the user selects a detection algorithm; the error memory collector is a linked list that collects memory containing errors; the user space memory error reporting program is a program that reports memory errors found during detection to the user; the reliable memory pool is a linked list that collects memory found error-free by the detection algorithm and allocates it to application programs; the reliable memory pool timer is a timer that periodically cleans expired memory out of the reliable memory pool; the user space interface b is a configuration file through which the user sets the reliable memory pool timer period and the expiration period of memory in the reliable memory pool; all processes in the system refers to all application programs in the system; the application program memory migration timer is a timer that periodically migrates data and code from memory that an application program has used for longer than a set period to recently tested memory; the user space interface c is a user space interface for setting the period of the memory migration timer and the expiration period of memory used by application programs; the main switch is a switch that turns MEMDOG on and off; and the user space interface d is a user space interface for operating the MEMDOG main switch.
2. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein the detection algorithm framework is implemented in the Linux kernel and provides an algorithm registration interface for collecting memory detection algorithms, and the detection algorithms collected in the detection algorithm framework can be selected through the user space interface a.
3. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein the error memory collector is a linked list that resides in the kernel and is used to collect memory containing errors.
4. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein execution of the user space memory error reporting program is triggered by the kernel.
5. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein the reliable memory pool is a linked list implemented in kernel space and used to collect memory that has been checked by the memory detection algorithm.
6. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein the reliable memory pool timer periodically checks the memory in the reliable memory pool and releases expired memory, and the period of the reliable memory pool timer and the expiration period of the memory can be set through the user space interface b.
7. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein the memory migration timer periodically checks the memory used by application programs in the system and migrates the data and code held in memory that an application program has used beyond a set expiration period to recently tested memory, and the period of the memory migration timer and the memory expiration period can be set through the user space interface c.
8. The Linux-based online memory detector (also called MEMDOG) as claimed in claim 1, wherein the main switch turns the online memory detector MEMDOG on and off, and the main switch can be set through the user space interface d.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710351727.8A CN108958993B (en) | 2017-05-18 | 2017-05-18 | Linux-based online memory detector MEMDOG |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710351727.8A CN108958993B (en) | 2017-05-18 | 2017-05-18 | Linux-based online memory detector MEMDOG |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108958993A (en) | 2018-12-07 |
CN108958993B (en) | 2021-11-19 |
Family
ID=64461860
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710351727.8A Active CN108958993B (en) | 2017-05-18 | 2017-05-18 | Linux-based online memory detector MEMDOG |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108958993B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111223516B (en) * | 2019-12-26 | 2021-09-07 | 曙光信息产业(北京)有限公司 | RAID card detection method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101110044A (en) * | 2007-08-28 | 2008-01-23 | 中兴通讯股份有限公司 | Method and system for internal memory monitoring management |
CN102915276A (en) * | 2012-09-25 | 2013-02-06 | 武汉邮电科学研究院 | Memory control method for embedded systems |
CN106598871A (en) * | 2016-12-29 | 2017-04-26 | 山东鲁能智能技术有限公司 | Automatic analysis method and system for collapse file under Linux |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9501389B1 (en) * | 2015-08-20 | 2016-11-22 | International Business Machines Corporation | Test machine management |
Non-Patent Citations (2)
Title |
---|
MEI: A Light Weight Memory Error Injection Tool for Validating Online Memory Testers; Xiaoqiang Wang et al.; IEEE; 2017-01-16; full text * |
MEMDOG: A Linux-Based Online Memory Detector; Wang Xiaoqiang; Wanfang Data; 2018-06-13; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN108958993A (en) | 2018-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100419695C (en) | Vectoring process-kill errors to an application program | |
Bautista-Gomez et al. | Unprotected computing: A large-scale study of dram raw error rate on a supercomputer | |
Schroeder et al. | DRAM errors in the wild: a large-scale field study | |
Hwang et al. | Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design | |
Mukherjee et al. | Cache scrubbing in microprocessors: Myth or necessity? | |
CN111552590B (en) | Detection and recovery method and system for memory bit overturning of power secondary equipment | |
WO2017079454A1 (en) | Storage error type determination | |
US7610523B1 (en) | Method and template for physical-memory allocation for implementing an in-system memory test | |
US20110252276A1 (en) | Low-overhead run-time memory leak detection and recovery | |
US7380169B2 (en) | Converting merge buffer system-kill errors to process-kill errors | |
US20100306489A1 (en) | Error management firewall in a multiprocessor computer | |
CN102272731A (en) | Apparatus, system, and method for predicting failures in solid-state storage | |
CN105224888B (en) | A kind of data of magnetic disk array protection system based on safe early warning technology | |
Siddiqua et al. | Analysis and modeling of memory errors from large-scale field data collection | |
CN112559395B (en) | Relay protection device and method based on dual-Soc storage system exception handling mechanism | |
US10095570B2 (en) | Programmable device, error storage system, and electronic system device | |
US11586496B2 (en) | Electronic circuit with integrated SEU monitor | |
Bottoni et al. | Heavy ions test result on a 65nm sparc-v8 radiation-hard microprocessor | |
Meza | Large scale studies of memory, storage, and network failures in a modern data center | |
CN108958993B (en) | Linux-based online memory detector MEMDOG | |
Radojkovic et al. | Towards resilient EU HPC systems: A blueprint | |
Dweik et al. | Reliability-aware exceptions: Tolerating intermittent faults in microprocessor array structures | |
US7315961B2 (en) | Black box recorder using machine check architecture in system management mode | |
CN105068969B (en) | Single particle effect guard system and method for digital signal processing platform framework | |
CN104167224A (en) | Method for reducing DRAM soft error |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||