CN108958993B - Linux-based online memory detector MEMDOG - Google Patents

Publication number
CN108958993B
Authority
CN
China
Prior art keywords
memory
user space
timer
linux
reliable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710351727.8A
Other languages
Chinese (zh)
Other versions
CN108958993A (en)
Inventor
周庆国
王小强
段鸣
周睿
李飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou University
Original Assignee
Lanzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou University
Priority to CN201710351727.8A
Publication of CN108958993A
Application granted
Publication of CN108958993B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention discloses MEMDOG, a Linux-based online memory detector that consists mainly of four parts: a detection algorithm framework, a reliable memory pool, application program memory migration, and a timer. The detection algorithm framework requests memory from the Linux memory manager and tests it with the memory detection algorithm selected by the user. The reliable memory pool stores memory that the detection algorithm has found to be free of errors; application programs obtain their memory from this pool, so all memory they use has been tested. Because memory errors can appear over time, the memory migration part periodically migrates the data and code of application programs from expired memory to recently tested memory, which requires a timer. The timer also periodically refreshes the reliable memory pool by releasing expired memory in the pool back to the Linux system. MEMDOG thereby solves the problem of application programs being affected by memory errors.

Description

Linux-based online memory detector MEMDOG
Technical Field
The invention belongs to the field of computer software, specifically operating systems, and relates to MEMDOG, a Linux-based online memory detector.
Background
As memory chips grow in capacity, their feature size keeps shrinking, which means that the storage cell holding a single bit keeps getting smaller and becomes more susceptible to errors caused by external factors (high temperature, dust, cosmic rays, etc.). The most common such error is a flip of one or more bits in a memory cell, which typically crashes the operating system or an application. More harmful still is a silent error: a program uses the memory in which the error occurred without the error being detected and continues to run, so its results become unpredictable and the cause of the fault is hard to find.
Bianca Schroeder and colleagues tracked a large number of machines in Google's machine rooms over the 2.5 years from January 2006 to June 2008 to measure the probability of memory errors; the results show that more than 8% of memory chips are affected by hardware errors every year. Results of a study by the engineer Edmund B. [...] Research on memory reliability is therefore of great significance. Existing memory detectors are few and inefficient, which motivated the development of this Linux-based online memory detector.
Disclosure of Invention
The invention provides MEMDOG, a Linux-based online memory detector, which solves the problem of application programs being affected by memory errors.
To solve the above technical problem, the invention adopts the following technical scheme. MEMDOG, a Linux-based online memory detector, comprises a Linux memory manager, a detection algorithm framework, a detection algorithm, a user space interface a, an error memory collector, a user space memory error reporting program, a reliable memory pool, a reliable memory pool timer, a user space interface b, all processes in the system, an application program memory migration timer, a user space interface c, a main switch, and a user space interface d, and is characterized in that: the Linux memory manager is the memory management subsystem of the Linux operating system; the detection algorithm framework is a container that collects the various detection algorithms; a detection algorithm is collected in the detection algorithm framework and is used to detect whether memory contains errors; the user space interface a is a configuration file with which the user selects a detection algorithm; the error memory collector is a linked list that collects memory containing errors; the user space memory error reporting program is a program that reports memory errors found during detection to the user; the reliable memory pool is a linked list that collects memory found by a detection algorithm to be free of errors and allocates it to application programs; the reliable memory pool timer is a timer that periodically cleans expired memory out of the reliable memory pool; the user space interface b is a configuration file with which the user sets the period of the reliable memory pool timer and the expiration period of memory in the reliable memory pool; all processes in the system refers to all application programs in the system; the application program memory migration timer is a timer that periodically migrates the data and code held in memory that an application program has used for longer than a certain period into recently detected memory; the user space interface c is a user space interface for setting the period of the memory migration timer and the expiration period of memory used by application programs; the main switch is a switch that turns MEMDOG on and off; the user space interface d is a user space interface for operating the MEMDOG main switch.
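The relationship between the detection algorithm framework, the detection algorithms it collects, and the selection made through user space interface a can be illustrated with a small user-space sketch. This is not the kernel implementation described by the invention: the class `DetectionFramework`, its methods `register`, `select`, and `check`, and the toy `pattern_test` algorithm are all hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Optional

# A detection algorithm receives a writable block and returns True if no error was found.
DetectionAlgorithm = Callable[[bytearray], bool]


class DetectionFramework:
    """Hypothetical stand-in for the detection algorithm framework (a container of algorithms)."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, DetectionAlgorithm] = {}
        self._selected: Optional[str] = None

    def register(self, name: str, algorithm: DetectionAlgorithm) -> None:
        """Collect an algorithm under a name the user can later select."""
        self._algorithms[name] = algorithm

    def select(self, name: str) -> None:
        """Stands in for the user writing an algorithm name into user space interface a."""
        if name not in self._algorithms:
            raise KeyError(f"unknown detection algorithm: {name}")
        self._selected = name

    def check(self, block: bytearray) -> bool:
        """Run the currently selected algorithm on one block of memory."""
        if self._selected is None:
            raise RuntimeError("no detection algorithm selected")
        return self._algorithms[self._selected](block)


def pattern_test(block: bytearray) -> bool:
    """Toy algorithm (an assumption): fill with 0x00, then 0xFF, reading back each time."""
    for pattern in (0x00, 0xFF):
        for i in range(len(block)):
            block[i] = pattern
        if any(value != pattern for value in block):
            return False
    return True


framework = DetectionFramework()
framework.register("pattern", pattern_test)
framework.select("pattern")
block_is_clean = framework.check(bytearray(64))
```

Keeping the algorithms behind a name-to-function mapping is one simple way to let the selection change without touching the code that tests memory; the invention itself does not specify how the container is organized.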
The working principle of the technical scheme is as follows. The user selects the memory detection algorithm used by MEMDOG through the user space interface a; the detection algorithm framework requests memory from the Linux memory manager and tests the requested memory with the selected algorithm. If the algorithm detects errors, the faulty memory is placed in the error memory collector, which ensures that it can never be used by a program again, and the user space memory error reporting program reports the error. If the tested memory contains no errors, it is placed in the reliable memory pool for application programs to use, which ensures the reliability of the memory they use. Because memory can develop errors as time passes, the reliable memory pool timer periodically releases memory that has stayed in the reliable memory pool longer than a set limit; the user sets the timer period and the memory expiration period through the user space interface b. The memory migration timer periodically migrates the data and code held in memory that an application program has used for longer than a set period into recently detected memory, which prevents errors caused by memory being in use for too long; the user sets the period of the memory migration timer and the expiration period of memory used by application programs through the user space interface c. MEMDOG is turned on and off through the user space interface d.
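The detect-and-sort step described above can be sketched in user space as follows. Everything here is an assumption made for illustration: `SimulatedBlock` models a memory block with an optional stuck byte, `march_test` is a simplified March-style read/write sequence rather than the algorithm used by the invention, and plain Python lists stand in for the kernel linked lists of the reliable memory pool and the error memory collector.

```python
from typing import Callable, Dict, List, Optional, Tuple


class SimulatedBlock:
    """Simulated memory block; stuck_at maps a byte offset to the value it always reads as."""

    def __init__(self, size: int, stuck_at: Optional[Dict[int, int]] = None) -> None:
        self.cells = bytearray(size)
        self.stuck_at = stuck_at or {}
        self.checked_at: Optional[float] = None  # time of the last successful test

    def __len__(self) -> int:
        return len(self.cells)

    def write(self, offset: int, value: int) -> None:
        self.cells[offset] = value

    def read(self, offset: int) -> int:
        # A stuck cell ignores what was written and always returns its stuck value.
        return self.stuck_at.get(offset, self.cells[offset])


def march_test(block: SimulatedBlock) -> bool:
    """Simplified March-style test: returns True if no error is observed."""
    n = len(block)
    for i in range(n):                 # ascending: write 0x00
        block.write(i, 0x00)
    for i in range(n):                 # ascending: read 0x00, then write 0xFF
        if block.read(i) != 0x00:
            return False
        block.write(i, 0xFF)
    for i in reversed(range(n)):       # descending: read 0xFF, then write 0x00
        if block.read(i) != 0xFF:
            return False
        block.write(i, 0x00)
    for i in range(n):                 # final pass: read 0x00
        if block.read(i) != 0x00:
            return False
    return True


def detect_and_sort(
    blocks: List[SimulatedBlock],
    algorithm: Callable[[SimulatedBlock], bool],
    now: float,
) -> Tuple[List[SimulatedBlock], List[SimulatedBlock]]:
    """Test each requested block; clean ones go to the pool, faulty ones to the collector."""
    reliable_pool: List[SimulatedBlock] = []
    error_collector: List[SimulatedBlock] = []
    for block in blocks:
        if algorithm(block):
            block.checked_at = now     # remembered so a timer can expire the block later
            reliable_pool.append(block)
        else:
            error_collector.append(block)  # never handed out to applications again
    return reliable_pool, error_collector


healthy = SimulatedBlock(16)
faulty = SimulatedBlock(16, stuck_at={5: 0x04})
pool, errors = detect_and_sort([healthy, faulty], march_test, now=0.0)
```

In this sketch the faulty block is caught on the first read pass, because offset 5 reads back 0x04 where 0x00 was written, so it ends up in the collector while the healthy block enters the pool with its test time recorded.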
The beneficial effects of the invention are:
1. Application programs running on the Linux operating system are protected from the effects of memory hardware errors.
2. A new and efficient memory reliability mechanism is provided, contributing to the diversity of memory reliability mechanisms.
Drawings
FIG. 1 is a schematic diagram of the overall design of the present invention. In the figure, 1 is the Linux memory manager, 2 is the user space interface a, 3 is the detection algorithm, 4 is the detection algorithm framework, 5 is the error memory collector, 6 is the user space memory error reporting program, 7 is the reliable memory pool, 8 is the reliable memory pool timer, 9 is the user space interface b, 10 is all processes in the system, 11 is the application program memory migration timer, 12 is the user space interface c, 13 is the main switch, and 14 is the user space interface d.
Detailed Description
The user selects the required detection algorithm through the user space interface a, sets the period of the reliable memory pool timer and the expiration period of memory in the reliable memory pool through the user space interface b, sets the memory migration period and the expiration period of memory used by application programs through the user space interface c, and enables the online memory detector through the user space interface d. The online memory detector then enters the on state and protects application programs from memory errors.
Example 1
Using a command-line terminal provided by the Linux operating system, the user enters the directory containing the user space interface a and writes the March detection algorithm into it; enters the directory containing the user space interface b and writes the timer period and the memory expiration period of the reliable memory pool timer into it, 3600 seconds and 7200 seconds respectively; enters the directory containing the user space interface c and writes the memory migration period and the memory expiration period into it, 7200 seconds and 10800 seconds respectively; and enters the directory containing the user space interface d and writes 'yes' into it, which turns on the online memory detector MEMDOG. The application programs in the system are then under MEMDOG protection. The reliable memory pool timer fires once every 3600 seconds and checks whether any memory in the reliable memory pool has exceeded the expiration period of 7200 seconds; the memory migration timer fires once every 7200 seconds and checks whether any memory used by application programs in the system has exceeded the expiration period of 10800 seconds. If MEMDOG detects a memory error, the error is reported to the user through the user space memory error reporting program.
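The timing in Example 1 can be worked through with a short calculation. The sketch below rests on two assumptions that the text does not state: each timer fires at whole multiples of its period counted from the moment MEMDOG is switched on, and memory counts as expired only when its age is strictly greater than the expiration period. The function name `first_action_time` is hypothetical.

```python
def first_action_time(checked_at: int, timer_period: int, expiry: int) -> int:
    """Return the first timer tick (in seconds) at which memory tested at
    `checked_at` is older than `expiry` and is therefore released or migrated."""
    tick = timer_period
    # Keep advancing to the next tick while the memory is still within its expiry.
    while tick - checked_at <= expiry:
        tick += timer_period
    return tick


# Reliable memory pool timer from Example 1: period 3600 s, expiration 7200 s.
pool_release = first_action_time(checked_at=0, timer_period=3600, expiry=7200)

# Memory migration timer from Example 1: period 7200 s, expiration 10800 s.
migration = first_action_time(checked_at=0, timer_period=7200, expiry=10800)

print(pool_release)  # 10800
print(migration)     # 14400
```

Under these assumptions, memory tested at time 0 is released from the pool at the 10800-second tick and application memory is migrated at the 14400-second tick, so memory can remain in use for up to one timer period beyond its expiration period before the timer acts on it.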

Claims (8)

1. A Linux-based online memory detector (also called MEMDOG), comprising a Linux memory manager, a detection algorithm framework, a detection algorithm, a user space interface a, an error memory collector, a user space memory error reporting program, a reliable memory pool, a reliable memory pool timer, a user space interface b, all processes in the system, an application program memory migration timer, a user space interface c, a main switch, and a user space interface d, characterized in that: the Linux memory manager is the memory management subsystem of the Linux operating system; the detection algorithm framework is a container that collects the various detection algorithms; the detection algorithm is collected in the detection algorithm framework and is used to detect whether memory contains errors; the user space interface a is a configuration file with which the user selects a detection algorithm; the error memory collector is a linked list that collects memory containing errors; the user space memory error reporting program is a program that reports memory errors found during detection to the user; the reliable memory pool is a linked list that collects memory found by a detection algorithm to be free of errors and allocates it to application programs; the reliable memory pool timer is a timer that periodically cleans expired memory out of the reliable memory pool; the user space interface b is a configuration file with which the user sets the period of the reliable memory pool timer and the expiration period of memory in the reliable memory pool; all processes in the system refers to all application programs in the system; the application program memory migration timer is a timer that periodically migrates the data and code held in memory that an application program has used for longer than a certain period into recently detected memory; the user space interface c is a user space interface for setting the period of the memory migration timer and the expiration period of memory used by application programs; the main switch is a switch that turns MEMDOG on and off; the user space interface d is a user space interface for operating the MEMDOG main switch.
2. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein the detection algorithm framework is implemented in the Linux kernel and provides an algorithm registration interface for collecting memory detection algorithms, and the detection algorithms collected in the detection algorithm framework can be selected through the user space interface a.
3. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein the error memory collector is a linked list that resides in the kernel and is used to collect memory containing errors.
4. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein execution of the user space memory error reporting program is triggered by the kernel.
5. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein the reliable memory pool is a linked list implemented in kernel space and used to collect memory that has been checked by the memory detection algorithm.
6. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein the reliable memory pool timer periodically checks the memory in the reliable memory pool and releases expired memory, and the period of the reliable memory pool timer and the expiration period of the memory can be set through the user space interface b.
7. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein the memory migration timer periodically checks the memory used by the application programs in the system and migrates the data and code held in memory that has been in use longer than a certain period into recently detected memory, and the period of the memory migration timer and the memory expiration period can be set through the user space interface c.
8. The Linux-based online memory detector (also called MEMDOG) according to claim 1, wherein the main switch turns the online memory detector MEMDOG on and off, and the main switch can be set through the user space interface d.
CN201710351727.8A 2017-05-18 2017-05-18 Linux-based online memory detector MEMDOG Active CN108958993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710351727.8A CN108958993B (en) 2017-05-18 2017-05-18 Linux-based online memory detector MEMDOG

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710351727.8A CN108958993B (en) 2017-05-18 2017-05-18 Linux-based online memory detector MEMDOG

Publications (2)

Publication Number Publication Date
CN108958993A CN108958993A (en) 2018-12-07
CN108958993B true CN108958993B (en) 2021-11-19

Family

ID=64461860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710351727.8A Active CN108958993B (en) 2017-05-18 2017-05-18 Linux-based online memory detector MEMDOG

Country Status (1)

Country Link
CN (1) CN108958993B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111223516B (en) * 2019-12-26 2021-09-07 曙光信息产业(北京)有限公司 RAID card detection method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110044A (en) * 2007-08-28 2008-01-23 中兴通讯股份有限公司 Method and system for internal memory monitoring management
CN102915276A (en) * 2012-09-25 2013-02-06 武汉邮电科学研究院 Memory control method for embedded systems
CN106598871A (en) * 2016-12-29 2017-04-26 山东鲁能智能技术有限公司 Automatic analysis method and system for collapse file under Linux

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9501389B1 (en) * 2015-08-20 2016-11-22 International Business Machines Corporation Test machine management

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MEI: A Light Weight Memory Error Injection Tool for Validating Online Memory Testers; Xiaoqiang Wang et al.; IEEE; 20170116; full text *
MEMDOG: a Linux-based online memory detector; 王小强; Wanfang Data (万方数据); 20180613; full text *

Also Published As

Publication number Publication date
CN108958993A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN100419695C (en) Vectoring process-kill errors to an application program
Bautista-Gomez et al. Unprotected computing: A large-scale study of dram raw error rate on a supercomputer
Schroeder et al. DRAM errors in the wild: a large-scale field study
Hwang et al. Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design
Mukherjee et al. Cache scrubbing in microprocessors: Myth or necessity?
CN111552590B (en) Detection and recovery method and system for memory bit overturning of power secondary equipment
WO2017079454A1 (en) Storage error type determination
US7610523B1 (en) Method and template for physical-memory allocation for implementing an in-system memory test
US20110252276A1 (en) Low-overhead run-time memory leak detection and recovery
US7380169B2 (en) Converting merge buffer system-kill errors to process-kill errors
US20100306489A1 (en) Error management firewall in a multiprocessor computer
CN102272731A (en) Apparatus, system, and method for predicting failures in solid-state storage
CN105224888B (en) A kind of data of magnetic disk array protection system based on safe early warning technology
Siddiqua et al. Analysis and modeling of memory errors from large-scale field data collection
CN112559395B (en) Relay protection device and method based on dual-Soc storage system exception handling mechanism
US10095570B2 (en) Programmable device, error storage system, and electronic system device
US11586496B2 (en) Electronic circuit with integrated SEU monitor
Bottoni et al. Heavy ions test result on a 65nm sparc-v8 radiation-hard microprocessor
Meza Large scale studies of memory, storage, and network failures in a modern data center
CN108958993B (en) Linux-based online memory detector MEMDOG
Radojkovic et al. Towards resilient EU HPC systems: A blueprint
Dweik et al. Reliability-aware exceptions: Tolerating intermittent faults in microprocessor array structures
US7315961B2 (en) Black box recorder using machine check architecture in system management mode
CN105068969B (en) Single particle effect guard system and method for digital signal processing platform framework
CN104167224A (en) Method for reducing DRAM soft error

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant