CN101901174B - Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment - Google Patents

Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment

Info

Publication number
CN101901174B
Authority
CN
China
Legal status
Expired - Fee Related
Application number
CN2010102394264A
Other languages
Chinese (zh)
Other versions
CN101901174A (en)
Inventor
张兴军
董小社
雷济凯
郑豪
刘鹏飞
王恩东
胡雷钧
张东
伍卫国
Current Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Xian Jiaotong University
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Xian Jiaotong University
Application filed by Inspur Beijing Electronic Information Industry Co Ltd and Xian Jiaotong University
Priority to CN2010102394264A
Publication of CN101901174A
Application granted
Publication of CN101901174B
Legal status: Expired - Fee Related

Abstract

The invention discloses a method for enhancing program reliability based on a code-segment multi-replica comparison mechanism. In the method, a kernel thread is started as an initialization thread; the initialization thread obtains the memory information of a process and stores two replicas of the process code segment in physical memory. The process code segment is also divided into memory regions of fixed size and, according to the division result, a corresponding number of kernel threads are started as consistency maintenance threads that maintain, in real time, the consistency between each process code region and its replicas, so that program code segment errors caused by various software and hardware faults are discovered and recovered from promptly. The invention is simple to implement, is transparent to the protected process, and detects program code segment errors efficiently, thereby preventing program crashes and even system failures caused by code segment errors and improving the operational reliability of the program.

Description

Method for improving program reliability based on a code-segment multi-replica comparison mechanism
Technical field
The invention belongs to the field of computers and relates to fault-tolerance technology, in particular to a method for improving program reliability based on a code-segment multi-replica comparison mechanism.
Background art
With the development of the Internet and the growing spread and depth of computer applications, every industry and every kind of application places increasingly high demands on the security of computer systems. For high-end, mission-critical applications in particular, system reliability has become an increasingly prominent problem. In modern enterprises, server downtime is one of the main causes of lost profit. For organizations that must provide uninterrupted service to users or must guarantee information security, ensuring the reliability of the business system is therefore very important. Fault-tolerant computers and related technologies arose in response to this demand; fault-tolerant computing is used to avoid the substantial economic losses caused by server failures.
Protecting the programs that run on a computer system is one of the commonly used fault-tolerance techniques. Computer systems are designed with protection mechanisms for programs. The vast majority of processors provide a virtual memory protection mechanism, and the operating system can restrict write operations on program code by setting specific flag bits, preventing damage by faulty or malicious programs and thereby improving program reliability. This traditional program protection technique nevertheless has shortcomings:
1) Code segment changes caused by transient faults cannot be detected
The virtual memory protection mechanism provided by the processor can only detect the exceptions raised when a program illegally reads or writes a code segment; it cannot detect modifications of the code segment caused by transient faults. For example, a code segment change caused by a memory cell bit flip that results from excessive processor temperature goes undetected.
2) The impact of hardware technology development on computer systems is not considered
Integrated circuit manufacturing processes have moved to the nanometer scale in recent years. Smaller feature sizes, lower supply voltages and higher clock frequencies make processors more sensitive to noise of all kinds, so transient hardware faults occur more easily. These transient faults may modify the code segment of a computer program and may ultimately cause a service failure or even failure of the whole system.
Summary of the invention
In view of the shortcomings of the program protection techniques described in the background art, the object of the present invention is to provide a method for improving program reliability, based on a code-segment multi-replica comparison mechanism, that is simple to implement and effective.
To achieve this object, the present invention adopts the following technical solution:
A method for improving program reliability based on a code-segment multi-replica comparison mechanism, comprising the following steps:
(1) Start a kernel thread in the operating system as the initialization thread. The initialization thread extracts the code segment information of the program to be protected through the memory descriptor mm_struct in the process descriptor task_struct; the code segment information comprises the code segment size, the start logical address and the end logical address;
(2) Based on the code segment information extracted in step (1), the initialization thread creates two replicas of the code segment of the process to be protected; one is called the primary replica and the other the secondary replica;
(3) The initialization thread divides the code segment of the process to be protected into multiple code regions. If the code segment size is an integer multiple of the code region size, all resulting code regions have the same size; otherwise the last code region is smaller than the others;
(4) According to the number of code regions obtained in step (3), the initialization thread creates a corresponding number of kernel threads as consistency maintenance threads;
(5) Each consistency maintenance thread performs consistency maintenance on the code region it is responsible for, as follows:
(a) Initialize the relevant data structures, including loading the information about the code region to be compared and setting the comparison pointer;
(b) Based on the configured execution frequency of the consistency maintenance thread, determine whether the thread should sleep. If the waiting time specified by the execution frequency has not yet elapsed, set the thread state to sleeping; the sleep duration is determined by the execution frequency;
(c) Update the comparison pointer and compare the content of the process code segment and of the primary replica at the current position. If the content matches and data remains to be compared, repeat step (c); if the content matches and all comparisons are complete, go to step (b); otherwise go to step (d);
(d) If the system has other CPUs, stop execution on all other CPUs; if stopping the other CPUs fails, send a warning message to the console;
(e) Determine where the error occurred by comparing the content of the process code segment, the primary replica and the secondary replica at the current position. The possible results and their handling are: a) the process code segment matches the secondary replica but the primary replica does not match the secondary replica: go to step (f); b) the process code segment does not match the secondary replica but the primary replica matches the secondary replica: go to step (g); c) the process code segment does not match the secondary replica and the primary replica does not match the secondary replica: go to step (h);
(f) In this case the primary replica is in error. Overwrite the content of the primary replica at the current position with the content of the secondary replica at the current position to repair the primary replica. If the repair succeeds, record the error and recovery information, resume execution of the other CPUs in the system, and go to step (c). If the repair fails, record the error and recovery information, kill the current process, and end;
(g) In this case the process code segment itself is in error. Overwrite the content of the process code segment at the current position with the content of the primary replica at the current position to repair the process code segment. If the repair succeeds, record the error and recovery information, resume execution of the other CPUs in the system, and go to step (c). If the repair fails, record the error and recovery information, kill the current process, and end;
(h) In this case it cannot be determined whether the process code segment, the primary replica or the secondary replica is in error. Kill the current process, record the error information, and end.
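The three-way comparison in steps (e) to (h) amounts to a majority vote among the process code segment and its primary and secondary replicas (copies). The following is a minimal user-space sketch in Python, not the kernel implementation: `bytearray` objects stand in for memory, comparison is done per byte, and the names `Fault`, `locate_fault` and `repair` are hypothetical.

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"          # code matches primary: nothing to do
    PRIMARY = "primary"    # step (f): the primary replica is wrong
    CODE = "code"          # step (g): the live code segment is wrong
    UNKNOWN = "unknown"    # step (h): the error cannot be localised


def locate_fault(code: int, primary: int, secondary: int) -> Fault:
    """Classify the values found at one position, mirroring step (e)."""
    if code == primary:
        return Fault.NONE      # step (c) would not have escalated
    if code == secondary:
        return Fault.PRIMARY   # code == secondary, primary != secondary
    if primary == secondary:
        return Fault.CODE      # code != secondary, primary == secondary
    return Fault.UNKNOWN       # all three values disagree


def repair(code: bytearray, primary: bytearray,
           secondary: bytearray, pos: int) -> Fault:
    """Repair position `pos` in place when the fault can be localised."""
    fault = locate_fault(code[pos], primary[pos], secondary[pos])
    if fault is Fault.PRIMARY:
        primary[pos] = secondary[pos]   # step (f)
    elif fault is Fault.CODE:
        code[pos] = primary[pos]        # step (g)
    return fault
```

A caller would treat `Fault.UNKNOWN` as the signal to terminate the protected process, as step (h) prescribes; stopping and resuming other CPUs and logging are outside the scope of this sketch.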
The code segment replicas in the above scheme are copies of the code segment of the process to be protected. Two replicas of that code segment are generated and stored at two non-overlapping locations in system memory. The storage locations are determined by the underlying hardware topology: if the hardware is a uniprocessor system or an SMP system, the two replicas are stored in two separate physical memory modules; if the hardware is a NUMA system with multiple physical memory nodes, the two replicas are stored on two separate physical memory nodes.
The code regions in the above scheme are the result of dividing the code segment of the process to be protected. The default size of each code region is 10 pages, and the system user may adjust the region size as needed.
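The division into code regions can be illustrated with a short Python sketch. The 10-page default comes from the text; the 4096-byte page size and the function name `split_into_regions` are assumptions made for illustration only.

```python
PAGE_SIZE = 4096           # assumed page size in bytes
DEFAULT_REGION_PAGES = 10  # default region size stated in the text


def split_into_regions(start, end, region_pages=DEFAULT_REGION_PAGES,
                       page_size=PAGE_SIZE):
    """Return (region_start, region_end) pairs covering [start, end).

    Every region has the same size except possibly the last one,
    which is smaller when the segment size is not an integer
    multiple of the region size.
    """
    if end < start or region_pages <= 0 or page_size <= 0:
        raise ValueError("invalid segment bounds or region size")
    region_size = region_pages * page_size
    return [(lo, min(lo + region_size, end))
            for lo in range(start, end, region_size)]
```

For a 25-page code segment this yields three regions of 10, 10 and 5 pages, so three consistency maintenance threads would be started.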
The execution frequency of each consistency maintenance thread in the above scheme is determined by the execution hotness of the code region it is responsible for, that is, by how often that region is executed. The execution hotness of a code region can be obtained by testing the process to be protected. The hotter a code region, the higher the execution frequency of its consistency maintenance thread; the colder the region, the lower the execution frequency.
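The text only requires that the checking frequency rise with the execution hotness (how often a region executes) of the region; it does not give a formula. The Python sketch below assumes one possible mapping: per-region hit counts from a test run are normalised to the range 0 to 1, and the sleep interval between checks is interpolated linearly between a hypothetical maximum and minimum. The function names and the interval bounds are illustrative assumptions.

```python
def hotness_from_counts(counts):
    """Normalise per-region execution counts to the range [0, 1]."""
    peak = max(counts, default=0)
    if peak == 0:
        return [0.0 for _ in counts]
    return [c / peak for c in counts]


def check_interval(hotness, min_interval=0.01, max_interval=1.0):
    """Map hotness to a sleep interval in seconds (assumed linear).

    Hotter regions get a shorter interval, i.e. a higher checking
    frequency; colder regions get a longer one.
    """
    if not 0.0 <= hotness <= 1.0:
        raise ValueError("hotness must lie in [0, 1]")
    return max_interval - hotness * (max_interval - min_interval)
```

Any monotonically decreasing mapping from hotness to interval would satisfy the description equally well; the linear form is chosen here only for simplicity.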
The method of the present invention for improving program reliability based on a code-segment multi-replica comparison mechanism keeps two replicas of the process code segment in physical memory and starts kernel threads that maintain consistency between the process code segment and the replicas in real time, so that program code segment errors caused by various hardware and software faults are detected and recovered from promptly. The method is simple to implement, is transparent to the protected process, and detects program code segment errors effectively, preventing program crashes and even system failures caused by code segment errors and improving the reliability of program execution.
Description of drawings
Fig. 1 is a workflow diagram of the method of the present invention for improving program reliability based on a code-segment multi-replica comparison mechanism.
Fig. 2 is a flowchart of the code region consistency maintenance performed by the consistency maintenance threads in Fig. 1.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the method for improving program reliability based on a code-segment multi-replica comparison mechanism comprises the following steps:
(1) Start a kernel thread in the operating system as the initialization thread. The initialization thread extracts the code segment information of the program to be protected through the memory descriptor mm_struct in the process descriptor task_struct; the code segment information comprises the code segment size, the start logical address and the end logical address;
(2) Based on the code segment information extracted in step (1), the initialization thread creates two replicas of the code segment of the process to be protected; one is called the primary replica and the other the secondary replica;
(3) The initialization thread divides the code segment of the process to be protected into multiple code regions. If the code segment size is an integer multiple of the code region size, all resulting code regions have the same size; otherwise the last code region is smaller than the others;
(4) According to the number of code regions obtained in step (3), the initialization thread creates a corresponding number of kernel threads as consistency maintenance threads;
(5) Each consistency maintenance thread performs consistency maintenance on the code region it is responsible for.
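Steps (1) to (5) can be pictured with a user-space sketch in Python. It is an analogy under stated assumptions, not kernel code: a `bytearray` stands in for the process code segment, ordinary `threading.Thread` objects stand in for kernel threads, and the names `Protected`, `initialise` and `start_maintainers` are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass
class Protected:
    code: bytearray       # stand-in for the live process code segment
    primary: bytearray    # primary replica, step (2)
    secondary: bytearray  # secondary replica, step (2)
    regions: list         # (start, end) offsets, step (3)


def initialise(code: bytearray, region_size: int) -> Protected:
    """Create two independent replicas and split the segment into regions."""
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    regions = [(lo, min(lo + region_size, len(code)))
               for lo in range(0, len(code), region_size)]
    return Protected(code, bytearray(code), bytearray(code), regions)


def start_maintainers(state: Protected, worker):
    """Start one maintenance thread per region, steps (4) and (5).

    `worker(state, lo, hi)` is supplied by the caller and performs the
    consistency maintenance for the region [lo, hi).
    """
    threads = [threading.Thread(target=worker, args=(state, lo, hi),
                                daemon=True)
               for lo, hi in state.regions]
    for thread in threads:
        thread.start()
    return threads
```

In the kernel setting described by the text, the segment bounds would come from mm_struct rather than from the length of a buffer, and the replicas would be placed in separate physical memory modules or NUMA nodes.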
As shown in Fig. 2, the code region consistency maintenance performed by the consistency maintenance threads in Fig. 1 proceeds as follows:
(a) Initialize the relevant data structures, including loading the information about the code region to be compared and setting the comparison pointer;
(b) Based on the configured execution frequency of the consistency maintenance thread, determine whether the thread should sleep. If the waiting time specified by the execution frequency has not yet elapsed, set the thread state to sleeping; the sleep duration is determined by the execution frequency;
(c) Update the comparison pointer and compare the content of the process code segment and of the primary replica at the current position. If the content matches and data remains to be compared, repeat step (c); if the content matches and all comparisons are complete, go to step (b); otherwise go to step (d);
(d) If the system has other CPUs, stop execution on all other CPUs; if stopping the other CPUs fails, send a warning message to the console;
(e) Determine where the error occurred by comparing the content of the process code segment, the primary replica and the secondary replica at the current position. The possible results and their handling are: a) the process code segment matches the secondary replica but the primary replica does not match the secondary replica: go to step (f); b) the process code segment does not match the secondary replica but the primary replica matches the secondary replica: go to step (g); c) the process code segment does not match the secondary replica and the primary replica does not match the secondary replica: go to step (h);
(f) In this case the primary replica is in error. Overwrite the content of the primary replica at the current position with the content of the secondary replica at the current position to repair the primary replica. If the repair succeeds, record the error and recovery information, resume execution of the other CPUs in the system, and go to step (c). If the repair fails, record the error and recovery information, kill the current process, and end;
(g) In this case the process code segment itself is in error. Overwrite the content of the process code segment at the current position with the content of the primary replica at the current position to repair the process code segment. If the repair succeeds, record the error and recovery information, resume execution of the other CPUs in the system, and go to step (c). If the repair fails, record the error and recovery information, kill the current process, and end;
(h) In this case it cannot be determined whether the process code segment, the primary replica or the secondary replica is in error. Kill the current process, record the error information, and end.
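The control flow of steps (b) and (c) in Fig. 2 can be sketched as a scan loop in Python. This is a simplified user-space illustration: the function names, the `passes` limit (a real maintenance thread would loop indefinitely), the byte-wise comparison and the injectable `sleep` parameter are all assumptions, and `on_mismatch` is a caller-supplied hook standing in for steps (d) to (h).

```python
import time


def first_mismatch(code, primary, lo, hi):
    """Return the first offset in [lo, hi) where code and primary differ."""
    for pos in range(lo, hi):
        if code[pos] != primary[pos]:
            return pos
    return None


def maintain(code, primary, lo, hi, interval, passes, on_mismatch,
             sleep=time.sleep):
    """Run `passes` comparison passes over the region [lo, hi)."""
    for _ in range(passes):
        sleep(interval)                 # step (b): wait out the interval
        pos = lo                        # step (a): reset the pointer
        while pos < hi:
            bad = first_mismatch(code, primary, pos, hi)   # step (c)
            if bad is None:
                break                   # region is consistent
            on_mismatch(bad)            # hand over to steps (d) to (h)
            pos = bad + 1               # then continue with step (c)
```

Passing a no-op `sleep` makes the loop easy to exercise in a test, while the default `time.sleep` gives the periodic behaviour described in step (b).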

Claims (4)

1. A method for improving program reliability based on a code-segment multi-replica comparison mechanism, characterized in that it comprises the following steps:
(1) Start a kernel thread in the operating system as the initialization thread. The initialization thread extracts the code segment information of the program to be protected through the memory descriptor mm_struct in the process descriptor task_struct; the code segment information comprises the code segment size, the start logical address and the end logical address;
(2) Based on the code segment information extracted in step (1), the initialization thread creates two replicas of the code segment of the process to be protected; one is called the primary replica and the other the secondary replica;
(3) The initialization thread divides the code segment of the process to be protected into multiple code regions. If the code segment size is an integer multiple of the code region size, all resulting code regions have the same size; otherwise the last code region is smaller than the others;
(4) According to the number of code regions obtained in step (3), the initialization thread creates a corresponding number of kernel threads as consistency maintenance threads;
(5) Each consistency maintenance thread performs consistency maintenance on the code region it is responsible for, as follows:
(a) Initialize the relevant data structures, including loading the information about the code region to be compared and setting the comparison pointer;
(b) Based on the configured execution frequency of the consistency maintenance thread, determine whether the thread should sleep. If the waiting time specified by the execution frequency has not yet elapsed, set the thread state to sleeping; the sleep duration is determined by the execution frequency;
(c) Update the comparison pointer and compare the content of the process code segment and of the primary replica at the current position. If the content matches and data remains to be compared, repeat step (c); if the content matches and all comparisons are complete, go to step (b); otherwise go to step (d);
(d) If the system has other CPUs, stop execution on all other CPUs; if stopping the other CPUs fails, send a warning message to the console;
(e) Determine where the error occurred by comparing the content of the process code segment, the primary replica and the secondary replica at the current position. The possible results and their handling are: a) the process code segment matches the secondary replica but the primary replica does not match the secondary replica: go to step (f); b) the process code segment does not match the secondary replica but the primary replica matches the secondary replica: go to step (g); c) the process code segment does not match the secondary replica and the primary replica does not match the secondary replica: go to step (h);
(f) In this case the primary replica is in error. Overwrite the content of the primary replica at the current position with the content of the secondary replica at the current position to repair the primary replica. If the repair succeeds, record the error and recovery information, resume execution of the other CPUs in the system, and go to step (c). If the repair fails, record the error and recovery information, kill the current process, and end;
(g) In this case the process code segment itself is in error. Overwrite the content of the process code segment at the current position with the content of the primary replica at the current position to repair the process code segment. If the repair succeeds, record the error and recovery information, resume execution of the other CPUs in the system, and go to step (c). If the repair fails, record the error and recovery information, kill the current process, and end;
(h) In this case it cannot be determined whether the process code segment, the primary replica or the secondary replica is in error. Kill the current process, record the error information, and end.
2. The method for improving program reliability based on a code-segment multi-replica comparison mechanism according to claim 1, characterized in that the code segment replicas are copies of the code segment of the process to be protected; two replicas of that code segment are generated and stored at two non-overlapping locations in system memory; the storage locations are determined by the underlying hardware topology as follows: if the hardware is a uniprocessor system or an SMP system, the two replicas are stored in two separate physical memory modules; if the hardware is a NUMA system with multiple physical memory nodes, the two replicas are stored on two separate physical memory nodes.
3. The method for improving program reliability based on a code-segment multi-replica comparison mechanism according to claim 1, characterized in that the code regions are the result of dividing the code segment of the process to be protected; the default size of each code region is 10 pages, and the system user may adjust the region size as needed.
4. The method for improving program reliability based on a code-segment multi-replica comparison mechanism according to claim 1, characterized in that the execution frequency of each consistency maintenance thread is determined by the execution hotness of the code region it is responsible for; the execution hotness of a code region can be obtained by testing the process to be protected; the hotter a code region, the higher the execution frequency of its consistency maintenance thread; the colder the region, the lower the execution frequency.
CN2010102394264A 2010-07-28 2010-07-28 Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment Expired - Fee Related CN101901174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102394264A CN101901174B (en) 2010-07-28 2010-07-28 Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment


Publications (2)

Publication Number Publication Date
CN101901174A CN101901174A (en) 2010-12-01
CN101901174B true CN101901174B (en) 2012-07-18

Family

ID=43226724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102394264A Expired - Fee Related CN101901174B (en) 2010-07-28 2010-07-28 Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment

Country Status (1)

Country Link
CN (1) CN101901174B (en)




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120718

Termination date: 20150728

EXPY Termination of patent right or utility model