CN101901174B - Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment - Google Patents
- Publication number
- CN101901174B (application number CN2010102394264A)
- Authority
- CN
- China
- Prior art keywords
- code segment
- code
- copy
- execution
- thread
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention discloses a method for improving program reliability based on a code-segment multi-copy comparison mechanism. In the method, a kernel thread is started as an initialization thread. The initialization thread obtains the memory information of the process, stores two copies of the process code segment in physical memory, and divides the process code segment into memory regions of fixed size. According to the result of the division, a corresponding number of kernel threads are started as consistency maintenance threads, which maintain the consistency between each code region of the process and its copies in real time, so that code segment errors caused by software and hardware faults are detected and recovered promptly. The invention is simple to implement, is transparent to the protected process, and detects code segment errors efficiently, thereby preventing program crashes and even system failure caused by code segment errors and improving the reliability of program execution.
Description
Technical field
The invention belongs to the field of computers and relates to fault-tolerance technology, and in particular to a method for improving program reliability based on a code-segment multi-copy comparison mechanism.
Background art
With the development of the Internet and the ever wider and deeper adoption of computer applications, every industry and every kind of application places increasingly high demands on the security of computer systems. For high-end, business-critical applications in particular, system reliability has become an increasingly prominent problem. In modern enterprises, server downtime is one of the main causes of lost profit. For organizations that must provide uninterrupted service to users or must guarantee information security, ensuring the reliability of the business system is therefore very important. Fault-tolerant computers and related technologies arose to meet this demand: fault-tolerant computing is used to avoid the economic losses, running into the tens of thousands, that server failures cause.
Protecting the programs that run on a computer system is one of the commonly used fault-tolerance techniques. Computer systems are designed with protection mechanisms for programs. The vast majority of processors provide a virtual memory protection mechanism: by setting specific flag bits, the operating system can restrict write operations to program code, preventing damage by faulty or malicious programs and thereby improving program reliability. This traditional program protection technique, however, has the following shortcomings:
1) Code segment changes caused by transient faults cannot be detected
The virtual memory protection mechanism provided by the processor can only detect the exceptions raised when a program illegally reads or writes the code segment; it cannot detect modifications of the code segment caused by transient faults. For example, a code segment change caused by a bit flip in a memory cell due to excessive processor temperature goes undetected.
2) The impact of hardware technology development on computer systems is not considered
In recent years integrated circuit manufacturing has progressed to the nanometer scale. Shrinking feature sizes, lower supply voltages, and rising clock frequencies make processors more sensitive to all kinds of noise, so hardware transient faults occur more easily. These transient faults may modify a program's code segment and may ultimately cause service failure or even failure of the whole system.
Summary of the invention
In view of the shortcomings of the program protection techniques described in the background art, the object of the present invention is to provide a method for improving program reliability, based on a code-segment multi-copy comparison mechanism, that is simple to implement and effective.
To achieve this object, the present invention adopts the following technical solution:
A method for improving program reliability based on a code-segment multi-copy comparison mechanism, comprising the following steps:
(1) Start a kernel thread in the operating system to serve as the initialization thread. The initialization thread extracts the code segment information of the program to be protected through the memory descriptor mm_struct in the process descriptor task_struct; the code segment information comprises the code segment size, the start logical address, and the end logical address;
(2) Based on the code segment information extracted in step (1), the initialization thread creates two copies of the code segment of the process to be protected; one is called the primary copy and the other the secondary copy;
(3) The initialization thread partitions the code segment of the process to be protected into multiple code regions. If the code segment size is an integer multiple of the code region size, all resulting regions have the same size; otherwise the last region is smaller than the others;
(4) According to the number of code regions produced in step (3), the initialization thread creates the same number of kernel threads to serve as consistency maintenance threads;
(5) Each consistency maintenance thread performs consistency maintenance on the code region it is responsible for, as follows:
(a) Initialize the relevant data structures, including loading the information about the code region to be compared and setting the comparison pointer;
(b) Decide, from the configured execution frequency of the consistency maintenance thread, whether the thread should sleep. If the waiting time prescribed by the execution frequency has not yet elapsed, set the thread state to sleeping; the sleep duration is determined by the execution frequency;
(c) Advance the comparison pointer and compare the contents of the process code segment and the primary copy at the current position. If the contents match and data remain to be compared, repeat step (c); if the contents match and all comparisons are complete, go to step (b); otherwise go to step (d);
(d) If the system has other CPUs, stop execution on all other CPUs; if stopping the other CPUs fails, send a warning message to the console;
(e) Locate the error by comparing the contents of the process code segment, the primary copy, and the secondary copy at the current position. The possible results and their handling are: a) the process code segment matches the secondary copy but the primary copy does not match the secondary copy: go to step (f); b) the process code segment does not match the secondary copy but the primary copy matches the secondary copy: go to step (g); c) the process code segment does not match the secondary copy and the primary copy does not match the secondary copy either: go to step (h);
(f) The primary copy is in error. Overwrite the content at the current position of the primary copy with the content at the current position of the secondary copy to recover from the error. If recovery succeeds, record the error and recovery information, resume execution on the other CPUs in the system, and go to step (c). If recovery fails, record the error and recovery information, kill the current process, and finish;
(g) The process code segment itself is in error. Overwrite the content at the current position of the process code segment with the content at the current position of the primary copy to recover from the error. If recovery succeeds, record the error and recovery information, resume execution on the other CPUs in the system, and go to step (c). If recovery fails, record the error and recovery information, kill the current process, and finish;
(h) It cannot be determined whether the process code segment, the primary copy, or the secondary copy is in error. Kill the current process, record the error information, and finish.
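The decision made in step (e) can be sketched as a small table over three values. This is a user-space illustration only, not the kernel implementation the patent describes; the names `Verdict` and `classify` are hypothetical, and comparing one byte at a time is an assumption, since the text does not state the comparison granularity.

```python
from enum import Enum


class Verdict(Enum):
    CONSISTENT = "no mismatch found in step (c)"
    PRIMARY_BAD = "primary copy in error, step (f)"
    CODE_BAD = "process code segment in error, step (g)"
    UNDECIDABLE = "error cannot be located, step (h)"


def classify(code: int, primary: int, secondary: int) -> Verdict:
    """Classify one position of the code segment and its two copies."""
    if code == primary:
        # Step (c) only compares the code segment with the primary copy.
        return Verdict.CONSISTENT
    if code == secondary:
        # Case a): code segment and secondary copy agree, primary differs.
        return Verdict.PRIMARY_BAD
    if primary == secondary:
        # Case b): both copies agree, the code segment differs.
        return Verdict.CODE_BAD
    # Case c): all three values differ.
    return Verdict.UNDECIDABLE
```

Because step (c) as described compares the code segment only with the primary copy, a position where just the secondary copy differs is classified as consistent in this sketch.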
The code segment copy in the above solution is a copy of the code segment of the process to be protected. Two copies of that code segment are created and stored at two non-overlapping locations in system memory. The storage locations of the two copies are determined by the underlying hardware topology as follows: if the hardware is a uniprocessor system or an SMP system, the two copies are stored in two separate physical memory modules; if the hardware is a NUMA system with multiple physical memory nodes, the two copies are stored on two different physical memory nodes.
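The placement rule can be illustrated with a short sketch. It assumes the caller already has a list of candidate locations (memory modules on uniprocessor or SMP hardware, memory nodes on NUMA hardware); the function name, the topology labels, and the string identifiers are hypothetical and do not come from the patent.

```python
from __future__ import annotations


def choose_replica_locations(topology: str, locations: list[str]) -> tuple[str, str]:
    """Pick two distinct locations for the primary and secondary copies.

    `locations` names physical memory modules when topology is "UP" or
    "SMP", and physical memory nodes when topology is "NUMA".
    """
    if topology not in {"UP", "SMP", "NUMA"}:
        raise ValueError(f"unknown topology: {topology!r}")
    distinct = list(dict.fromkeys(locations))  # drop duplicates, keep order
    if len(distinct) < 2:
        raise ValueError("two distinct memory modules or nodes are required")
    return distinct[0], distinct[1]
```

Taking the first two distinct entries is an arbitrary choice for the sketch; the text only requires that the two copies end up in different modules or nodes.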
The code regions in the above solution are the result of partitioning the code segment of the process to be protected. The default size of each code region is 10 pages, and the system user may adjust the region size as needed.
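A minimal sketch of this partitioning follows. The 10-page default is from the text; the 4096-byte page size and the half-open `(start, end)` address pairs are assumptions made for illustration.

```python
from __future__ import annotations

PAGE_SIZE = 4096           # assumed page size in bytes
DEFAULT_REGION_PAGES = 10  # default region size stated in the text


def partition(start: int, end: int,
              region_pages: int = DEFAULT_REGION_PAGES,
              page_size: int = PAGE_SIZE) -> list[tuple[int, int]]:
    """Split the address range [start, end) into fixed-size code regions.

    Every region spans region_pages pages, except possibly the last one,
    which is smaller when the segment size is not an exact multiple.
    """
    if end < start or region_pages <= 0 or page_size <= 0:
        raise ValueError("invalid range or region size")
    region_size = region_pages * page_size
    return [(lo, min(lo + region_size, end))
            for lo in range(start, end, region_size)]
```

For a 25-page segment this yields two 10-page regions followed by one 5-page region, which also fixes the number of consistency maintenance threads at three.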
The execution frequency of each consistency maintenance thread in the above solution is determined by the execution heat of the code region it is responsible for. The execution heat of a code region can be obtained by testing the process to be protected. The hotter a code region, the higher the execution frequency of its consistency maintenance thread; the colder the region, the lower the execution frequency.
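The text fixes only the direction of the relationship between heat and frequency, not a formula. The sketch below assumes one possible mapping: heat is a non-negative execution count per region obtained from profiling, and the interval between checks is interpolated linearly between a hypothetical minimum and maximum. The function name and both interval bounds are assumptions.

```python
from __future__ import annotations


def check_intervals(heat: list[float],
                    min_interval: float = 0.1,
                    max_interval: float = 5.0) -> list[float]:
    """Map per-region execution heat to a sleep interval in seconds.

    The hottest region is checked every min_interval seconds, a region
    that never executed every max_interval seconds.
    """
    hottest = max(heat, default=0.0)
    if hottest <= 0:
        return [max_interval] * len(heat)
    span = max_interval - min_interval
    return [min_interval + (1.0 - h / hottest) * span for h in heat]
```

A shorter interval corresponds to a higher execution frequency, so hotter regions are compared more often, as the paragraph above requires.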
The method of the present invention stores two copies of the process code segment in physical memory and starts kernel threads that maintain the consistency between the process code segment and its copies in real time, so that program code segment errors caused by hardware and software faults are detected and recovered promptly. The method is simple to implement, is transparent to the protected process, and detects code segment errors effectively; it prevents the program crashes and even system failures that code segment errors cause, and thus improves the reliability of program execution.
Description of drawings
Fig. 1 is a workflow diagram of the method of the present invention for improving program reliability based on a code-segment multi-copy comparison mechanism.
Fig. 2 is a flowchart of the code region consistency maintenance performed by the consistency maintenance thread in Fig. 1.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the method for improving program reliability based on a code-segment multi-copy comparison mechanism comprises the following steps:
(1) Start a kernel thread in the operating system to serve as the initialization thread. The initialization thread extracts the code segment information of the program to be protected through the memory descriptor mm_struct in the process descriptor task_struct; the code segment information comprises the code segment size, the start logical address, and the end logical address;
(2) Based on the code segment information extracted in step (1), the initialization thread creates two copies of the code segment of the process to be protected; one is called the primary copy and the other the secondary copy;
(3) The initialization thread partitions the code segment of the process to be protected into multiple code regions. If the code segment size is an integer multiple of the code region size, all resulting regions have the same size; otherwise the last region is smaller than the others;
(4) According to the number of code regions produced in step (3), the initialization thread creates the same number of kernel threads to serve as consistency maintenance threads;
(5) Each consistency maintenance thread performs consistency maintenance on the code region it is responsible for.
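Steps (2) to (4) can be modelled in user space as follows. This is a simulation under stated assumptions, not kernel code: a `bytearray` stands in for the code segment that the initialization thread would locate through `mm_struct`, ordinary Python threads stand in for kernel threads, and `initialize` and `worker` are hypothetical names.

```python
import threading


def initialize(code: bytearray, region_size: int, worker):
    """Create two copies, partition the segment, start one thread per region.

    `worker(code, primary, secondary, lo, hi)` is the caller-supplied
    consistency maintenance routine for the region [lo, hi).
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    # Step (2): two independent copies of the code segment.
    primary = bytearray(code)
    secondary = bytearray(code)
    # Step (3): fixed-size regions; the last one may be smaller.
    regions = [(lo, min(lo + region_size, len(code)))
               for lo in range(0, len(code), region_size)]
    # Step (4): one maintenance thread per region.
    threads = [threading.Thread(target=worker,
                                args=(code, primary, secondary, lo, hi),
                                daemon=True)
               for lo, hi in regions]
    for t in threads:
        t.start()
    return primary, secondary, regions, threads
```

The function returns the copies and thread handles so that a caller can join the threads or inspect the copies; how the real initialization thread tracks them is not specified in the text.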
As shown in Fig. 2, the code region consistency maintenance performed by the consistency maintenance thread of Fig. 1 proceeds as follows:
(a) Initialize the relevant data structures, including loading the information about the code region to be compared and setting the comparison pointer;
(b) Decide, from the configured execution frequency of the consistency maintenance thread, whether the thread should sleep. If the waiting time prescribed by the execution frequency has not yet elapsed, set the thread state to sleeping; the sleep duration is determined by the execution frequency;
(c) Advance the comparison pointer and compare the contents of the process code segment and the primary copy at the current position. If the contents match and data remain to be compared, repeat step (c); if the contents match and all comparisons are complete, go to step (b); otherwise go to step (d);
(d) If the system has other CPUs, stop execution on all other CPUs; if stopping the other CPUs fails, send a warning message to the console;
(e) Locate the error by comparing the contents of the process code segment, the primary copy, and the secondary copy at the current position. The possible results and their handling are: a) the process code segment matches the secondary copy but the primary copy does not match the secondary copy: go to step (f); b) the process code segment does not match the secondary copy but the primary copy matches the secondary copy: go to step (g); c) the process code segment does not match the secondary copy and the primary copy does not match the secondary copy either: go to step (h);
(f) The primary copy is in error. Overwrite the content at the current position of the primary copy with the content at the current position of the secondary copy to recover from the error. If recovery succeeds, record the error and recovery information, resume execution on the other CPUs in the system, and go to step (c). If recovery fails, record the error and recovery information, kill the current process, and finish;
(g) The process code segment itself is in error. Overwrite the content at the current position of the process code segment with the content at the current position of the primary copy to recover from the error. If recovery succeeds, record the error and recovery information, resume execution on the other CPUs in the system, and go to step (c). If recovery fails, record the error and recovery information, kill the current process, and finish;
(h) It cannot be determined whether the process code segment, the primary copy, or the secondary copy is in error. Kill the current process, record the error information, and finish.
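One pass of steps (c) to (h) over a single region might look like the sketch below. It is a user-space model under several assumptions: bytes are compared one at a time, stopping and resuming the other CPUs in step (d) is not modelled, the sleep of step (b) is left to the caller, and killing the process in step (h) is represented by raising the hypothetical `Unrecoverable` exception.

```python
class Unrecoverable(Exception):
    """Raised when step (h) applies: the faulty copy cannot be identified."""


def scan_region(code: bytearray, primary: bytearray, secondary: bytearray,
                lo: int, hi: int) -> list:
    """Compare and repair the region [lo, hi); return the recorded log."""
    log = []
    for pos in range(lo, hi):              # step (c): advance the pointer
        if code[pos] == primary[pos]:
            continue                       # contents match, keep comparing
        # Step (d) would stop the other CPUs here (not modelled).
        if code[pos] == secondary[pos]:    # step (f): primary copy in error
            primary[pos] = secondary[pos]
            log.append(f"repaired primary copy at offset {pos}")
        elif primary[pos] == secondary[pos]:  # step (g): code segment in error
            code[pos] = primary[pos]
            log.append(f"repaired code segment at offset {pos}")
        else:                              # step (h): cannot locate the error
            raise Unrecoverable(pos)
    return log
```

A caller would run `scan_region` in a loop, sleeping between passes for the interval chosen for that region, and treat `Unrecoverable` as the signal to terminate the protected process.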
Claims (4)
1. A method for improving program reliability based on a code-segment multi-copy comparison mechanism, characterized in that it comprises the following steps:
(1) Start a kernel thread in the operating system to serve as the initialization thread. The initialization thread extracts the code segment information of the program to be protected through the memory descriptor mm_struct in the process descriptor task_struct; the code segment information comprises the code segment size, the start logical address, and the end logical address;
(2) Based on the code segment information extracted in step (1), the initialization thread creates two copies of the code segment of the process to be protected; one is called the primary copy and the other the secondary copy;
(3) The initialization thread partitions the code segment of the process to be protected into multiple code regions. If the code segment size is an integer multiple of the code region size, all resulting regions have the same size; otherwise the last region is smaller than the others;
(4) According to the number of code regions produced in step (3), the initialization thread creates the same number of kernel threads to serve as consistency maintenance threads;
(5) Each consistency maintenance thread performs consistency maintenance on the code region it is responsible for, as follows:
(a) Initialize the relevant data structures, including loading the information about the code region to be compared and setting the comparison pointer;
(b) Decide, from the configured execution frequency of the consistency maintenance thread, whether the thread should sleep. If the waiting time prescribed by the execution frequency has not yet elapsed, set the thread state to sleeping; the sleep duration is determined by the execution frequency;
(c) Advance the comparison pointer and compare the contents of the process code segment and the primary copy at the current position. If the contents match and data remain to be compared, repeat step (c); if the contents match and all comparisons are complete, go to step (b); otherwise go to step (d);
(d) If the system has other CPUs, stop execution on all other CPUs; if stopping the other CPUs fails, send a warning message to the console;
(e) Locate the error by comparing the contents of the process code segment, the primary copy, and the secondary copy at the current position. The possible results and their handling are: a) the process code segment matches the secondary copy but the primary copy does not match the secondary copy: go to step (f); b) the process code segment does not match the secondary copy but the primary copy matches the secondary copy: go to step (g); c) the process code segment does not match the secondary copy and the primary copy does not match the secondary copy either: go to step (h);
(f) The primary copy is in error. Overwrite the content at the current position of the primary copy with the content at the current position of the secondary copy to recover from the error. If recovery succeeds, record the error and recovery information, resume execution on the other CPUs in the system, and go to step (c). If recovery fails, record the error and recovery information, kill the current process, and finish;
(g) The process code segment itself is in error. Overwrite the content at the current position of the process code segment with the content at the current position of the primary copy to recover from the error. If recovery succeeds, record the error and recovery information, resume execution on the other CPUs in the system, and go to step (c). If recovery fails, record the error and recovery information, kill the current process, and finish;
(h) It cannot be determined whether the process code segment, the primary copy, or the secondary copy is in error. Kill the current process, record the error information, and finish.
2. The method for improving program reliability based on a code-segment multi-copy comparison mechanism according to claim 1, characterized in that the code segment copy is a copy of the code segment of the process to be protected; two copies of that code segment are created and stored at two non-overlapping locations in system memory; the storage locations of the two copies are determined by the underlying hardware topology as follows: if the hardware is a uniprocessor system or an SMP system, the two copies are stored in two separate physical memory modules; if the hardware is a NUMA system with multiple physical memory nodes, the two copies are stored on two different physical memory nodes.
3. The method for improving program reliability based on a code-segment multi-copy comparison mechanism according to claim 1, characterized in that the code regions are the result of partitioning the code segment of the process to be protected; the default size of each code region is 10 pages, and the system user may adjust the region size as needed.
4. The method for improving program reliability based on a code-segment multi-copy comparison mechanism according to claim 1, characterized in that the execution frequency of each consistency maintenance thread is determined by the execution heat of the code region it is responsible for; the execution heat of a code region can be obtained by testing the process to be protected; the hotter a code region, the higher the execution frequency of its consistency maintenance thread; the colder the region, the lower the execution frequency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102394264A CN101901174B (en) | 2010-07-28 | 2010-07-28 | Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102394264A CN101901174B (en) | 2010-07-28 | 2010-07-28 | Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101901174A (en) | 2010-12-01 |
CN101901174B (en) | 2012-07-18 |
Family
ID=43226724
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102394264A (CN101901174B, Expired - Fee Related) | 2010-07-28 | 2010-07-28 | Method for enhancing reliability of program of multi-replica contrast mechanism based on code segment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101901174B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN101901174A (en) | 2010-12-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120718; Termination date: 20150728 |
EXPY | Termination of patent right or utility model | |