CN100530106C - Method for implementing kernel of multi-machine fault-tolerant system - Google Patents

Info

Publication number
CN100530106C
Authority
CN
China
Prior art keywords
data
task
carry out
scheduler task
synchronization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB200610161298XA
Other languages
Chinese (zh)
Other versions
CN101000561A (en)
Inventor
苗刚
陈文赛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guorui Defense System Co Ltd
Original Assignee
CETC 14 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 14 Research Institute
Priority to CNB200610161298XA
Publication of CN101000561A
Application granted
Publication of CN100530106C
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a method for implementing the kernel of a multi-machine fault-tolerant system. A middleware layer is formed between the application program and the operating system and manages system resources and the application program in a multitasking manner, so that all input and output of the application program passes through the middleware. Synchronization and data exchange between the arithmetic units, as well as output voting, are implemented in the middleware: the interface to the application program is provided by creating and managing buffers, synchronization between the arithmetic units by multitask management and data exchange, and output voting by comparing output data. Because the fault-tolerant kernel is implemented as software middleware between the application program and the operating system, no complex hardware circuit is needed; multitask management and data exchange give a higher synchronization density; output voting by comparison of output data requires no additional output module; and fault-tolerant structures of various forms can be realized by configuring the middleware.

Description

Method for implementing the kernel of a multi-machine fault-tolerant system
Technical field
The present invention relates to a method for implementing the kernel of a multi-machine fault-tolerant system in a computer-based automatic control system.
Background art
In some automatic control fields that bear on the safety of people and major equipment, the system is required to be highly reliable: it must guarantee safety not only during normal operation but also, when a fault occurs, by following the fail-safe principle. High-reliability, high-safety multi-machine fault-tolerant systems based on fault-tolerance techniques arose to meet this need. Taking a two-out-of-three computer system as an example, three computers are configured as three arithmetic units, and a result is considered correct and is output only when two or more of the three results agree.
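The two-out-of-three rule described above can be illustrated with a minimal sketch. The function name `vote` and the `quorum` parameter are illustrative assumptions, not terms from the patent, and the sketch assumes results can be compared for exact equality.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(results: Sequence[Hashable], quorum: int = 2) -> Optional[Hashable]:
    """Return the result reported by at least `quorum` units, or None.

    With three results and quorum=2 this is a two-out-of-three vote.
    """
    if not results:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None
```

A caller would treat a `None` return as "no agreement" and withhold the output, which is one way to follow the fail-safe principle mentioned in the text.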
The core technique of multi-machine fault tolerance is synchronization: only when the arithmetic units are synchronized can they output results computed in the same state for voting. There are two synchronization modes. One is tight synchronization, in which a hardware synchronizer forces the arithmetic units to run in lockstep on a common beat. This requires additional synchronization hardware and places high technical demands on its implementation. The other is loose synchronization, in which software brings the arithmetic units, each running on its own clock, into an approximately synchronized state. Most loose synchronization mechanisms combine timed synchronization with output synchronization: after the system starts, synchronization messages are sent at fixed intervals, and at the output end a device waits for the outputs of the arithmetic units and votes on them once they have arrived. Chinese patent 00109094 proposes timed synchronization by a program that periodically sends synchronization frames, with several independent output units (OCM) holding a fixed executable program connected in series on the field data line, so that three arithmetic units are synchronized and voted on. This approach is implemented mainly in software and spares the system complex hardware circuits. However, because voting takes place only at the external output units and the three arithmetic units exchange no data while executing the application, system synchronization precision is reduced. Additional output units are also required.
Summary of the invention
The object of the invention is to overcome the technical shortcomings of the prior art by providing a method that implements the kernel of a multi-machine fault-tolerant system in software. The method offers higher system synchronization precision, so that system faults are detected and handled as early as possible. It is implemented entirely in software on a real-time multitasking system and uses standard POSIX and operating system interfaces. It requires no additional hardware such as input/output units, and it is highly portable.
The basic principle by which the invention achieves this object is as follows. On top of a real-time multitasking operating system, a middleware layer is formed between the application program and the operating system. The middleware manages system resources, the application program, and so on in a multitasking manner, so that all input and output of the application program passes through the middleware. The interface between the middleware and the application program is implemented through buffer management, and the interface between the middleware and the operating system is implemented through standard POSIX and operating system interfaces. Synchronization and data exchange between the arithmetic units (also called computing units), as well as output voting, are implemented inside the middleware: buffers are created and managed to provide the interface to the application program, multitask management and data exchange provide synchronization between the arithmetic units, and comparison of output data provides output voting.
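The patent does not specify the layout of the buffers between application and middleware. As a rough sketch only, assuming one input queue and one output queue per application, the buffer interface might look like the following; the class name `AppBuffers` and all method names are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class AppBuffers:
    """Hypothetical buffer pair through which all application I/O passes."""

    def __init__(self) -> None:
        self._inputs: Deque[bytes] = deque()
        self._outputs: Deque[bytes] = deque()

    # Middleware side: feed synchronized input, collect results for voting.
    def push_input(self, data: bytes) -> None:
        self._inputs.append(data)

    def pop_output(self) -> Optional[bytes]:
        return self._outputs.popleft() if self._outputs else None

    # Application side: the application never touches devices directly.
    def read_input(self) -> Optional[bytes]:
        return self._inputs.popleft() if self._inputs else None

    def write_output(self, data: bytes) -> None:
        self._outputs.append(data)
```

The point of the design is that the application only sees `read_input` and `write_output`, so the middleware can synchronize inputs before the application reads them and compare outputs before anything leaves the unit.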
On this basis, the technical solution adopted by the invention is a method for implementing the kernel of a multi-machine fault-tolerant system in which a software middleware layer is formed between the application program and the operating system and manages system resources and the application program in a multitasking manner. The middleware is divided into an application task, a master scheduler task, a computing unit status monitoring task, and a communication channel management task, where:
Application task: the task that executes the application program. It is created by the master scheduler task and either finishes on its own or is destroyed by the master scheduler task;
Master scheduler task (the scheduler task of the master channel is also referred to as the scheduler task of the master channel's computing unit): responsible for synchronizing the system's computing units and for data distribution, data comparison, and computation scheduling;
Computing unit status monitoring task: responsible for monitoring the communication state between the current computing unit and its peer computing units, and for judging the state of the whole system;
Communication channel management task: responsible for monitoring and managing the communication channels and for receiving and sending data;
System kernel process: at the start of each execution cycle, one process synchronization is performed so that the computing units begin the cycle's computation at approximately the same moment. After process synchronization comes the data input step, in which each computing unit obtains data from the outside. A first data synchronization (data sync 1) follows, which makes the data obtained by the computing units consistent and lets the application program start at approximately the same moment on every unit; that is, the application task is started to run the application program once the input data has been synchronized. When the application program finishes and produces a result, the result is distributed to the peer computing units. A second data synchronization (data sync 2) then ensures that every computing unit has received the data it needs for comparison and that result comparison starts at approximately the same moment. Result comparison is performed next. Once the comparison result is available, a third data synchronization (data sync 3) ensures that the comparison results, i.e. the data about to be output, are consistent and that output takes place at approximately the same moment. Finally, one or more arithmetic units may be selected to perform the output simultaneously.
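The ordering of the steps in one execution cycle can be written down as a small sketch. The `Phase` names and the `run_cycle` function are illustrative assumptions; the patent only fixes the order of the steps, not how they are represented in code.

```python
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    PROCESS_SYNC = "process sync"
    DATA_INPUT = "data input"
    DATA_SYNC_1 = "data sync 1"
    RUN_APPLICATION = "run application"
    DISTRIBUTE_RESULTS = "distribute results"
    DATA_SYNC_2 = "data sync 2"
    COMPARE_RESULTS = "compare results"
    DATA_SYNC_3 = "data sync 3"
    OUTPUT = "output"


def run_cycle(handlers: Dict[Phase, Callable[[], None]]) -> List[Phase]:
    """Run one execution cycle, calling the handler for each phase in order."""
    executed: List[Phase] = []
    for phase in Phase:  # an Enum iterates in definition order
        handler = handlers.get(phase)
        if handler is not None:
            handler()
        executed.append(phase)
    return executed
```

In a real kernel each handler would block until its step is complete on the local unit, so the three data synchronizations act as the points where the units wait for one another.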
Process synchronization: at the start of an execution cycle, the scheduler task of the computing unit acting as master channel first sends a process sync command, then suspends itself and waits for replies from the slave channels. Each slave channel's scheduler task suspends itself at the outset and waits for the master channel's process sync command. When a slave channel's communication channel management task receives the process sync command from the master channel, it replies immediately. After the master channel's communication channel management task has received process sync replies from all slave channels (or has timed out), it sends a process execute command to all slave channels and at the same time signals the local scheduler task to continue, so that the master channel's scheduler task resumes. Each slave channel, on receiving the process execute command from the master channel, signals its local scheduler task to continue, so that the slave channel's scheduler task resumes. The arithmetic units then differ only by one transmission delay, and the three arithmetic units can be regarded as starting at approximately the same moment.
The master channel may be selected in several ways, such as rotation or preemption.
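The handshake described above can be simulated in a single process with threads and queues. This is a simplified sketch under stated assumptions: the scheduler task and the communication channel management task of each unit are merged into one thread, queues stand in for the communication channels, and the message names `PROCESS_SYNC` and `PROCESS_EXECUTE` as well as all function names are hypothetical.

```python
import queue
import threading
from typing import Dict, List, Sequence


def master(slaves: Dict[str, queue.Queue], replies: queue.Queue,
           timeout: float, started: List[str]) -> None:
    # 1. Send the process sync command to every slave channel.
    for inbox in slaves.values():
        inbox.put("PROCESS_SYNC")
    # 2. Wait for one reply per slave; a timeout ends the wait early.
    for _ in slaves:
        try:
            replies.get(timeout=timeout)
        except queue.Empty:
            break
    # 3. Send the process execute command, then resume locally.
    for inbox in slaves.values():
        inbox.put("PROCESS_EXECUTE")
    started.append("master")


def slave(name: str, inbox: queue.Queue, replies: queue.Queue,
          started: List[str]) -> None:
    inbox.get()        # suspended until PROCESS_SYNC arrives
    replies.put(name)  # reply immediately
    inbox.get()        # suspended until PROCESS_EXECUTE arrives
    started.append(name)


def run_process_sync(slave_names: Sequence[str] = ("slave1", "slave2"),
                     timeout: float = 1.0) -> List[str]:
    replies: queue.Queue = queue.Queue()
    inboxes = {name: queue.Queue() for name in slave_names}
    started: List[str] = []
    threads = [threading.Thread(target=slave, args=(n, inboxes[n], replies, started))
               for n in slave_names]
    threads.append(threading.Thread(target=master,
                                    args=(inboxes, replies, timeout, started)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(started)
```

Because the master sends the execute command only after the replies are in (or the timeout expires), every unit resumes within roughly one message delay of the others, which is the "approximately the same moment" property the text relies on.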
Data synchronization: the scheduler task distributes the data to be synchronized, such as input data, to the peer computing channels and then sends a data sync command, which indicates that the data has been sent and synchronization is required. The scheduler task then suspends itself and waits for the other computing channels to finish their data transfer (i.e. for their data sync commands). Once the communication channel management task has received the data sync commands from all peer arithmetic units (or has timed out), the data of all computing channels has been received, and it signals the local scheduler task to continue, so that the local scheduler task resumes. Because every channel has waited for all other channels to finish their data transfer, the arithmetic units can be regarded as executing the step that follows data synchronization at approximately the same moment.
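The data synchronization step can be sketched in the same style. Again this is an in-process simulation under assumptions: one thread per unit, FIFO queues as channels, and hypothetical message tags `DATA` and `DATA_SYNC`. Because each sender queues its data before its sync marker, a unit that has seen every peer's marker has also seen every peer's data.

```python
import queue
import threading
from typing import Dict


def unit(me: str, value: int, inboxes: Dict[str, queue.Queue],
         timeout: float, views: Dict[str, Dict[str, int]]) -> None:
    peers = [p for p in inboxes if p != me]
    # Distribute the data, then the data sync command, to every peer.
    for p in peers:
        inboxes[p].put(("DATA", me, value))
        inboxes[p].put(("DATA_SYNC", me, None))
    got = {me: value}
    waiting = set(peers)
    # Suspend until every peer's data sync command has arrived, or time out.
    while waiting:
        try:
            kind, sender, payload = inboxes[me].get(timeout=timeout)
        except queue.Empty:
            break
        if kind == "DATA":
            got[sender] = payload
        else:
            waiting.discard(sender)
    views[me] = got


def run_data_sync(values: Dict[str, int],
                  timeout: float = 1.0) -> Dict[str, Dict[str, int]]:
    inboxes = {name: queue.Queue() for name in values}
    views: Dict[str, Dict[str, int]] = {}
    threads = [threading.Thread(target=unit, args=(n, v, inboxes, timeout, views))
               for n, v in values.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return views
```

After `run_data_sync` returns, every unit holds the same set of values, so a later comparison step on each unit works from identical data.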
Built on a real-time multitasking operating system, the invention implements a fault-tolerant kernel by adding middleware (software) between the application program and the operating system, without complex hardware circuits. The interface between the middleware and the application program is implemented through buffer management, and the interface between the middleware and the operating system through standard POSIX and operating system interfaces, which makes the kernel highly portable. Multitask management and data exchange provide a higher synchronization density. Output voting is done by comparing output data, so no additional output module is needed. Different fault-tolerant architectures can be realized by configuring the middleware.
Brief description of the drawings
Fig. 1 is a diagram of the execution cycle in the kernel implementation method of the invention (two-out-of-three fault-tolerant system);
Fig. 2 is a diagram of the process synchronization procedure in the kernel implementation method of the invention (two-out-of-three fault-tolerant system);
Fig. 3 is a diagram of the data synchronization procedure in the kernel implementation method of the invention (two-out-of-three fault-tolerant system);
Fig. 4 is a flowchart of the kernel implementation method of the invention (master channel selected by rotation);
Fig. 5 is a flowchart of master channel selection by preemption in the kernel implementation method of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment: a two-out-of-three fault-tolerant system is taken as the example. In the kernel implementation method proposed by the invention, a software middleware layer is formed between the application program and the operating system and manages system resources, the application program, and so on in a multitasking manner. The middleware can be divided into an application task, a master scheduler task, a computing unit status monitoring task, a communication channel management task, and so on, where:
Application task: the task that executes the application program. It is created by the master scheduler task and either finishes on its own or is destroyed by the master scheduler task.
Master scheduler task: responsible for synchronizing the computing units of the two-out-of-three system and for computation scheduling.
Computing unit status monitoring task: responsible for monitoring the communication state between the current computing unit and its peer computing units, and for judging the state of the whole system.
Communication channel management task: responsible for monitoring and managing the communication channels and for receiving and sending data.
The system execution cycle is shown in Fig. 1. At the start of each execution cycle, one process synchronization is performed so that the three computing units begin the cycle's computation at approximately the same moment. After process synchronization comes the data input step, in which each of the three computing units obtains data from the outside. A first data synchronization (data sync 1) follows, which makes the data obtained by the three computing units consistent and lets the application program start at approximately the same moment on every unit; that is, the application task is started to run the application program once the input data has been synchronized. When the application program finishes and produces a result, the result is distributed to the peer computing units. A second data synchronization (data sync 2) then ensures that every computing unit has received the data it needs for comparison and that result comparison starts at approximately the same moment. Result comparison is performed next. Once the comparison result is available, a third data synchronization (data sync 3) ensures that the comparison results, i.e. the data about to be output, are consistent and that output takes place at approximately the same moment. Finally, one, two, or all three arithmetic units may be selected to perform the output simultaneously.
Process synchronization is shown in Fig. 2. At the start of an execution cycle, the scheduler task of the computing unit acting as master channel first sends a process sync command, then suspends itself and waits for replies from the slave channels. Each slave channel's scheduler task suspends itself at the outset and waits for the master channel's process sync command. When a slave channel's communication channel management task receives the process sync command from the master channel, it replies immediately. After the master channel's communication channel management task has received process sync replies from all slave channels (or has timed out), it sends a process execute command to all slave channels and at the same time signals the local scheduler task to continue, so that the master channel's scheduler task resumes. Each slave channel, on receiving the process execute command from the master channel, signals its local scheduler task to continue, so that the slave channel's scheduler task resumes. The three arithmetic units then differ only by one transmission delay and can be regarded as starting at approximately the same moment.
In this embodiment, the master channel is selected by rotation.
Data synchronization is shown in Fig. 3. The scheduler task distributes the data to be synchronized, such as input data, to the peer computing channels and then sends a data sync command, which indicates that the data has been sent and synchronization is required. The scheduler task then suspends itself and waits for the other computing channels to finish their data transfer (i.e. for their data sync commands). Once the communication channel management task has received the data sync commands from all peer arithmetic units (or has timed out), the data of all computing channels has been received, and it signals the local scheduler task to continue, so that the local scheduler task resumes. Because every channel has waited for all other channels to finish their data transfer, the three arithmetic units can be regarded as executing the step that follows data synchronization at approximately the same moment.
Fig. 4 shows the flowchart of the software middleware in this embodiment (master channel selected by rotation). The master channel rotates once per execution cycle: the three arithmetic units take turns as master unit and jointly manage the whole system.
Because a master identification conflict may arise when an arithmetic unit rejoins the system, a conflict avoidance policy is needed. In this embodiment, when a master conflict occurs, it is resolved by a preset rule under which the unit later in the order yields to the unit earlier in the order.
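A possible reading of the rotation scheme and the yield rule is sketched below. The patent does not say how an offline unit's turn is handled or how the preset order is encoded; skipping offline units and using list position as the order are assumptions, and the function names are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Set


def rotating_master(cycle: int, order: Sequence[str],
                    online: Set[str]) -> Optional[str]:
    """Pick the master for this cycle, rotating through `order`.

    Assumption: if the unit whose turn it is is offline, the turn
    passes to the next online unit in the preset order.
    """
    n = len(order)
    for offset in range(n):
        candidate = order[(cycle + offset) % n]
        if candidate in online:
            return candidate
    return None


def keeps_master(me: str, claimants: Iterable[str],
                 order: Sequence[str]) -> bool:
    """Later-yields-to-earlier rule: only the earliest claimant stays master.

    `claimants` must be non-empty and should include `me`.
    """
    return me == min(claimants, key=order.index)
```

For example, if unit C rejoins and claims master in a cycle where A already is master, `keeps_master` is true for A and false for C, so C gives way.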
Embodiment 2 is essentially the same as the embodiment above, except that the master channel is selected by preemption.
Fig. 5 shows the flowchart of master channel selection by preemption in this embodiment. In this mode, the computing unit status monitoring task sends a status query command to the peer channels in each channel status detection cycle. When the replies return, it checks the communication status record table and sets the local unit as master when one of the following conditions is met.
Condition 1: only one peer channel is connected to the local unit. The local unit checks that the other peer channel is really offline (its own judgment agrees with the state reported by the connected peer channel), that the master status of the replying peer channel is non-master, and that its own master status is non-master. It then sets its own master status to master.
Condition 2: both peer channels are connected to the local unit. The local unit checks that the states of the two peer channels are consistent (its own judgment agrees with the states reported by the peer channels), that the master status of both peer channels is non-master, and that its own master status is non-master. It then sets its own master status to master.
Because the three hosts are not time-synchronized in the strict sense, master identification between channels is subject to some error, and more than one host may judge itself to be master at the same time. As soon as the status monitoring task detects that its own master status conflicts with that of a peer channel, it clears its own master status and waits for the next judgment, thereby avoiding the conflict.
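The two preemption conditions and the conflict rule can be folded into one check. The sketch below is an interpretation, not the patent's record table format: it assumes each peer reply carries the peer's master flag and the set of units that peer currently sees online, and it treats "own judgment agrees with the peer's state" as equality of those sets. `PeerReply`, `should_become_master`, and `yield_on_conflict` are hypothetical names. The case where no peer replies is not covered by the patent; the sketch declines to take the master role in that case.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class PeerReply:
    is_master: bool          # master status reported by the peer
    online: FrozenSet[str]   # units the peer reports as online (assumed field)


def should_become_master(me: str, i_am_master: bool,
                         replies: Dict[str, PeerReply]) -> bool:
    """Apply condition 1 (one reply) or condition 2 (two replies)."""
    if i_am_master or not replies:
        return False
    # Own judgment of who is online: self plus every peer that replied.
    my_view = frozenset(replies) | {me}
    views_agree = all(r.online == my_view for r in replies.values())
    nobody_is_master = not any(r.is_master for r in replies.values())
    return views_agree and nobody_is_master


def yield_on_conflict(i_am_master: bool, replies: Dict[str, PeerReply]) -> bool:
    """Return the new local master flag: cleared if any peer also claims master."""
    if i_am_master and any(r.is_master for r in replies.values()):
        return False
    return i_am_master
```

With one reply the agreement check confirms that the silent peer is offline in both views (condition 1); with two replies it confirms that all three units see each other (condition 2).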

Claims (4)

1. A method for implementing the kernel of a multi-machine fault-tolerant system, in which a software middleware layer is formed between the application program and the operating system and manages system resources and the application program in a multitasking manner; the middleware is divided into an application task, a master scheduler task, a computing unit status monitoring task, and a communication channel management task, wherein:
Application task: the task that executes the application program, created by the master scheduler task and either finishing on its own or destroyed by the master scheduler task;
Master scheduler task: responsible for synchronizing the system's computing units and for data distribution, data comparison, and computation scheduling;
Computing unit status monitoring task: responsible for monitoring the communication state between the current computing unit and its peer computing units, and for judging the state of the whole system;
Communication channel management task: responsible for monitoring and managing the communication channels and for receiving and sending data;
System kernel process: at the start of each execution cycle, one process synchronization is performed so that the computing units begin the cycle's computation at approximately the same moment; after process synchronization the data input step is performed, in which each computing unit obtains data from the outside; a first data synchronization follows, which makes the data obtained by the computing units consistent and lets the application program start at approximately the same moment, that is, the application task is started to run the application program once the input data has been synchronized; when the application program finishes and produces a result, the result is distributed to the peer computing units, after which a second data synchronization ensures that every computing unit has received the data it needs for comparison and that result comparison starts at approximately the same moment; result comparison is performed next, and once the comparison result is available a third data synchronization ensures that the comparison results, i.e. the data about to be output, are consistent and that output takes place at approximately the same moment; finally, one or more arithmetic units may be selected to perform the output simultaneously.
2. The method for implementing the kernel of a multi-machine fault-tolerant system according to claim 1, characterized in that the process synchronization is as follows: at the start of an execution cycle, the scheduler task of the computing unit acting as master channel first sends a process sync command, then suspends itself and waits for replies from the slave channels; each slave channel's scheduler task suspends itself at the outset and waits for the master channel's process sync command; when a slave channel's communication channel management task receives the process sync command from the master channel, it replies immediately; after the master channel's communication channel management task has received process sync replies from all slave channels, or has timed out, it sends a process execute command to all slave channels and at the same time signals the local scheduler task to continue, so that the master channel's scheduler task resumes; each slave channel, on receiving the process execute command from the master channel, signals its local scheduler task to continue, so that the slave channel's scheduler task resumes; the arithmetic units then differ only by one transmission delay, and the three arithmetic units can be regarded as starting at approximately the same moment.
3. The method for implementing the kernel of a multi-machine fault-tolerant system according to claim 2, characterized in that the master channel is selected by rotation or by preemption.
4. The method for implementing the kernel of a multi-machine fault-tolerant system according to claim 1, characterized in that the data synchronization is as follows: the scheduler task distributes the data to be synchronized to the peer computing channels and then sends a data sync command, which indicates that the data has been sent and synchronization is required; the scheduler task then suspends itself and waits for the other computing channels to finish their data transfer, i.e. for their data sync commands; once the communication channel management task has received the data sync commands from all peer arithmetic units, or has timed out, the data of all computing channels has been received, and it signals the local scheduler task to continue, so that the local scheduler task resumes; because every channel has waited for all other channels to finish their data transfer, the arithmetic units can be regarded as executing the step that follows data synchronization at approximately the same moment.
CNB200610161298XA 2006-12-20 2006-12-20 Method for implementing kernel of multi-machine fault-tolerant system Active CN100530106C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200610161298XA CN100530106C (en) 2006-12-20 2006-12-20 Method for implementing kernel of multi-machine fault-tolerant system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200610161298XA CN100530106C (en) 2006-12-20 2006-12-20 Method for implementing kernel of multi-machine fault-tolerant system

Publications (2)

Publication Number Publication Date
CN101000561A CN101000561A (en) 2007-07-18
CN100530106C true CN100530106C (en) 2009-08-19

Family

ID=38692545

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200610161298XA Active CN100530106C (en) 2006-12-20 2006-12-20 Method for implementing kernel of multi-machine fault-tolerant system

Country Status (1)

Country Link
CN (1) CN100530106C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794242B (en) * 2010-01-29 2012-07-18 西安交通大学 Fault-tolerant computer system data comparing method serving operating system core layer
CN102298324B (en) * 2011-06-21 2013-04-17 东华大学 Cooperative intelligent accurate fault-tolerance controller and method thereof
CN108804109B (en) * 2018-06-07 2021-11-05 北京四方继保自动化股份有限公司 Industrial deployment and control method based on multi-path functional equivalent module redundancy arbitration

Also Published As

Publication number Publication date
CN101000561A (en) 2007-07-18

Similar Documents

Publication Publication Date Title
CN105930580B (en) Time synchronization and data exchange device and method for joint simulation of power system and information communication system
CN107483135B (en) A kind of the time trigger Ethernet device and method of high synchronization
CN106647613B (en) PLC (programmable logic controller) dual-machine redundancy method and system based on MAC (media access control)
CN103199972B (en) The two-node cluster hot backup changing method realized based on SOA, RS485 bus and hot backup system
CN102984042B (en) Deterministic scheduling method and system for realizing bus communication
CN103377083A (en) Method of redundant automation system for operating the redundant automation system
CN106603367A (en) CAN bus communication method for time synchronization
CN109507866A (en) A kind of double-machine redundancy system and method based on network address drift technology
CN106790694A (en) The dispatching method of destination object in distributed system and distributed system
KR20090067152A (en) Cluster coupler unit and method for synchronizing a plurality of clusters in a time-triggered network
CN102724083A (en) Degradable triple-modular redundancy computer system based on software synchronization
CN101625568A (en) Synchronous data controller based hot standby system of main control unit and method thereof
CN102830647A (en) Double 2-vote-2 device for fail safety
CN101790230A (en) Precision time protocol node, time stamp operation method and time synchronization system
CN101009546A (en) Time synchronization method for network segment utilizing different time synchronization protocol
CN100527661C (en) Method and system for realizing multi-clock synchronization
CN111899501A (en) Remote control method for switch in distribution network automation master station substation
CN100530106C (en) Method for implementing kernel of multi-machine fault-tolerant system
CN105608039B (en) A kind of double redundancy computer cycle control system and method based on FIFO and ARINC659 bus
KR101704751B1 (en) Apparatus for simulating of multi-core system by using timing information between modules, and method thereof
CN106712887B (en) A kind of principal and subordinate's two-shipper state synchronization method based on Network Time Protocol
CN105577310B (en) The synchronous method of task partition and communication scheduling in a kind of time triggered Ethernet
US8527741B2 (en) System for selectively synchronizing high-assurance software tasks on multiple processors at a software routine level
CN202795349U (en) Serial bus data analyzer and analysis system
CN111464346B (en) Main and standby control board synchronization method and system based on ATCA (advanced telecom computing architecture)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191008

Address after: Room 1401, floor 14, block a, building 1, Guorui building, No. 359, Jiangdong Middle Road, Jianye District, Nanjing City, Jiangsu Province, 210019

Patentee after: Nanjing Guorui Defense System Co., Ltd.

Address before: 1313 box 03, box 210014, Nanjing City, Jiangsu Province

Patentee before: No. 14 Inst., China Electronic Science & Technology Group Corp.
