CN101727629A - Self-organization distribution business system - Google Patents



Publication number
CN101727629A
CN101727629A CN200810223932A CN200810223932A CN101727629A CN 101727629 A CN101727629 A CN 101727629A CN 200810223932 A CN200810223932 A CN 200810223932A CN 200810223932 A CN200810223932 A CN 200810223932A CN 101727629 A CN101727629 A CN 101727629A
Authority
CN
China
Prior art keywords
sodbs
server
software
code
overseer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810223932A
Other languages
Chinese (zh)
Inventor
Inventor name not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Property & Credit Guarantee Co Ltd
Original Assignee
Beijing Property & Credit Guarantee Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Property & Credit Guarantee Co Ltd
Priority to CN200810223932A
Publication of CN101727629A


Abstract

The invention relates to a self-organization distribution business system, abbreviated SODBS, for processing POS terminal payments and online payments in the financial industry. The business systems of financial firms are very complex, and significant maintenance (for example, diagnosing equipment failures or upgrading software) generally has to be done offline. The invention aims to allow the system to be maintained online while its basic financial business functions remain available. The invention adopts a cross-platform distributed processing model, so that the failure of part of the software or hardware does not bring down the whole system. With the self-organized architecture, software and hardware failures are detected automatically and handled automatically (for example, by restarting or reinitializing the failed node), and alarms are raised through audible and visual signals, mobile phone short messages, and the like. The beneficial effects are that software and hardware failures are detected and repaired automatically, hardware nodes can be added or removed without shutting the system down, and software can be upgraded without restarting or suspending services. Clients can therefore enjoy uninterrupted, high-quality online financial services.

Description

Self-organization distribution business system
Technical field
The present invention applies to transaction processing systems in the financial industry that must run stably and continuously online, handling POS terminal payments and electronic (online) payments.
Background technology
Business systems in the financial industry generally consist of a high-performance host computer (usually protected by measures such as dual-machine hot standby and automatic data backup) and various terminal devices (POS terminals, client PCs, etc.) that are connected to the host and follow the specified service communication protocol. When deeper maintenance is required (for example, after a key equipment failure or for a software upgrade), such a system has to be taken offline. The object of the present invention is to make the system fault tolerant (for both software and hardware errors) and maintainable online.
Summary of the invention
The present invention is called the self-organization distribution business system; its English name is Self-Organized Distributed Banking System, abbreviated SODBS. In addition to conventional techniques such as automatic data backup, the system uses distributed processing on an improved Beowulf-style server cluster (see Note 1) that has no fixed management node. A number of relatively inexpensive standard PC servers thereby deliver performance that usually requires a minicomputer, together with a level of performance and a speed of response to faults that traditional dual-machine standby cannot achieve. All server hardware nodes are connected in a mesh with no designated key node, so a hardware or software failure on some of the servers does not paralyze the whole system. The system detects software and hardware faults within itself and handles them automatically (for example, by restarting or reinitializing the faulty node), while raising alarms by audible and visual signals, SMS, and similar means. The self-organized architecture yields the following beneficial effects: hardware node faults are detected and repaired by the system itself, hardware nodes can be added or removed without a shutdown, and software can be upgraded without restarting or stopping the system. Overall cost is low and the price-performance ratio is high. Clients enjoy continuous, high-quality online financial services.
Note 1. Beowulf is a system architecture that allows a system composed of multiple computers to be used for parallel computing.
A Beowulf system usually consists of one management node and multiple compute nodes, connected by Ethernet (or another network). The management node monitors the compute nodes and usually also serves as their gateway and control terminal; it is normally the cluster's file server as well. In large clusters, special requirements may lead to these management functions being shared among several nodes.
A Beowulf system is usually built from commodity hardware, such as PCs, Ethernet cards, and Ethernet switches, and rarely includes custom-made special equipment.
A Beowulf system usually runs inexpensive, widely used software, such as the Linux operating system, Parallel Virtual Machine (PVM), and the Message Passing Interface (MPI).
In recent years, the TOP500 list (www.top500.org) of the world's most powerful supercomputers has been increasingly occupied by Beowulf-style workstation clusters. As of August 2008, the top-ranked supercomputer was no longer an expensive giant mainframe such as Blue Gene or the Earth Simulator, but a Beowulf-style microcomputer workstation cluster running the open-source Linux system (named Roadrunner, made up of nearly 3000 standard PC servers). Such microcomputer workstation clusters are used in settings that demand high security and continuous, stable operation, such as nuclear weapons safety assurance and space programs.
Description of drawings
Fig. 1 is an overview of the hardware connections and communication structure of the whole system. The server cluster is mesh-connected, and every line into the cluster, together with the other equipment on that line, is redundant and arranged in parallel. There is no single point of failure, so the failure of any one device or the interruption of any one line does not affect normal operation. When the system automatically detects a problem, such as a software or hardware fault on a node, it issues a "warning" audible and visual prompt and an SMS prompt (containing the necessary diagnostic information) to alert the monitoring staff, and at the same time tries to recover by automatically restarting the faulty part in software or hardware. If automatic repair fails, for example because the node cannot be restarted or still reports errors after the restart, the system issues a "critical error" audible and visual prompt and an SMS prompt (again with diagnostic information), reminding the monitoring staff to repair the affected equipment manually.
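The two-stage reaction described for Fig. 1 (warn and restart first, escalate to manual repair only if the restart does not help) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Node`, `Monitor`, `restart_fixes` and the alert strings are hypothetical, and a real system would send audible, visual and SMS alerts instead of appending to a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical server node; restart_fixes simulates whether a restart cures the fault."""
    name: str
    healthy: bool = True
    restart_fixes: bool = True

    def restart(self) -> None:
        # Stand-in for a software or hardware restart of the node.
        if self.restart_fixes:
            self.healthy = True


@dataclass
class Monitor:
    """Collects alerts; a stand-in for the audible/visual and SMS channels."""
    alerts: List[str] = field(default_factory=list)

    def notify(self, level: str, node: Node, detail: str) -> None:
        self.alerts.append(f"{level}:{node.name}:{detail}")

    def handle_fault(self, node: Node) -> str:
        # Stage 1: warn the monitoring staff and try an automatic restart.
        self.notify("WARNING", node, "fault detected, restarting")
        node.restart()
        if node.healthy:
            return "recovered"
        # Stage 2: automatic repair failed, escalate to manual repair.
        self.notify("CRITICAL", node, "restart failed, manual repair needed")
        return "manual_repair"
```

A node whose fault disappears after the restart produces a single warning; a node that stays faulty produces a warning followed by a critical alert.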
Fig. 2 is a diagram of overseers and workers. An overseer is drawn as a rectangle with square corners. A symbol T in the upper right corner of the rectangle indicates the overseer's type: T is either "0", denoting "or" supervision, or "A", denoting "and" supervision. An overseer can supervise any number of workers or overseers. For each entity it supervises, the overseer must know how to start, stop, and restart that entity. This information is stored in an SSRS, that is, a "Start Stop and Restart Specification". Every overseer (except the top-level overseer of the supervision hierarchy) has exactly one overseer directly above it; we call the overseer directly above the parent of the overseer directly below. Conversely, the overseers directly below a given overseer in the supervision hierarchy are that overseer's children. A worker is drawn as a rectangle with rounded corners. Workers are parameterized by well-behaved functions.
Fig. 3 is a linear hierarchy, an "or"-type supervision tree. Each overseer holds an SSRS for each of its children and observes the following rules:
If an overseer is stopped by its parent, it stops all of its children.
If any child of an overseer crashes, the overseer restarts that child.
The system is started by starting the top-level overseer. When the top-level overseer is first started, it uses SSRS1. The top-level overseer has two children, a worker and an overseer. It starts the worker (a behaviour parameterized with the well-behaved function WBF1) and also starts the child overseer. The lower-level overseers in the hierarchy are started in a similar manner, and so the whole system comes up.
Fig. 4 is an AND/OR hierarchical supervision tree. The character "A" marks an "and" overseer and the character "0" marks an "or" overseer. An overseer in an AND/OR tree follows these rules:
If an overseer is stopped by its parent, it stops all of its children.
If a child of an overseer crashes and the overseer is an "and" overseer, it stops all of its children and then restarts all of them.
If a child of an overseer crashes and the overseer is an "or" overseer, it restarts only that child.
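The three rules above can be exercised with a small in-memory model. This is only a sketch under stated assumptions: `Worker`, `Overseer`, `child_crashed` and the `starts` counter are hypothetical names introduced for illustration, the `start` and `stop` methods stand in for the SSRS entries, and crash detection itself is not modelled (the caller reports the crash).

```python
AND, OR = "A", "0"  # type symbols as printed in the text for Fig. 2 and Fig. 4


class Worker:
    """A supervised leaf; start/stop stand in for its SSRS entries."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.starts = 0  # how many times this entity has been started

    def start(self):
        self.running = True
        self.starts += 1

    def stop(self):
        self.running = False


class Overseer:
    """Supervises workers or other overseers with "and" or "or" semantics."""

    def __init__(self, kind, children):
        if kind not in (AND, OR):
            raise ValueError("kind must be AND or OR")
        self.kind = kind
        self.children = list(children)
        self.running = False
        self.starts = 0

    def start(self):
        self.running = True
        self.starts += 1
        for child in self.children:
            child.start()

    def stop(self):
        # Rule 1: an overseer stopped by its parent stops all of its children.
        for child in reversed(self.children):
            child.stop()
        self.running = False

    def child_crashed(self, child):
        if self.kind == AND:
            # Rule 2: "and" supervision stops every child, then restarts them all.
            for c in reversed(self.children):
                c.stop()
            for c in self.children:
                c.start()
        else:
            # Rule 3: "or" supervision restarts only the crashed child.
            child.stop()
            child.start()
```

In this model a crash under an "and" overseer increments the start count of every sibling, while a crash under an "or" overseer increments only the count of the crashed child.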
"And" supervision is used for dependent or co-ordinated processes. In an "and" tree, successful operation of the system depends on the success of all the children; therefore, when any child crashes, all children should be stopped and restarted.
"Or" supervision can be used to co-ordinate the behaviour of independent processes. In an "or" tree, the children are considered independent of one another, so a failure in one child does not affect the others; only the failed child process or child device needs to be restarted.
Fig. 5 shows two processes X and Y and a protocol checker C. When X sends Y a message Q (a query), Y replies with a response R and a new state S. The value pair {R, S} can then be type-checked against the rules in the protocol description. The protocol checker C sits between X and Y and checks every message passing between them against the protocol description.
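A protocol checker of the kind shown in Fig. 5 might look like the sketch below. The protocol table, the state and message names, and the two example servers are all hypothetical; the source does not give the notation of the protocol description, so a plain dictionary mapping each state and query to the permitted {R, S} pairs is assumed here.

```python
class ProtocolError(Exception):
    """Raised when a query or a {response, state} pair violates the protocol."""


# Hypothetical protocol description: state -> query -> allowed (response, next state) pairs.
PROTOCOL = {
    "idle": {"login": {("ok", "ready"), ("denied", "idle")}},
    "ready": {"pay": {("receipt", "ready")}, "logout": {("bye", "idle")}},
}


class Checker:
    """Sits between client X and server Y and validates every exchange."""

    def __init__(self, protocol, server, state):
        self.protocol = protocol
        self.server = server  # callable: (state, query) -> (response, new_state)
        self.state = state

    def call(self, query):
        allowed = self.protocol.get(self.state, {}).get(query)
        if allowed is None:
            raise ProtocolError(f"query {query!r} not allowed in state {self.state!r}")
        response, new_state = self.server(self.state, query)
        if (response, new_state) not in allowed:
            raise ProtocolError(f"bad reply {(response, new_state)!r} to {query!r}")
        self.state = new_state  # only a valid exchange advances the state
        return response


def good_server(state, query):
    """Example Y that follows the protocol."""
    table = {
        ("idle", "login"): ("ok", "ready"),
        ("ready", "pay"): ("receipt", "ready"),
        ("ready", "logout"): ("bye", "idle"),
    }
    return table[(state, query)]


def buggy_server(state, query):
    """Example Y that returns a wrong next state for "pay"."""
    return ("ok", "ready") if query == "login" else ("receipt", "idle")
```

With `good_server` every exchange passes; with `buggy_server` the checker rejects the reply to "pay" and leaves its recorded state unchanged, so the fault is caught at the boundary between the two processes.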
Embodiment
Our design adopts the core idea that "everything is a process", and SODBS is built on a server hardware cluster. All the work the software has to do is divided into a hierarchy of tasks, and each task is carried out by a "strongly isolated process". Processes share no state and can communicate only by message passing. Strongly isolated processes not only describe real-world information processing more faithfully, they are also the strongest model for protecting the reliability of the system when software errors occur. A second idea that needs emphasis concerns fault handling. Because all business processing takes place in strongly isolated processes, which we call "workers", an error in one process cannot propagate to another. The health of the business processes is watched over by other, dedicated processes, which we call "overseers". Workers and overseers form a hierarchical supervision model, so that when a process fails the system as a whole can adjust accordingly and keep providing service to the greatest possible extent. The design idea behind the behaviour library of the software system is to separate the concurrent part of a program from its sequential part. Abstracting out the concurrent part in this way lets the system evolve from initially "fragile" to "fault tolerant".
Financial applications are complex, large programs. Even after rigorous testing, errors inevitably appear once they go into production, and hardware can also fail unexpectedly. We assume that these programs and the hardware will inevitably contain errors, and we then look for ways to build a reliable system from software and hardware that contain errors.
One of our main concerns is how to construct a reliable system when the programs themselves contain errors. Building such a system places some special requirements on whatever programming language is used. These language requirements are discussed here, and we show how the present system satisfies them.
These requirements can be met in the programming language or in the standard libraries that accompany the language. We show which requirements should be met in the language and which can be met in the standard libraries when building a fault-tolerant system. Together they form the basis for building the fault-tolerant software system SODBS.
1 introduction
How do we program software that behaves reasonably in the presence of software or hardware errors? This is the key question we want to answer. The Windows operating system on a PC, for example, receives new patches for newly discovered errors every month after its formal release. Large systems are the same: they are often delivered with many software errors, and yet we still expect them to run normally.
The system has imperfect parts, yet we want it to be reliable, and this places certain requirements on the system. These requirements can be met either in the programming language used or in the standard libraries that the application calls. In this document we list the essential characteristics that we believe a fault-tolerant system must have, and we show how our system satisfies them.
Some of these characteristics are satisfied in our programming language, and others in the library modules we have written. Together, the language and the libraries form the basis for building reliable software systems, so that the system still behaves reasonably even when a program contains errors.
This document is concerned with the requirements that software fault tolerance places on the language, the libraries, and the operating system, and with how they are realized. Using the C language, we have built a platform based on message passing among strongly independent concurrent processes; it is a new declarative, functional-style language. Our programming model makes extensive use of fail-fast processes. This technique is common in hardware platforms for fault-tolerant systems but is rarely used in software design, mainly because traditional programming languages do not allow different software modules to coexist without interfering with one another. The programming model in general use today is multithreading, in which resources are shared. Threads are therefore hard to isolate properly, and an error in one thread can propagate to another, which undermines the robustness of the system.
2 architectural model
Here we propose a software architecture for building fault-tolerant systems. Although everyone has a vague notion of what the word architecture means, there is almost no widely accepted definition of it, and this has led to many misunderstandings. We consider the following definition to be a fairly comprehensive summary of software architecture:
An architecture is the set of significant decisions about the organization of a software system; the selection of the structural elements of which the system is composed and of their interfaces, together with their behaviour as specified in the collaborations among those elements; the composition of these structural and behavioural elements into progressively larger subsystems; and the architectural style that guides this organization: the elements and their interfaces, their collaborations, and their composition.
2.1 definition of an architecture
At the highest level of abstraction, an architecture is "a way of thinking about the world". To be practically useful, however, we must turn our way of looking at the world into a practical handbook and a set of rules that tell us how to construct a particular system using that way of looking at the world.
Our software architecture is characterized by the following aspects:
1. Problem domain: what type of problem is our architecture designed to solve? A software architecture is never general-purpose; it is designed to solve a particular class of problems. A description of an architecture that does not say which class of problems it addresses is incomplete.
2. Philosophy: what are the principles behind the method of software construction? What are the core ideas of the architecture?
3. Software construction guidelines: how do we plan a system? We need an explicit set of software construction guidelines. Our systems will be written and maintained by a team of programmers, so it is important that all programmers and system designers understand the architecture of the system and its underlying philosophy. In practice, this knowledge is most conveniently maintained in the form of software construction guidelines. A complete set includes programming rules, example programs, training material, and so on.
4. Pre-defined components: designing by "selecting from a set of pre-defined components" is much easier than "designing from scratch". Our libraries contain a complete set of ready-made components (called the behaviour library) from which commonly used systems can be built. For example, the sodbs server behaviour can be used to build client-server systems, and the sodbs_event behaviour can be used to build event-based programs.
5. Descriptive notation: how do we describe the interface of a component? How do we describe the communication protocol between two components in the system? How do we describe the static and dynamic structure of the system? To answer these questions we introduce some special notations, some for describing program APIs and others for describing protocols and system structure.
6. Configuration: how do we start, stop, and configure our system? Can we reconfigure the system while it is running?
2.2 problem domain
Our system platform is designed for developing banking software. Financial systems have stringent requirements for reliability and fault tolerance. A financial system needs to run "forever", must respond in soft real time, and must behave reasonably when software or hardware failures occur. We have summarized ten attributes that a financial system needs to have.
1. The system must be able to handle very large numbers of concurrent activities.
2. Actions must be performed at a specified point in time or within a specified time.
3. The system should be distributed over several computers.
4. The system is used to control hardware.
5. The software systems are often very large.
6. The system exhibits complex functionality, such as feature interaction.
7. The system should be able to run continuously for many years.
8. Software maintenance (reconfiguration, etc.) should be performed without stopping the system.
9. Stringent quality and reliability requirements must be met.
10. Fault tolerance must be provided, covering both hardware failures and software errors.
We can analyse these requirements as follows:
Concurrency: financial systems are inherently concurrent, because tens of thousands of users are often interacting with financial equipment at the same time. This means that a financial system must handle tens of thousands of concurrent activities efficiently.
Soft real-time: in a financial system, many operations must be completed within a specified time. Some of these operations are strictly timed, in the sense that if a given operation is not completed within a given period, the whole operation is cancelled. Other operations are merely monitored by some form of timer, and are re-executed if the timer expires before the operation has completed.
Writing such a system requires managing tens of thousands of timers efficiently.
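One common way to keep tens of thousands of pending timers cheap is a binary heap ordered by deadline, so that starting a timer and finding the next timer to fire both cost O(log n). The sketch below assumes this technique; the source does not say how SODBS manages its timers, and `TimerQueue`, `start`, `cancel` and `advance` are hypothetical names. Time is passed in explicitly so that the example is deterministic.

```python
import heapq
import itertools


class TimerQueue:
    """Deadline-ordered timers backed by a binary heap."""

    def __init__(self):
        self._heap = []                # entries: (deadline, timer id, callback)
        self._ids = itertools.count()  # unique ids also break ties between equal deadlines
        self._cancelled = set()

    def start(self, deadline, on_timeout):
        """Register a timer and return its id."""
        timer_id = next(self._ids)
        heapq.heappush(self._heap, (deadline, timer_id, on_timeout))
        return timer_id

    def cancel(self, timer_id):
        """Mark a pending timer as cancelled; it is dropped lazily when it reaches the top."""
        self._cancelled.add(timer_id)

    def advance(self, now):
        """Fire every timer whose deadline is <= now; return how many fired."""
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, timer_id, on_timeout = heapq.heappop(self._heap)
            if timer_id in self._cancelled:
                self._cancelled.discard(timer_id)
                continue
            on_timeout()  # for example: cancel the operation, or re-execute it
            fired += 1
        return fired
```

An operation that completes in time cancels its timer; one that does not is handled by the callback, which can either abort the operation (the strictly timed case) or run it again (the supervised case).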
Distributed: financial systems are not inherently distributed, and our system should be built in a way that makes it easy to move from a single-node system to a multi-node distributed system.
Hardware interaction: a financial system has a large number of peripheral devices to control and monitor. This means that it must be possible to write efficient device drivers, and that context switching between different device drivers must also be efficient.
Large software systems: financial systems are very large, which means that banking software must still work when the source code reaches millions of lines.
Complex functionality: financial systems have complex functionality. Market pressure forces systems to be developed and deployed with many complex features. Often a system has to be deployed before the interactions between these features are well understood. During the lifetime of the system, the feature set will probably need to be changed and extended in many ways. Upgrades to functionality and software must be performed "in place", that is, without stopping the system.
Continuous operation: financial systems are designed to run continuously for many years. This means that software and hardware maintenance must be carried out without stopping the system (the usual requirement is no more than 2 hours of downtime in 40 years).
Quality requirements: a financial system should provide acceptable service even when errors occur. Reliability requirements are especially high for deposit and withdrawal equipment.
Fault tolerance: financial systems should be fault tolerant. That is, we know from the outset that failures will occur, and we must design software and hardware infrastructure that can deal with these failures and still provide acceptable service when they occur.
Although these requirements originate in the financial world, they are by no means specific to this problem domain. Many modern Internet services (for example, e-commerce servers) have a very similar list of requirements.
2.3 philosophy
How can we construct fault-tolerant software systems that behave reasonably in the presence of software errors? This is the question that the rest of this document answers. We first give a brief answer, which is refined in the remainder of the document.
To construct fault-tolerant software systems that still behave reasonably in the presence of software errors, we do the following:
We organize the software into a hierarchy of tasks that the system has to perform. Each task corresponds to a set of goals, and the software for a given task must try to achieve the goals associated with that task.
Tasks are ordered by complexity, with the most complex task at the top level. If all the goals of the top-level task are achieved, the whole system is working well. Lower-level tasks should keep the system running in an acceptable manner, even if the service it offers is reduced. The goals of a lower-level task are easier to achieve than those of a higher-level task. We try our best to perform the top-level task.
If an error is detected while trying to achieve a goal, we attempt to correct the error. If we cannot correct it, we immediately abort the current task and start a simpler one.
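The "abort and start a simpler task" rule can be sketched as a loop over tasks ordered from most to least complex. The function name `run_with_fallback` and the example tasks are hypothetical, and error correction within a task is left out; the sketch only shows the fallback between levels.

```python
def run_with_fallback(tasks):
    """Try tasks in order, most complex first.

    tasks is a list of (name, callable) pairs. Returns (name, result, errors)
    for the first task that succeeds, where errors lists the failed attempts.
    """
    errors = []
    for name, task in tasks:
        try:
            return name, task(), errors
        except Exception as exc:
            # The goal could not be achieved: record why, then drop to a simpler task.
            errors.append((name, repr(exc)))
    raise RuntimeError(f"no task could be completed: {errors}")


def full_service():
    """Hypothetical top-level task that fails in this example."""
    raise ConnectionError("ledger node unreachable")


def read_only_service():
    """Hypothetical simpler task offering reduced service."""
    return "balance enquiries only"
```

Calling `run_with_fallback([("full", full_service), ("read_only", read_only_service)])` returns the result of the simpler task together with the recorded failure of the top-level one, which is the reduced but acceptable service the text describes.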
Programming such a hierarchy of tasks requires a strong encapsulation method. We need strong encapsulation in order to isolate errors. We do not want to write systems in which an error in one part adversely affects other parts.
We need to isolate all the code written to achieve a given goal, in such a way that we can detect any error that occurs while trying to achieve it. Moreover, when we are trying to achieve several goals at the same time, we do not want an error in one part of the system to propagate to another part.
The essential problem to be solved in building a fault-tolerant software system is therefore fault isolation. Different programmers write different modules; some modules are correct and others contain errors. We do not want the erroneous modules to have any adverse effect on the modules that are free of errors.
To provide this fault isolation, we use the traditional operating-system notion of a process. Processes provide protection domains, so an error in one process cannot affect the operation of other processes. Applications written by different programmers run in separate processes, and an error in one application should have no side effects on the other applications running in the system.
This choice certainly satisfies the basic requirement. However, because all processes use the same CPU and the same physical memory, processes that compete for CPU time or use large amounts of memory can still adversely affect other processes in the system. The extent to which processes interfere with one another depends on the design characteristics of the operating system.
In our system, processes and concurrent programming are part of the language rather than being provided by the host operating system. This has many advantages over using operating-system processes directly:
Concurrent programs run identically on different operating systems: we are not constrained by how processes are implemented in any particular operating system. The only visible differences when our programs run on different operating systems and processors are CPU speed and memory size.
All synchronization and inter-process interaction is independent of the properties of the host operating system.
Our language-based processes are much lighter-weight than traditional operating-system processes. Creating a process in our language is highly efficient: several orders of magnitude faster than process creation in most operating systems, and orders of magnitude faster than thread creation in most languages.
Our system makes very few demands on the operating system. We use only a small subset of the services the operating system offers, so porting our system to specialized environments such as embedded systems is relatively simple.
Our applications are built from large numbers of communicating concurrent processes. We adopt this approach because:
It provides an architectural infrastructure: we can organize our system as a set of communicating processes. By enumerating all the processes in the system and defining the message-passing channels between them, we can conveniently partition the system into well-defined sub-components that can be implemented and tested independently. This methodology is also the top level of the SDL design methodology.
Potential efficiency: a system designed as many independent concurrent processes can easily be implemented on a multiprocessor or run on a network of distributed processors. Note that the efficiency gain is only potential; it materializes only when the application can be decomposed into many genuinely independent tasks. If there are strong data dependencies between the tasks, such gains are often impossible.
Fault isolation: concurrent processes that share no data provide a powerful means of fault isolation. A software error in one concurrent process should not affect the operation of the other processes in the system. Of these three uses of concurrency, the first two are not essential characteristics; they can be provided by some built-in scheduler that offers various forms of pseudo-parallel time sharing between processes.
The third characteristic is essential for programming fault-tolerant systems. Each independent activity is performed in a completely independent process. These processes share no data and communicate only by message passing, which limits the consequences of a software error. As soon as processes share any common resource, such as memory, a pointer into memory, or a mutex, there is the possibility that a software error in one process will corrupt the shared resource. Since eliminating such software errors from large software systems is still an unsolved problem, we believe that the only realistic way to build large reliable systems is to decompose the system into many independent concurrent processes and to provide mechanisms for monitoring and restarting these processes.
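The isolation argument can be illustrated with a single-threaded simulation in which each "process" owns private state and a mailbox, messages are deep-copied on send so that nothing is shared, and a handler that raises takes down only its own process. This is an illustrative model, not the SODBS runtime: `Process`, `System`, `spawn`, `send` and `run` are hypothetical names, and real isolation would need separate address spaces or a language runtime that enforces it.

```python
import copy
from collections import deque


class Process:
    """Private state plus a mailbox; nothing here is shared with other processes."""

    def __init__(self, name, handler, state):
        self.name = name
        self.handler = handler  # callable: (state, message) -> new state
        self.state = state
        self.mailbox = deque()
        self.alive = True


class System:
    def __init__(self):
        self.procs = {}
        self.crashes = []  # (process name, error) pairs, for a monitor to act on

    def spawn(self, name, handler, state):
        self.procs[name] = Process(name, handler, copy.deepcopy(state))

    def send(self, name, message):
        proc = self.procs.get(name)
        if proc is not None and proc.alive:
            # Copy on send: sender and receiver never hold the same object.
            proc.mailbox.append(copy.deepcopy(message))

    def run(self):
        """Deliver messages round-robin until every mailbox is empty."""
        progressed = True
        while progressed:
            progressed = False
            for proc in list(self.procs.values()):
                if proc.alive and proc.mailbox:
                    message = proc.mailbox.popleft()
                    try:
                        proc.state = proc.handler(proc.state, message)
                    except Exception as exc:
                        # Fail fast: the faulty process dies, the others keep running.
                        proc.alive = False
                        self.crashes.append((proc.name, repr(exc)))
                    progressed = True


def account(balance, message):
    """Example worker: adds each deposit to its private balance."""
    return balance + message["amount"]


def broken(state, message):
    """Example worker with a software error."""
    raise ValueError("unexpected message")
```

The `crashes` list is where a monitoring process would learn which process to restart; the restart policy itself is outside this sketch.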
2.4 Supporting highly concurrent systems
In our system, concurrency plays a central role, so central that we coined the term Concurrency Oriented Programming (COP) to distinguish this style of programming from other styles.
In concurrency oriented programming, the concurrent structure of the program should follow the concurrent structure of the application. This style is particularly suited to writing applications that model or interact with the real world.
Concurrency oriented programming also has the two major advantages of object-oriented programming: polymorphism, and the use of predefined protocols that give instances of different process types the same message-passing interface.
When we decompose a problem into many concurrent processes, we can arrange for all the processes to respond to the same messages (that is, polymorphism) and to follow the same message-passing interface. The word concurrency refers to sets of activities that happen simultaneously. The real world is concurrent and consists of countless activities happening at the same time. At the microscopic level, our bodies are made of atoms and molecules in simultaneous motion. At the macroscopic level, the universe is populated by galaxies in simultaneous motion.
When we do something simple, such as driving a car on a motorway, we are aware of the traffic speeding past us, yet we still accomplish the complex task of driving and avoid potential dangers without thinking about it. In the real world, sequential activities are very rare. When we walk down the street, we would be astonished to find only one thing happening; we expect to encounter many simultaneous activities.
If we could not analyse and predict the outcome of many simultaneous events, we would live in great danger, and tasks such as driving would be impossible. The fact that we can do things that require processing massive amounts of parallel information suggests that we are equipped with perceptual mechanisms that allow us to understand concurrency intuitively, without consciously thinking about it.
When it comes to computer programming, however, things suddenly become inverted. Programming a sequential chain of activities is regarded as the norm and, in some sense, as being easy, whereas programming a collection of concurrent activities is avoided as far as possible and is generally perceived as difficult. I believe this is due to the poor support that almost all conventional programming languages provide for real concurrency. The vast majority of programming languages are essentially sequential; any concurrency in them is provided by the underlying operating system, not by the programming language.
In this document I present a world in which concurrency is provided by the programming language and not by the underlying operating system. I call a language with good support for concurrency a Concurrency Oriented Language, or COPL for short.
2.4.1 Programming based on the real world
We often want to write programs that model the real world or interact with it. Writing such a program in a COPL is easy. First we perform an analysis, which has three steps:
1. Identify all the truly concurrent activities in the real-world activity;
2. Identify all the message channels between the concurrent activities;
3. Write down all the messages that can flow on the different message channels.
Then we write the program. The structure of the program should exactly follow the structure of the problem: each real-world concurrent activity is mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program, we say that the program is isomorphic to the problem. It is extremely important that the mapping is exactly 1:1, because this minimizes the conceptual gap between the problem and its solution. If the mapping is not 1:1, the program quickly degenerates and becomes difficult to understand. This degeneration is often observed when non-concurrent programming languages are used to solve concurrent problems. In a non-concurrent language, solving the problem often means forcing a single language-level thread or process to control several independent activities, which inevitably leads to a loss of clarity and makes programs prone to complex, hard-to-reproduce errors.
When analysing a problem we must also choose an appropriate granularity for the model. For example, when writing an instant messaging system, we use one process per user, not one process for every atom in the user's body.
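As a minimal sketch of the one-process-per-user granularity described above, the following Python fragment models each user of a hypothetical instant messaging system as one thread with a private mailbox. Python threads only approximate isolated processes, and the names UserProcess and run_chat are illustrative assumptions, not part of SODBS.

```python
import queue
import threading


class UserProcess(threading.Thread):
    """One concurrent activity per real-world user (a 1:1 mapping)."""

    def __init__(self, user):
        super().__init__(daemon=True)
        self.user = user
        self.mailbox = queue.Queue()  # the only channel into this process
        self.received = []

    def run(self):
        # Handle messages one at a time until asked to stop.
        while True:
            message = self.mailbox.get()
            if message == ("stop",):
                break
            kind, sender, text = message
            if kind == "chat":
                self.received.append((sender, text))


def run_chat():
    # Step 1: one process per user. Step 2: one mailbox per process.
    users = {name: UserProcess(name) for name in ("alice", "bob")}
    for process in users.values():
        process.start()
    # Step 3: the only message kinds are ("chat", sender, text) and ("stop",).
    users["bob"].mailbox.put(("chat", "alice", "hello"))
    users["alice"].mailbox.put(("chat", "bob", "hi"))
    for process in users.values():
        process.mailbox.put(("stop",))
        process.join()
    return {name: process.received for name, process in users.items()}
```

Calling run_chat() returns what each user received. Adding a third user means adding one more process, without changing the code of the existing ones.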
2.4.2 Characteristics of a COPL
A COPL can be characterized by the following six properties:
1. A COPL must support processes. Each process can be thought of as a self-contained virtual machine.
2. Processes running on the same machine must be strongly isolated. A fault in one process must not adversely affect another process unless such interaction is explicitly programmed.
3. Each process must be identified by a unique, unforgeable identifier. We call this the Pid of the process.
4. There is no shared state between processes. Processes interact only by message passing. If you know the Pid of a process, you can send it a message.
5. Message passing is assumed to be unreliable, with no guarantee of delivery.
6. It should be possible for one process to detect a failure in another process and to know the reason for the failure.
Note that the concurrency provided by a COPL must be true concurrency: objects represented as processes are truly concurrent, and messages between processes are true asynchronous messages, not remote procedure calls in disguise as in many object-oriented languages.
Note also that the reported reason for a failure is not always correct. For example, in a distributed system we may receive a message telling us that a process has died when in fact a network error has occurred.
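The six properties can be illustrated with a toy model. The sketch below is a single-threaded Python scheduler, so it does not provide true concurrency; it only shows Pids, copied messages, fire-and-forget sending, and failure notification with a reason. The class Runtime and its methods are hypothetical names chosen for this illustration and are not the SODBS implementation.

```python
import copy
import secrets
from collections import deque


class Runtime:
    """Toy single-threaded scheduler; hypothetical, not the SODBS runtime."""

    def __init__(self):
        self.mailboxes = {}  # Pid -> queue of pending messages
        self.handlers = {}   # Pid -> function called for each message
        self.watchers = {}   # Pid -> Pids to notify when it fails

    def spawn(self, handler):
        pid = secrets.token_hex(8)  # unique and hard to guess
        self.mailboxes[pid] = deque()
        self.handlers[pid] = handler
        self.watchers[pid] = []
        return pid

    def monitor(self, watcher_pid, target_pid):
        self.watchers[target_pid].append(watcher_pid)

    def send(self, pid, message):
        # No delivery guarantee: sending to a dead process is silently ignored.
        box = self.mailboxes.get(pid)
        if box is not None:
            box.append(copy.deepcopy(message))  # copy, never share

    def run(self):
        # Deliver messages until every mailbox is empty.
        busy = True
        while busy:
            busy = False
            for pid in list(self.mailboxes):
                box = self.mailboxes.get(pid)
                if not box:
                    continue
                busy = True
                message = box.popleft()
                try:
                    self.handlers[pid](self, pid, message)
                except Exception as exc:
                    self._fail(pid, type(exc).__name__)

    def _fail(self, pid, reason):
        # Remove the failed process and tell its watchers why it died.
        del self.mailboxes[pid]
        del self.handlers[pid]
        for watcher in self.watchers.pop(pid):
            self.send(watcher, ("DOWN", pid, reason))


def demo():
    rt = Runtime()
    seen = []  # test harness only: records what the observer receives
    worker = rt.spawn(lambda rt, me, n: 100 // n)  # fails on 0
    observer = rt.spawn(lambda rt, me, m: seen.append(m))
    rt.monitor(observer, worker)
    rt.send(worker, 5)
    rt.send(worker, 0)
    rt.run()
    return worker, seen
```

In demo(), the worker fails on its second message and the observer receives a ("DOWN", pid, reason) message. The observer keeps running, which is the isolation property in miniature.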
2.4.3 Process isolation
The central notion in understanding COP and in building fault-tolerant software is process isolation. Two processes running on the same machine should be as independent as if they ran on two physically separate machines.
The ideal architecture would of course assign a dedicated processor to every process of a concurrency oriented program. Until that ideal becomes reality, we have to accept that multiple processes will run on the same machine. We should nevertheless still think of all processes as running on physically separate machines.
Process isolation has many benefits:
1. Processes have "share nothing" semantics. This is obvious, since they are regarded as running on physically separate machines.
2. Message passing is the only way to pass data between processes. Since nothing is shared, this is the only possible way to exchange data.
3. Isolation implies that message passing must be asynchronous. If process communication were synchronous, a software error in the receiver of a message could block the sender indefinitely, destroying the property of isolation.
4. Since nothing is shared, all data needed for a distributed computation must be copied. Since nothing is shared and processes can interact only by message passing, we cannot know whether or when a message reaches its recipient (recall that message passing is inherently unreliable). The only way to know that a message has been correctly delivered is for the recipient to send a confirmation message back.
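The confirmation round trip in point 4 can be sketched as follows. This is a simplified Python model under stated assumptions: LossyChannel is a hypothetical stand-in for an unreliable network that loses a fixed number of sends, and a return value of None stands in for "no acknowledgement arrived before a timeout".

```python
class LossyChannel:
    """Hypothetical unreliable link that silently loses the first few sends."""

    def __init__(self, drop_first):
        self.drop_first = drop_first
        self.delivered = []

    def send(self, message):
        if self.drop_first > 0:
            self.drop_first -= 1
            return None  # lost: the sender learns nothing
        self.delivered.append(message)
        return ("ack", message["id"])  # confirmation sent back by the receiver


def send_with_ack(channel, message, max_attempts=5):
    """Resend until an acknowledgement arrives; return the attempt count or None."""
    for attempt in range(1, max_attempts + 1):
        reply = channel.send(dict(message))  # send a copy, never a reference
        if reply == ("ack", message["id"]):
            return attempt
    return None  # the sender still cannot tell whether the receiver is alive
```

With LossyChannel(2), send_with_ack succeeds on the third attempt. If only the acknowledgement were lost, retransmission would deliver the message more than once, so a real receiver would also need to discard repeated message ids.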
At first sight, writing a multi-process system that satisfies the above rules appears very difficult. After all, most concurrent extensions to sequential programming languages provide almost exactly the opposite facilities, such as locks, semaphores, protection of shared data, and reliable message passing. Fortunately, the opposite turns out to be true: writing such systems is surprisingly easy, and the resulting programs can be made scalable and fault-tolerant with very little effort.
Because all processes are required to be completely independent, adding new processes does not affect the original system. Because the software as a whole is simply a collection of independent processes, it can accommodate more processors without major changes to the application software.
Because we make no assumptions about the reliability of message passing, the applications we write must keep working when message passing is unreliable and when message delivery fails. This effort is rewarded when we need to scale the system up.
2.4.4 Names of processes
We require that the names of processes are unforgeable. This means that it is impossible to guess the name of a process and thereby interact with it. We assume that all processes know their own names and the names of the processes they have created. In other words, a parent process knows the names of its child processes.
To program in a COPL, we need a mechanism for finding out the names of the processes involved. Once we know the name of a process, we can send it a message.
System security is intimately connected with how process names are obtained. If nobody else knows the name of a process, there is no way to interact with it, and the system is secure. Once the names of processes become widely known, the system becomes less secure. We call the process of revealing names to other processes in a controlled manner the name distribution problem; the key to system security lies in the name distribution problem. When we reveal a Pid to another process, we say that we have published the name of that process. If the name of a process is never published, no security problem arises.
Obtaining the name of a process is therefore the key element of security. Since process names are unforgeable, the system is secure as long as we can limit knowledge of process names to trusted processes.
In many ancient religions it was believed that humans could command spirits by using their true names, thereby gaining power over them. Once the true name of a spirit was known, one had power over it and could command it to do many things. COPLs use the same idea.
2.4.5 Message passing
Message passing obeys the following rules:
1. Message passing is assumed to be atomic, which means that a message is either delivered in its entirety or not at all.
2. Message passing between a pair of processes is assumed to be ordered: if a sequence of messages is sent and received between any pair of processes, the messages are received in the same order in which they were sent.
3. Messages must not contain pointers to data structures inside processes; they may contain only constants and/or Pids.
Note that rule 2 is a design decision and does not reflect the underlying semantics of the network used to transmit messages. The underlying network might reorder messages, but between any pair of processes the messages can be buffered and reassembled into the correct order before delivery. This assumption makes message-passing applications much easier to write than if messages were allowed to arrive in any order.
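The buffering and reassembly mentioned for rule 2 can be sketched with per-sender sequence numbers. The class OrderedInbox below is a hypothetical illustration in Python; the text does not say how SODBS numbers or buffers messages, so the sequence-number scheme is an assumption.

```python
class OrderedInbox:
    """Reorder buffer for messages from one sender (illustrative only)."""

    def __init__(self):
        self.next_seq = 0   # sequence number the application expects next
        self.pending = {}   # out-of-order messages held back, keyed by seq

    def receive(self, seq, payload):
        """Accept one message from the network; return those released in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        released = []
        # Release the longest run of consecutive messages now available.
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

If message 1 arrives before message 0, it is held back; once message 0 arrives, both are released to the application in sending order.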
We say that this kind of message passing has send-and-pray semantics: we send a message and pray that it arrives. Confirmation that a message has been delivered is obtained when the other party returns an acknowledgement message (sometimes called a round-trip confirmation).
Message passing is also used for synchronisation. Suppose we wish to synchronise two processes A and B. If A sends a message to B, then B can only receive this message at some point in time after A has sent it. This is known as causal ordering in distributed systems theory. In a COPL, all inter-process synchronisation is based on this simple idea.
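A minimal sketch of synchronisation by message, assuming Python threads as stand-ins for processes A and B and a queue as the message channel: B cannot proceed until the message from A has been sent, so everything A did before sending is ordered before everything B does after receiving. The shared events list exists only so the demonstration can record the order.

```python
import queue
import threading


def synchronise():
    events = []              # demonstration log only
    channel = queue.Queue()  # message channel from A to B

    def process_b():
        channel.get()        # blocks until A has sent its message
        events.append("B continues")

    b = threading.Thread(target=process_b)
    b.start()
    events.append("A prepares")  # happens before the send
    channel.put("go")            # the send is the synchronisation point
    b.join()
    return events
```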
2.4.6 Protocols
Isolating components and letting them interact only by message passing is architecturally sufficient to protect the system from the consequences of errors. It is not sufficient, however, to specify the behaviour of the system, nor, when some error has occurred, to determine which component failed.
So far we have assumed that failure is a property of a single component: a component either works correctly or it fails. It can happen, however, that no component is observed to fail and yet the system does not work as expected.
To improve our model we add something new. Not only do we require components that are completely isolated and interact only by message passing, we also need to specify the protocols used for communication between components.
Once the communication protocol between two components is specified, it is easy to identify which of them has violated it. Compliance with the protocol can be guaranteed by static analysis of the program where possible; otherwise, run-time checks can be compiled into the generated code so that errors are still reported when static analysis fails.
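One way to picture a run-time protocol check is as a table of permitted transitions. The sketch below is in Python; the payment-session protocol, its state names, and its message names are invented for illustration and do not come from the source. The checker walks a recorded exchange and reports the first party that sends a message the protocol does not allow.

```python
# (current state, sender, message) -> next state; a hypothetical protocol.
PROTOCOL = {
    ("idle", "client", "login"): "authenticating",
    ("authenticating", "server", "ok"): "ready",
    ("authenticating", "server", "denied"): "idle",
    ("ready", "client", "pay"): "paying",
    ("paying", "server", "receipt"): "ready",
    ("ready", "client", "logout"): "idle",
}


def check_trace(trace, state="idle"):
    """Return None if the trace obeys PROTOCOL, else (index, offending sender)."""
    for index, (sender, message) in enumerate(trace):
        next_state = PROTOCOL.get((state, sender, message))
        if next_state is None:
            return (index, sender)  # this party broke the protocol
        state = next_state
    return None
```

For example, check_trace([("client", "login"), ("server", "receipt")]) reports the server at index 1, because a receipt is not a valid reply to a login.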
2.4.7 COP and programmer teams
Building a large software system requires the combined effort of many programmers, sometimes several hundred. To coordinate the work of so many people, programmers are usually organized into smaller development groups or teams, each responsible for one or more logical components of the system. Day to day, the groups communicate by message passing (e-mail or telephone) and need not meet frequently. In some cases the groups are located in different countries and never meet. We find it notable that the organization of a software system into independent components that communicate by pure message passing is also the way large software development groups are themselves organized.
2.5 System requirements
To support the concurrency oriented style of programming, and to build software that meets the needs of financial systems, we propose a set of requirements on the basic characteristics of the system. These requirements apply to the system as a whole; I am not concerned here with whether they are satisfied by the programming language or by the libraries and construction methods that accompany it.
There are six essential requirements on the underlying operating system and programming language.
R1. Concurrency---our system must support concurrency. The computational cost of creating or destroying a concurrent process must be very low, and creating a large number of concurrent processes must not cause problems.
R2. Error encapsulation---errors occurring in one process must not damage other processes in the system.
R3. Fault detection---it must be possible to detect both local exceptions (occurring in the local process) and remote exceptions (occurring in a non-local process).
R4. Fault identification---we must be able to identify why an exception occurred.
R5. Code upgrade---there must be a mechanism for replacing code as it executes, without stopping the system.
R6. Persistent storage---we need to store data according to some policy so that a crashed system can be recovered.
It is also important that these requirements are satisfied efficiently. Concurrency is of little use if we cannot reliably create hundreds of thousands of processes, and fault identification is of little use if the fault report does not contain enough information to allow the fault to be corrected later.
The requirements can be satisfied in many different ways. Concurrency, for example, can be provided as a language primitive or by the operating system (for example, Unix). Languages such as C and Java are not themselves concurrency oriented, but they can use operating-system primitives that give the user the impression of concurrency. Indeed, concurrent programs can be written in languages that have no concurrency of their own.
2.6 Language requirements
A programming language used to write concurrent systems must provide:
Encapsulation primitives---the language must offer a number of mechanisms for limiting the spread of errors. It must be possible to isolate a process so that it cannot damage other processes.
Concurrency---the language must provide a lightweight mechanism for creating concurrent processes and for sending messages between them. Context switching between processes and message passing must be very efficient. Concurrent processes must also share CPU time in a reasonable manner, so that the process currently using the CPU cannot monopolize it while other processes that are ready to run go unserved.
Error detection primitives---the language must allow one process to observe another process and to detect whether the observed process has terminated for any reason.
Location transparency---if we know the Pid of a process, we should be able to send it a message, regardless of whether it is local or remote.
Dynamic code upgrade---it must be possible to replace code dynamically in a running system. Since many processes may be running the same code at the same time, we need a mechanism that allows existing processes to continue running the "old" code while "new" processes run the modified code.
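The coexistence of "old" and "new" code can be sketched as a versioned code table. In this Python illustration, which is an assumption about one possible mechanism and not the SODBS design, a process binds to the version that is current when it is created, so loading a new version never disturbs processes that are already running. CodeServer and Process are hypothetical names.

```python
class CodeServer:
    """Keeps every loaded version of each named piece of code."""

    def __init__(self):
        self.versions = {}  # name -> list of functions, oldest first

    def load(self, name, func):
        # Loading never stops the system; it only appends a new version.
        self.versions.setdefault(name, []).append(func)

    def current(self, name):
        return self.versions[name][-1]


class Process:
    """A process that keeps the code version it was started with."""

    def __init__(self, code_server, name):
        self.code = code_server.current(name)  # bound once, at creation

    def handle(self, amount):
        return self.code(amount)


def demo():
    server = CodeServer()
    server.load("fee", lambda amount: amount * 1 // 100)  # version 1: 1% fee
    old_process = Process(server, "fee")
    server.load("fee", lambda amount: amount * 2 // 100)  # version 2: 2% fee
    new_process = Process(server, "fee")
    return old_process.handle(1000), new_process.handle(1000)
```

In demo(), the process started before the upgrade still charges the old fee while the one started afterwards uses the new code, and neither was stopped.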
These language requirements must not only be satisfied, they must be satisfied in a reasonably efficient manner. When we program, we do not want our freedom of expression to be limited by, say, the number of processes, nor do we want to worry about what happens if a process tries to monopolize the CPU.
The maximum number of processes in the system should be large enough that we never need to treat it as a limiting factor when programming. For example, to build a payment system that handles 10,000 concurrent user sessions, we may need to create as many as 100,000 processes.
The characteristics above are necessary to simplify the writing of applications. If we can map the concurrent structure of the problem 1:1 onto the process structure of the application that solves it, we greatly simplify the mapping of a set of semantically interacting distributed components onto the program.
2.7 Library requirements
The language is not everything; many things are provided by the system libraries that we develop. The library routines must provide:
Persistent storage---it stores the information used for fault recovery.
Device drivers---these programs provide a mechanism for interacting with the outside world.
Code upgrade---it allows us to upgrade code in a running system.
Runtime infrastructure---it handles starting and stopping the system and reporting errors.
Note that although our library routines are written in C, the services they provide are all services that an operating system could readily provide.
Because SODBS processes are isolated from one another and communicate only by message passing, they behave very much like operating-system processes, which communicate through pipes and sockets.
Many features that an operating system would conventionally provide have been moved into the programming language, so the operating system only needs to provide a set of primitives for device drivers.
2.8 Application libraries
Features such as persistent storage are not provided as language primitives of Sodbsng but by the basic SODBS libraries. These basic libraries are a precondition for building complex application software. More complex applications need abstractions at a higher level than, for example, persistent storage. To build such applications we need ready-made software entities that help us write programs such as client-server applications.
The SODBS libraries provide us with a complete library of design patterns (which we call behaviours) for building fault-tolerant systems. In this document I introduce a minimal set of behaviours with which fault-tolerant application software can be built. They are:
Supervisor---a behaviour that implements the supervision model.
Sodbs_server---a behaviour for implementing client-server applications.
Sodbs_event---a behaviour for implementing event-handling applications.
Sodbs_fsm---a behaviour for implementing finite state machines.
Of these library components, the central one for writing fault-tolerant application software is the supervision model.
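The supervision model can be pictured with a small sketch. The Python class below is a simplified, hypothetical stand-in and not the SODBS Supervisor behaviour: it owns one child, forwards requests to it, and when the child raises an error it records the reason and replaces the child with a freshly started one, up to a restart limit.

```python
class Counter:
    """Example child: fails fast on bad input instead of guessing."""

    def __init__(self):
        self.total = 0

    def handle(self, amount):
        if amount < 0:
            raise ValueError("negative amount")
        self.total += amount
        return self.total


class Supervisor:
    """Restarts its child when the child fails (illustrative only)."""

    def __init__(self, start_child, max_restarts=3):
        self.start_child = start_child    # factory that builds a fresh child
        self.max_restarts = max_restarts
        self.restart_reasons = []
        self.child = start_child()

    def request(self, message):
        try:
            return ("ok", self.child.handle(message))
        except Exception as exc:
            reason = type(exc).__name__
            self.restart_reasons.append(reason)
            if len(self.restart_reasons) > self.max_restarts:
                # Too many failures: give up and let a higher level decide.
                raise RuntimeError("restart limit exceeded") from exc
            self.child = self.start_child()  # discard the failed child's state
            return ("restarted", reason)
```

After a failure the child starts again from a clean state, so the error handling lives in the supervisor and the child code stays free of recovery logic.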
2.9 Related work
The inability to isolate software components from one another is the main reason why many popular programming languages cannot be used to build robust software.
The essence of security is the ability to isolate mutually distrusting programs and to protect the underlying platform from damage by those programs. Isolation is difficult in object-oriented systems because objects can easily become aliased.
Object aliasing is difficult, if not impossible, to detect in practice, and the recommended solution is to use protection domains (similar to operating-system processes).
The only safe way to run several applications written in Java on the same computer is to start a separate JVM for each application, with each JVM running in its own OS process. This in turn greatly reduces the efficiency of resource utilization and degrades performance, scalability, and application start-up time. The benefits offered by the Java language are then reduced mainly to portability and improved programmer productivity. These are important, but the full security potential offered by the language is not realized. In fact there is a curious difference between "language security" and "real security".
The JVM should become an OS-like execution environment. In particular, the process abstraction offered by modern operating systems is the model to follow in terms of features: isolation between computations, accounting and control of resources, and ease of termination and resource reclamation.
To achieve this:
Tasks cannot share objects directly, and the only way for tasks to communicate is to use standard, copying communication mechanisms.
As with hardware systems, the key to software fault tolerance is to decompose a large system hierarchically into modules, where each module is both a unit of service and a unit of failure, and a failure of one module does not propagate beyond the module.
For a process to be fault-tolerant, it must not share state with other processes; its only connection to other processes is the messages carried by a kernel message system. A fundamental design decision in the implementation of RIG was the use of a strict message discipline with no shared data structures. All communication between users and servers is routed through the Aleph kernel. This message discipline has proved to be very flexible and reliable. Setting language aside for the moment, let us consider which properties an individual process should possess.
A hardware system suitable for building fault-tolerant systems should have three properties. These properties are:
1. Halt on failure---when a processor fails, it should halt immediately instead of continuing to perform possibly incorrect operations.
2. Failure status property---when a processor fails, the other processors in the system must be informed, and the reason for the failure must be communicated.
3. Stable storage property---the storage of a processor should be partitioned into stable storage (which survives a processor crash) and volatile storage (which is lost when the processor crashes).
A processor with these properties is called a fail-stop processor. The idea is that once an error has occurred there is no point in continuing. The failing processor should halt so that it cannot cause further damage by continuing to execute. In a fail-stop processor, state is held in either volatile or stable storage. When the processor crashes, all data in volatile storage is lost, while all data in stable storage remains available after the crash.
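The split between stable and volatile storage can be illustrated in software. In the Python sketch below, which is an analogy under stated assumptions and not a model of real fail-stop hardware, a JSON file plays the role of stable storage and an in-memory list plays the role of volatile storage. FailStopCounter is a hypothetical name.

```python
import json
import os
import tempfile


class FailStopCounter:
    def __init__(self, path):
        self.path = path               # stable storage: a file on disk
        self.pending = []              # volatile storage: lost on a crash
        self.total = self._recover()

    def _recover(self):
        # After a restart, only what reached stable storage is available.
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            return json.load(f)["total"]

    def add(self, amount):
        if not isinstance(amount, int) or amount < 0:
            # Halt on failure: refuse to continue with a bad input.
            raise ValueError("invalid amount: %r" % (amount,))
        self.pending.append(amount)

    def commit(self):
        self.total += sum(self.pending)
        self.pending = []
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"total": self.total}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)     # swap in the new file in one step


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "counter.json")
        first = FailStopCounter(path)
        first.add(40)
        first.commit()
        first.add(2)                   # never committed
        try:
            first.add(-1)              # the process fails and stops here
        except ValueError:
            del first                  # crash: volatile state is gone
        restarted = FailStopCounter(path)
        return restarted.total, restarted.pending
```

In demo(), the committed value 40 survives the crash, while the uncommitted 2 is lost together with the rest of the volatile state.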
The idea of achieving fault isolation through processes holds that every process should be fail-fast: either it functions correctly, or it detects the fault, reports it, and stops.
Processes are made fail-fast by defensive programming. They routinely check all their input parameters, intermediate results, and data structures. As soon as an error is detected, they report it and stop immediately. In other words, fail-fast software has a very short detection latency.
These two ideas are essentially the same; one speaks of hardware and the other of software, but the underlying principle is identical. It is extremely important that a process stops as soon as possible when an uncorrectable error occurs:
An error in a software system can cause one or more further errors. The longer the interval between the occurrence of a fault and its detection (the latency), the greater the cost, because the backtracking analysis of the fault becomes more complicated ... To handle errors effectively, we should detect them as early as possible and stop.
Combining these suggestions with our original requirements, we decided that our system should have the following properties:
1. Processes are the units of error encapsulation---errors occurring in one process do not affect the other processes in the system. We call this property strong isolation.
2. A process either behaves as it is supposed to or fails as soon as possible.
3. Failure, and the reason for failure, can be detected by other processes.
4. Processes share no state and communicate only by message passing.
A programming language or platform that has these properties also needs to meet some necessary preconditions before it can be used to build fault-tolerant software systems. We will see how these properties are satisfied in SODBS and its programming libraries.
Although compiler checking and the exception handling provided by programming languages are certainly useful, historically people seem to have preferred run-time checks combined with processes as the means of fault containment. This approach has the advantage of simplicity: if a process or its processor misbehaves, stop it. In this approach the process serves as a clean unit of modularity, service, fault containment, and failure.
Faults are contained within fail-fast software modules.
A process achieves fault tolerance by sharing no state with other processes; its only contact with other processes is through messages carried by the kernel message system.
If we compare these views with our current SODBS system, we find many striking similarities. There are also some differences. In SODBS the "defensive programming" style is not recommended, because the compiler adds the necessary checks and makes this style of programming unnecessary. The "transaction mechanism" is provided by the sodbs_db database. The containment and handling of errors are carried out by the "supervision tree" behaviours in the SODBS libraries.
The idea of a "fail-fast module" corresponds to our programming guidelines, which state that a process should behave exactly as we expect or else terminate. The supervision hierarchy in our system corresponds to the hierarchy of modules. Software components should be allowed to crash and then be restarted; this simplifies the fault model and helps to ensure the reliability of the code.
Recent work on object-oriented systems has also increasingly recognized the importance of isolating software components from one another.
3 SODBS
This section briefly introduces the design philosophy of SODBS. The development tools of SODBS belong to the class of message-oriented languages, which provide concurrency in the form of concurrent processes. In a message-oriented language there are no shared objects; instead, processes interact by sending and receiving messages.
3.1 Overview
The SODBS view of the world can be summarized in the following statements:
Everything is a process.
Processes are strongly isolated.
Process creation and destruction are lightweight operations.
Message passing is the only way for processes to interact.
Each process has a unique name.
If you know the name of a process, you can send it a message.
Processes share no resources.
Error handling is non-local.
Processes either do what they are supposed to do or fail quickly.
Processes were chosen as the basic unit of abstraction because the goal was a language suited to writing large fault-tolerant software systems. The fundamental problem in writing such software is limiting the propagation of errors, and the process abstraction provides exactly the kind of boundary that stops errors from propagating.
Java, for example, offers little help in limiting error propagation, and is therefore poorly suited to writing "safe" applications.
If processes are truly isolated (as they must be in order to limit error propagation), then the other properties of processes, for example that they can interact only by message passing, follow naturally from that isolation. The view on error handling is perhaps less obvious. To build a fault-tolerant system we need at least two physically independent computers. One computer is not enough: once it crashes, everything is lost. The simplest fault-tolerant system we can imagine consists of two computers; if one crashes, the other takes over all of its work. Even in this simplest case the fault-recovery software must be non-local: the fault occurs on the first computer and is corrected by software running on the second.
The world view of SODBS is that "everything is a process"; once we also model real computers as processes, we arrive at the idea that error handling should be non-local. In practice this is a modified truth: remote error handling takes place only when the local attempt to repair the error fails. If an exception occurs, a local process should detect it and correct the fault it causes, in which case no other process in the system notices that an exception occurred.
Viewed as a concurrent language, SODBS is very simple. Because there are no shared data structures, monitors, or synchronized methods, there is little to learn. The main part of the language, and perhaps the least interesting part, is its sequential subset. This sequential subset can be characterized as a dynamically typed, strict functional language that is largely free of side effects. A few operations in the sequential subset do have side effects, but in practice they are seldom needed.
3.2 Concurrent programming
In SODBS, a concurrent process is created by calling the spawn primitive, in an expression such as:
Pid=spawn(F)
Here F is a function of arity 0. The expression creates a concurrent process that evaluates F. spawn returns a process identifier (Pid), through which the process can be addressed.
The statement "Pid ! Msg" sends the message Msg to the process Pid. Messages are received with the receive primitive, whose syntax is as follows:
receive
    Msg1 [when Guard1] ->
        Expr_seq1;
    Msg2 [when Guard2] ->
        Expr_seq2;
    ...
    MsgN [when GuardN] ->
        Expr_seqN;
    ...
    [; after TimeOutTime ->
        Timeout_Expr_seq]
end
Msg1 ... MsgN are patterns, and a pattern may be followed by a guard. When a message is sent to a process, it is placed in the mailbox belonging to that process. The next time the process evaluates a receive statement, the system examines the mailbox and tries to match the first message in the mailbox against all the patterns in the current receive statement. If the message matches none of the patterns, it is moved to a temporary "save" queue, and the process is suspended, waiting for the next message. If a message matches and the guard of the matching pattern evaluates to true, the expression sequence following that pattern is evaluated in order. At the same time, all messages held in the save queue are put back into the mailbox of the process.
The receive statement may have an optional timeout. If no matching message is received within the timeout period, the expression sequence following the timeout clause is evaluated.
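The mailbox and save-queue behaviour described above can be sketched in Python, used here purely for illustration. The names Mailbox, send and receive are hypothetical and not part of SODBS. The sketch assumes that a pattern with its guard is modelled as a predicate, that unmatched messages are set aside while later messages are tried, and that an exhausted mailbox stands in for the timeout elapsing.

```python
from collections import deque


class Mailbox:
    """Toy model of a process mailbox with selective receive."""

    def __init__(self):
        self.queue = deque()

    def send(self, msg):
        self.queue.append(msg)

    def receive(self, clauses, timeout_value=None):
        """clauses is a list of (matches, handler) pairs.

        Returns handler(msg) for the first message that some clause matches.
        Unmatched messages are saved and put back afterwards.
        """
        saved = []
        try:
            while self.queue:
                msg = self.queue.popleft()
                for matches, handler in clauses:
                    if matches(msg):
                        return handler(msg)
                saved.append(msg)  # no clause matched: move to the save queue
            # A real process would suspend here; the sketch treats an
            # exhausted mailbox as the timeout clause firing.
            return timeout_value
        finally:
            # Saved messages return to the front of the mailbox, in order.
            self.queue.extendleft(reversed(saved))


mb = Mailbox()
mb.send(("log", "x"))
mb.send(("get", 1))
first_get = mb.receive([(lambda m: m[0] == "get", lambda m: m[1])])
```

After the call, the unmatched ("log", "x") message is still in the mailbox and can be picked up by a later receive with a different set of patterns.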
3.4 Registering process names
To send a message to a process we need to know its name. This is very secure, but having to obtain the name of a process before a message can be sent to it is somewhat inconvenient.
The expression:
register(Name,Pid)
registers a global name for a process, associating the atom Name with the process identifier Pid. Messages can then be sent to the process Pid by evaluating "Name ! Msg".
3.5 Error handling
In SODBS, evaluating a function has exactly two possible outcomes: either the function returns a value, or it raises an exception.
Exceptions can be raised implicitly (that is, by the SODBS runtime system) or explicitly by calling the exit(X) primitive. The following is an example of an implicitly raised exception. Suppose we write the function:
factorial(0)->1;
factorial(N)->N*factorial(N-1).
Evaluating factorial(10) returns the value 3628800, but evaluating factorial(abc) raises the exception {'EXIT', {badarith, ...}}. An exception causes the program to stop what it was doing and do something else, which is why it is called an exception.
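The "value or exception" rule can be illustrated with a rough Python analogue. This is not SODBS code: evaluate is a hypothetical helper, and Python reports a TypeError where the text above reports badarith.

```python
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)


def evaluate(fun, *args):
    """Return ("value", V) if fun returns V, or ("EXIT", reason) if it raises."""
    try:
        return ("value", fun(*args))
    except Exception as exc:
        return ("EXIT", type(exc).__name__)


ok = evaluate(factorial, 10)
bad = evaluate(factorial, "abc")  # "abc" - 1 is not defined, so this raises
```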
3.5.1 Exceptions
An exception is an abnormal condition detected by the SODBS runtime system. SODBS programs are compiled to virtual machine instructions, which are executed by a virtual machine emulator that is part of the SODBS runtime system. Whenever the emulator detects a condition it cannot handle, it raises an exception. There are six types of exception:
1. Value errors: errors such as "divide by zero". The type of the argument passed to the function is correct, but its value is wrong.
2. Type errors: these occur when a built-in function of SODBS is called with an argument of the wrong type. For example, the built-in function atom_to_list(A) converts the atom A into the list of integers of its ASCII character codes. If the variable A is not an atom, the runtime system raises an exception.
3. Pattern-matching errors: these occur when an attempt is made to match a data structure against a number of patterns and no matching pattern is found. This can happen when matching function heads, or when matching in a case, receive, or if statement.
4. Explicit exits: these are generated explicitly by evaluating exit(Why), which raises the exception Why.
5. Error propagation: if a process receives an exit signal, it may choose to terminate itself and propagate the exit signal to all the processes it is linked to.
6. System exceptions: the runtime system may terminate a process because memory is exhausted or because it detects an inconsistency in an internal table. Such errors are outside the programmer's control.
3.5.2 Process links and monitors
When a process dies, we want other processes to be notified. Recall that we said we need this in order to write a fault-tolerant system. There are two ways of doing this: process links and process monitors.
A process link is a way of grouping processes together: if an error occurs in any one of the linked processes, all the other processes in the group are killed.
A process monitor allows a single process to monitor other processes in the system.
Process links
The catch primitive is used to trap errors that occur within a process. We now ask: what happens if the top-level catch of a program does not manage to correct the error it detects?
The answer is that the process terminates.
The reason for the failure is the argument of the exception. When a process fails, the reason is broadcast to all the other processes in its so-called "link set". A process A can add B to its link set by calling the built-in function link(B). Links are symmetric: if A is linked to B, then B is also linked to A.
Links can also be created when a process is created. If A creates process B by calling:
B=spawn_link(fun()->...end),
then process B is linked to process A at creation. This call is semantically equivalent to calling spawn immediately followed by link, except that the two expressions are executed atomically rather than in sequence. The spawn_link primitive was introduced to avoid the rare programming error in which a process dies during creation, before the link statement has been executed.
If process P dies with an uncaught exception {'EXIT', Why}, the exit signal {'EXIT', P, Why} is sent to every process in the link set of P. We have just mentioned "signals": a signal is something that is passed between processes when a process terminates.
A signal is a tuple of the form {'EXIT', P, Why}, where P is the Pid of the terminating process and Why is a term describing the reason for termination.
Any process that receives an exit signal whose reason Why is not normal will die. There is one exception to this rule: if the receiving process is a system process, it does not die; instead the exit signal is converted into an ordinary inter-process message and added to the mailbox of the process. A normal process becomes a system process by calling the built-in function process_flag(trap_exit, true).
A typical code fragment in which a system process handles the failures of other processes is as follows:
start() -> spawn(fun go/0).
go() ->
    process_flag(trap_exit, true),
    loop().
loop() ->
    receive
        {'EXIT', P, Why} ->
            ... handle the error ...
    end.
One further primitive, exit/2, completes the picture. exit(Pid, Why) sends an exit signal with reason Why to the process Pid. The process calling exit/2 does not itself terminate, so this signal can be used to "fake" the death of a process.
There is, however, one exception to the rule that a system process converts all signals into messages: calling exit(P, kill) sends P an unstoppable exit signal, and on receiving it P terminates unconditionally. This use of exit/2 is useful only when a process has refused a polite request to terminate of its own accord.
Process links are useful for setting up groups of processes in which, if one process fails, all the processes die. Usually we link together the processes that belong to one application and let one of them act as a "supervisor". The supervisor is set to trap exit signals. If any process in the group fails, all the other processes in the group except the supervisor die, and the supervisor receives messages describing why the processes in the group failed.
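The propagation of exit signals through a link set can be sketched as follows. This is a minimal Python model for illustration only: Proc, link and terminate are hypothetical names, the untrappable kill signal is left out, and a process with trap_exit set plays the role of the system process described above.

```python
class Proc:
    """Toy process: a name, a link set, a mailbox and a trap_exit flag."""

    def __init__(self, name, trap_exit=False):
        self.name = name
        self.trap_exit = trap_exit  # True models a "system process"
        self.links = set()
        self.mailbox = []
        self.alive = True


def link(a, b):
    """Links are symmetric: each process joins the other's link set."""
    a.links.add(b)
    b.links.add(a)


def terminate(proc, why):
    """Stop proc and send the exit signal to every process in its link set."""
    if not proc.alive:
        return
    proc.alive = False
    for other in list(proc.links):
        other.links.discard(proc)
        if not other.alive:
            continue
        if other.trap_exit:
            # System process: the signal becomes an ordinary message.
            other.mailbox.append(("EXIT", proc.name, why))
        elif why != "normal":
            # Ordinary process: it dies and propagates the same reason.
            terminate(other, why)


sup = Proc("sup", trap_exit=True)
w1, w2 = Proc("w1"), Proc("w2")
for a, b in [(sup, w1), (sup, w2), (w1, w2)]:
    link(a, b)
terminate(w1, "badarith")
```

In this run both workers end up dead, while the supervisor stays alive and finds one 'EXIT' message per worker in its mailbox.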
Process monitors
Process links are useful for whole groups of processes, but not for monitoring asymmetric pairs of processes. In the typical client/server model, the relationship between client and server is asymmetric as far as error handling is concerned. Suppose a server handles a large number of long-lived sessions with many different clients: if the server crashes we may want to kill all the clients, but we do not want to kill the server when a client crashes.
The SODBS:monitor/2 primitive is used to set up a monitor. If process A evaluates:
Ref=SODBS:monitor(process,B)
then when B dies with reason Why, a message of the following form is sent to A:
{'DOWN',Ref,process,B,Why}
Neither A, which receives the monitor message, nor B, the monitored process, needs to be a system process.
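The asymmetry of a monitor, compared with a link, can be shown with a small Python sketch. The names Proc, monitor and terminate are hypothetical, and references are modelled as integers from a counter; this is an illustration of the idea rather than the SODBS primitive itself.

```python
import itertools

_next_ref = itertools.count(1)


class Proc:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.mailbox = []
        self.watchers = []  # (ref, watcher) pairs registered by monitor()


def monitor(watcher, target):
    """watcher starts monitoring target; returns a fresh reference."""
    ref = next(_next_ref)
    target.watchers.append((ref, watcher))
    return ref


def terminate(proc, why):
    """proc dies; each live watcher gets a 'DOWN' message and keeps running."""
    proc.alive = False
    for ref, watcher in proc.watchers:
        if watcher.alive:
            watcher.mailbox.append(("DOWN", ref, "process", proc.name, why))


server, client = Proc("server"), Proc("client")
ref = monitor(server, client)
terminate(client, "crashed")
```

The server learns that the client died but is not itself affected, and nothing is delivered in the opposite direction.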
3.6 Distributed processing
SODBS programs can easily be ported from a uniprocessor platform to a multiprocessor platform. Each complete, self-contained SODBS system is called a node. One or more SODBS nodes can run on a single host operating system. The ability to run several SODBS nodes on the same operating system simplifies the testing of distributed applications: a distributed application can be developed and tested with all nodes running on the same processor. When the application is deployed, the nodes that ran on one processor can be moved to different processors on a distributed network. Apart from timing, all operations should behave exactly as they did when all the nodes ran on the same processor.
Distributed processing requires the following two primitives:
spawn(Node, Fun): spawns a process that evaluates the function Fun on the remote node Node.
monitor(Node): monitors the behaviour of an entire node.
monitor here is similar to link; the difference is that the object being monitored is an entire node rather than a single process.
3.7 Ports
Ports provide a mechanism by which SODBS programs communicate with the outside world. A port is created by calling the built-in function open_port/2. Each port has an associated "controlling process", which is said to own the port. All messages received from the port are sent to its controlling process, and only the controlling process can send messages to the port.
The controlling process of a port is initially the process that created the port, but this can be changed. If P is a port and Con is the Pid of its controlling process, the port can be made to do something by evaluating:
P!{Con,Command}
Here Command can take one of three possible values:
{command, Data}: sends the data Data to the external object through the port. Data must be an io-list. The io-list is flattened, and all its data elements are sent to the external application.
close: closes the port. The closed port must reply to the controlling process with the message {P, closed}.
{connect, Pid1}: changes the controlling process of the port to Pid1. The port must send the message {Port, connected} to the original controlling process; all messages subsequently received by the port are sent to the new controlling process.
All data received from the external application through the port is sent to the controlling process in messages of the form {Port, {data, D}}.
The exact format of the messages, and how they are framed, depends on how the port was created.
3.8 Dynamic code replacement
SODBS supports a simple mechanism for dynamic code replacement. On a running SODBS node, all processes share the same code. We must therefore consider what happens if we replace the code in a running system.
In a sequential programming language there is only one thread of control, so when code is replaced dynamically we need only consider the effect on that single thread. In a sequential system, the usual practice when changing code is to stop the system, replace the code, and restart the program. In a real-time control system, however, we usually do not want to stop the system in order to replace code. In certain real-time control systems we may never be allowed to turn the system off to replace code, so these systems must be designed so that code can be replaced without stopping them. One example of such a system is the X2000 satellite control system designed by NASA.
The SODBS system allows two versions of the code of each module to exist. When the code of a module is loaded, all newly started processes that call code in the module are dynamically linked to its latest version. If the module is subsequently replaced, processes already executing its code can either continue to execute the old code or execute the newly loaded code. The choice is determined by how the code is called.
If code is called through a fully qualified name, that is, in the form "ModuleName:FuncName", the latest version of the module is always called; otherwise the current version of the module is called. As an example, suppose we have written the following server loop:
-module(m).
...
loop(Data, F) ->
    receive
        {From, Q} ->
            {Reply, Data1} = F(Q, Data),
            m:loop(Data1, F)
    end.
Module m is loaded the first time it is called, for example when the function m:loop is called from outside. Because only one version of module m exists at that point, the loop function of the current module is the one called.
Suppose we now modify the code of module m, recompile it, and load the module again. When we call m:loop in the last clause of the receive statement, the code in the new version of module m is called. Note that it is the programmer's responsibility to ensure that the new code is compatible with the old code. It is strongly recommended that all calls that switch to new code be tail calls, so that there is no old code to return to; after such a tail call, all the old code of the module can be safely deleted.
If we want to carry on executing the code of the current module (the old version) rather than switch to the code of the new module, we write the loop without a fully qualified call, as follows:
-module(m).
...
loop(Data, F) ->
    receive
        {From, Q} ->
            {Reply, Data1} = F(Q, Data),
            loop(Data1, F)
    end.
In this case the code in the new version of the module is not called.
Used judiciously, this mechanism allows processes to execute old and new versions of code in different modules at the same time.
Note that there is a limit of two versions of the code. If an attempt is made to load a module for a third time, all processes still executing code from the first version of the module are killed.
Apart from the calling conventions above, a number of built-in functions are also provided for code replacement.
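The two-version rule and the difference between fully qualified and local calls can be modelled with a short Python sketch. CodeServer, Process, local_call and qualified_call are hypothetical names chosen for this illustration, and a module version is modelled as a plain function.

```python
class CodeServer:
    """Keeps at most two versions of each module: 'current' and 'old'."""

    def __init__(self):
        self.current = {}
        self.old = {}

    def load(self, module, code, processes=()):
        """Load a new version; processes still on the purged version die."""
        purged = self.old.pop(module, None)
        if module in self.current:
            self.old[module] = self.current[module]
        self.current[module] = code
        for proc in processes:
            if purged is not None and proc.code is purged:
                proc.alive = False


class Process:
    def __init__(self, server, module):
        self.server, self.module = server, module
        self.code = server.current[module]  # linked to the latest version
        self.alive = True

    def local_call(self):
        """Non-qualified call: stay in the version already being executed."""
        return self.code()

    def qualified_call(self):
        """Fully qualified call: switch to the latest loaded version."""
        self.code = self.server.current[self.module]
        return self.code()


def v1(): return "v1"
def v2(): return "v2"
def v3(): return "v3"


server = CodeServer()
server.load("m", v1)
p, q = Process(server, "m"), Process(server, "m")
server.load("m", v2, [p, q])
```

After the second load, p.local_call() still runs the first version while q.qualified_call() moves q to the second. A third load would purge the first version and, in this model, mark p as killed.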
3.9 Type notation
When we build a software module, how should we describe how it is to be used? The usual approach is to agree on an application programming interface (API) for the module.
The API is the set of functions that the module exports for external use, together with a description of the types required of their input values and the types of their return values.
The following example shows how the types of a number of functions can be specified with the SODBS type notation:
+type file:open(fileName(), read | write) ->
    {ok, fileHandle()} | {error, string()}.
+type file:read_line(fileHandle()) ->
    {ok, string()} | eof.
+type file:close(fileHandle()) ->
    true.
+deftype fileName() = [int()].
+deftype string() = [int()].
+deftype fileHandle() = pid().
Each primitive data type of SODBS has its own type. These primitive types are:
int(): the integer type.
atom(): the atom type.
pid(): the Pid type.
ref(): the reference type.
float(): the SODBS float type.
port(): the port type.
bin(): the binary type.
Tuple, list, and alternation types are defined recursively as follows:
If T1, T2, ..., Tn are types, then {T1, T2, ..., Tn} is a tuple type. If in {X1, X2, ..., Xn} X1 is of type T1, X2 is of type T2, ..., and Xn is of type Tn, we say that {X1, X2, ..., Xn} is of type {T1, T2, ..., Tn}.
If T is a type, then [T] is a list type. If all Xi in [X1, X2, ..., Xn] are of type T, we say that [X1, X2, ..., Xn] is of type [T]. Note that the empty list [] is also of type [T], where T is any type.
If T1 and T2 are types, then T1|T2 is an alternation type. If X is of type T1 or of type T2, we say that X is of type T1|T2.
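The recursive definitions above translate directly into a small membership check. The following Python sketch is illustrative only: has_type and the tuple-based encoding of types are assumptions made for this example, and only the int and float primitives are covered.

```python
def has_type(x, t):
    """Return True if value x belongs to type t.

    Types are encoded, for this sketch only, as:
      "int" / "float"           primitive types
      ("list", T)               the list type [T]
      ("tuple", T1, ..., Tn)    the tuple type {T1, ..., Tn}
      ("alt", T1, T2, ...)      the alternation type T1|T2
    """
    if t == "int":
        return isinstance(x, int) and not isinstance(x, bool)
    if t == "float":
        return isinstance(x, float)
    kind, *args = t
    if kind == "list":
        # Every element must have the element type; [] trivially qualifies.
        return isinstance(x, list) and all(has_type(e, args[0]) for e in x)
    if kind == "tuple":
        return (isinstance(x, tuple) and len(x) == len(args)
                and all(has_type(e, a) for e, a in zip(x, args)))
    if kind == "alt":
        return any(has_type(x, a) for a in args)
    raise ValueError(f"unknown type: {t!r}")


STRING = ("list", "int")  # mirrors a list-of-integers string type
```

For example, has_type([], ("list", "float")) holds because the empty list belongs to [T] for any T, while has_type([1, "a"], STRING) does not.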
New types can be introduced with the following notation:
+deftype name1() = name2() = ... = Type.
Here name1, name2, ... follow the syntax of SODBS atoms. Type variables are written according to the syntax of SODBS variables. For example, we can define:
+deftype bool() = true | false.
+deftype weekday() = monday | tuesday | wednesday |
    thursday | friday.
+deftype weekend() = saturday | sunday.
+deftype day() = weekday() | weekend().
Function types are written as follows:
+type?functionName(T1,T2,…,Tn)->T.
Here all the Ti are types. If a type variable occurs more than once in a type definition, then in any instance of the type all the values at the corresponding positions must have the same type.
Some examples follow:
+deftype string() = [int()].
+deftype day() = number() = int().
+deftype town() = street() = string().
+type factorial(int()) -> int().
+type day2int(day()) -> int().
+type address(person()) -> {town(), street(), number()}.
Finally, the type of an anonymous function is written as follows:
+type fun(T1,T2,...,Tn) -> T end
The type of map/2 is therefore written:
+type map(fun(X) -> Y end, [X]) -> [Y].
The type notation used here is a greatly simplified version of the type notation developed by Wadler & Marlow [49].
3.10 Discussion
This section has introduced a significant subset of SODBS, at least enough to understand all the examples in this document. I have not yet answered the question "Is SODBS a fault-tolerant system?" We believe the answer is yes. We said earlier that the software of a fault-tolerant system must satisfy certain requirements. I now confirm that SODBS does satisfy them, for the following reasons:
Processes are the basis of SODBS, so R1 is satisfied.
Because a process in SODBS is the unit of error encapsulation, R2 is satisfied. If a process terminates because of a software error, the other processes in the same SODBS node are unaffected (unless, of course, they are linked to the terminating process, in which case the influence between processes is intentional).
If a function in a process is called with incorrect arguments, or a system BIF is called with incorrect arguments, the process terminates immediately. Immediate termination corresponds to the notion of a fail-fast process, to Schneider's notion of a fail-stop processor, and to Renzel's view that we must detect errors and stop as soon as possible.
When a process fails, the reason for the failure is broadcast to the current link set of the process, so R3 and R4 are satisfied.
R5 is satisfied by the code upgrade mechanism.
R6 is not satisfied by SODBS itself, but it is satisfied by the SODBS libraries. Persistent storage can be implemented with dets or sodbs_db. dets is a single-machine, disk-based storage system. If a process or a node crashes, the data stored in dets survives. For better protection, data should be stored on two physically independent nodes; for this the sodbs_db database, which is an SODBS application, can be used.
I would also point out that Schneider's notions of "halt on failure", the "failure status property", and the "stable storage property" are likewise satisfied, directly or indirectly, either by SODBS itself or by the SODBS libraries.
4 Implementation techniques
Abstracting out concurrency: concurrent programs are, in some sense, much harder to write than sequential programs. To avoid mixing concurrent and sequential code in the same module, we show how to organize the code into two modules, one containing only concurrent code and the other containing only purely sequential code.
Maintaining the SODBS view of the world: in the SODBS world everything is a process. To help maintain this view I introduce the idea of a protocol converter, which helps the programmer maintain the notion that everything is an SODBS process.
The SODBS view of errors: the way SODBS handles errors differs fundamentally from that of other languages. I show how programs are written for the error case in SODBS.
Intentional programming: a programming style in which the reader can easily see the programmer's intention from the source code, rather than having to guess it from a superficial analysis of the code.
4.1.1 A fault-tolerant client/server model
I now extend our server program by adding error-recovery code; without it, the original server program crashes if an error occurs in the function F/2. The term "fault-tolerant" usually refers to hardware, but here we mean that the server tolerates errors in the function F/2 with which it is parameterized.
The function F/2 is evaluated inside a catch; if an RPC request would cause the server to fail, the client that issued the RPC is killed instead.
Comparing the new server code with the old, we find two small changes. The rpc code has become:
rpc(Name, Query) ->
    Name ! {self(), Query},
    receive
        {Name, crash} -> exit(rpc);
        {Name, ok, Reply} -> Reply
    end.
and the body of the receive statement in loop/3 has become:
case (catch F(Query, State)) of
    {'EXIT', Why} ->
        log_error(Name, Query, Why),
        From ! {Name, crash},
        loop(Name, F, State);
    {Reply, State1} ->
        From ! {Name, ok, Reply},
        loop(Name, F, State1)
end
Looking at these changes in more detail, we see that if evaluating F/2 in the server loop raises an exception, three things happen:
1. The exception is logged. In our program we simply print the exception, but in a more mature system we could record it in stable storage.
2. A crash message is sent to the client. When the client receives the crash message, an exception is raised in the client code. This is exactly the desired outcome, since it probably makes no sense for the client to carry on.
3. The server continues to operate with the old value of the state variable. The RPC therefore obeys "transaction semantics": either the operation succeeds completely and the state of the server is updated, or it fails and the state of the server is left unchanged.
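The three steps above can be sketched in Python as a single iteration of the server loop. This is an illustration under stated assumptions, not the SODBS server: serve_once and this handle_event are hypothetical, and the deliberate divide-by-zero on "robert" only imitates the kind of error that vshlr2 is said to contain.

```python
def handle_event(query, state):
    """Hypothetical handler with a deliberate error, in the spirit of vshlr2."""
    if query[0] == "i_am_at":
        _, who, where = query
        return "ok", {**state, who: where}
    if query[0] == "find":
        _, who = query
        if who == "robert":
            1 / 0  # deliberate error
        reply = ("ok", state[who]) if who in state else "error"
        return reply, state
    raise ValueError(f"unknown query: {query!r}")


def serve_once(name, f, state, query, log):
    """One loop iteration: returns (message_to_client, next_state)."""
    try:
        reply, new_state = f(query, state)
    except Exception as why:
        log.append((name, query, repr(why)))  # 1. log the exception
        return (name, "crash"), state         # 2. crash message, 3. old state
    return (name, "ok", reply), new_state


log = []
msg1, s1 = serve_once("vshlr", handle_event, {}, ("i_am_at", "joe", "sics"), log)
msg2, s2 = serve_once("vshlr", handle_event, s1, ("find", "robert"), log)
msg3, s3 = serve_once("vshlr", handle_event, s2, ("find", "joe"), log)
```

The failed query leaves the state exactly as it was, so the following lookup of "joe" still succeeds.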
Note that server2.sodbs protects only against errors in the characteristic function with which the server is parameterized. If the server itself dies (which is possible, for example if it is deliberately killed by another process in the system), the client's RPC stub hangs indefinitely, waiting for a reply that will never arrive. To guard against this possibility as well, we can write the RPC function as follows:
rpc(Name, Query) ->
    Name ! {self(), Query},
    receive
        {Name, crash} -> exit(rpc);
        {Name, ok, Reply} -> Reply
    after 10000 ->
        exit(timeout)
    end.
This solves one problem but introduces another: how long should the timeout be? A better solution is to use a supervision tree, which I do not go into here. A failure of the server should be detected not by client software but by a dedicated supervisor process responsible for repairing server faults.
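The timeout variant of the RPC stub can be imitated in Python with the standard queue module. In this sketch a queue.Queue stands in for a process mailbox, the function name rpc and the parameter timeout_s are illustrative, and the replies are assumed to be ("ok", Value) or ("crash",) tuples.

```python
import queue


def rpc(server_inbox, query, timeout_s=10.0):
    """Send query to the server and wait for the reply.

    Gives up after timeout_s seconds, so a dead server cannot hang the client.
    """
    reply_box = queue.Queue()
    server_inbox.put((reply_box, query))
    try:
        reply = reply_box.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError("timeout") from None
    if reply[0] == "crash":
        raise RuntimeError("rpc")
    return reply[1]  # reply is ("ok", value)
```

The choice of timeout_s remains as arbitrary here as in the original stub, which is the weakness the paragraph above points out.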
We can now run this server program parameterized with a version of VSHLR (vshlr2) that contains a deliberate error.
A fragment of a session with this program is as follows:
1> vshlr2:start().
true
2> vshlr2:find("joe").
error
3> vshlr2:i_am_at("joe","sics").
ok
4> vshlr2:find("joe").
{ok,"sics"}
5> vshlr2:find("robert").
Server vshlr query {find,"robert"}
caused exception {badarith,[{vshlr2,handle_event,2}]}
** exited: rpc **
6> vshlr2:find("joe").
{ok,"sics"}
The information in the exception is sufficient to help us debug the program.
The final improvement to the server program is to allow its code to be changed "on the fly", while the system is running.
I can parameterize the server with vshlr3. vshlr3 is not listed here; it is essentially identical to vshlr2, the only difference being that server2 on line 3 is changed to server3.
The following session shows how the server code is changed on the fly. Lines 1-3 show the server working normally. server3 survives the divide-by-zero error triggered on line 4 without crashing, as line 5 shows. On line 6 we send a command that changes the code of the server back to the version in vshlr1. Once this command has completed, the server works correctly, as shown on line 7.
1> vshlr3:start().
true
2> vshlr3:i_am_at("joe","sics").
ok
3> vshlr3:i_am_at("robert","FMV").
ok
4> vshlr3:find("robert").
Server vshlr query {find,"robert"}
caused exception {badarith,[{vshlr3,handle_event,2}]}
** exited: rpc **
5> vshlr3:find("joe").
{ok,"sics"}
6> server3:swap_code(vshlr,
       fun(I,J) -> vshlr1:handle_event(I,J) end).
ok
7> vshlr3:find("robert").
{ok,"FMV"}
The programmer who writes vshlr3 needs to know nothing about how server3 is implemented, nor even that the server code can be changed dynamically without stopping the service.
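The effect of swap_code in the session above can be sketched in Python. Server, RpcCrash, buggy and fixed are hypothetical names for this illustration; the point shown is that replacing the handler leaves the accumulated server state in place.

```python
class RpcCrash(Exception):
    """Raised in the client when the handler fails (models exit(rpc))."""


class Server:
    def __init__(self, handler, state):
        self.handler = handler
        self.state = state

    def call(self, query):
        try:
            reply, new_state = self.handler(query, self.state)
        except Exception as why:
            # The state is untouched, so the server keeps running.
            raise RpcCrash(repr(why)) from None
        self.state = new_state
        return reply

    def swap_code(self, new_handler):
        """Replace the handler on the fly; the state is kept."""
        self.handler = new_handler
        return "ok"


def fixed(query, state):
    if query[0] == "i_am_at":
        return "ok", {**state, query[1]: query[2]}
    who = query[1]
    return (("ok", state[who]) if who in state else "error"), state


def buggy(query, state):
    if query == ("find", "robert"):
        1 / 0  # deliberate error
    return fixed(query, state)


srv = Server(buggy, {})
srv.call(("i_am_at", "robert", "FMV"))
```

Calling srv.call(("find", "robert")) at this point raises RpcCrash; after srv.swap_code(fixed) the same query is answered from the state recorded before the swap.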
The ability to replace the server code without stopping the server partly satisfies "requirement 8" in Section 2.2, namely upgrading the software of the system without stopping the system.
If we look back at the code of server2 and of the application vshlr2, we find the following:
1. the code in the server program can repeat to be used for to make up the application program of many different client/server models.
2. the code of application program is simply more a lot of than the code of server program.
3. be appreciated that the code of server program, the programmer just must understand all details of SODBS models of concurrency.This just relate to name registration, process produce, to process send can not catch exit unusual,
Send and receive message.Concerning reporting unusually, the programmer also it must be understood that unusual notion, and is quite familiar to the exception handling of SODBS.
4. will write the code of application program, the programmer just only needs the simple orderization program of understanding portion---and they do not need to understand about concurrent and fault processing anything.
5. we can imagine, can cooperate operation with more and more ripe a series of server programs with the code of a application program.I had showed the server program of three versions, and we can also add increasing function in the server program, and kept the interface of server program/application program (server/application) constant.
6. what application program was given in different server program (server1, server2 or the like) infiltration is different nonfunctional characteristics (non-functional characteristics).And the functional characteristic of Servers-all program (functionalcharacteristics) all is the same (that is, that the final generation of the parameter program that input is correct all is same result); But nonfunctional characteristics is different.
7. (our said non-functional requirement is meant the behavior of system when system breaks down to the non-functional requirement of realization system, how long function evaluation needs or the like) partial code be limited within the server program, be sightless to the programmer who writes application program.
8. the details how to realize of remote procedure call (remote procedure call) is hidden in the server program module.This is with regard to meaning for the modification that will make server program from now on and can not have influence on CLIENT PROGRAM, and this point is necessary.For example, we can revise the realization details of rpc/2, and needn't revise the CLIENT PROGRAM of calling the function among the server2.The function of whole server is divided into non-functional parts (none-functional part) and a functional parts (functionalpart) is a kind of good programming practice, can brings many considerable benefits to system, just as:
1. Concurrent programming is generally considered difficult. In a large programming team the programmers' skill levels usually differ; the expert programmers should write the generic server code, and the less experienced programmers should write the application code.
2. Formal methods can be applied to the (simpler) application code. Formal verification of SODBS code, and the design of type systems for type inference, usually run into trouble as soon as concurrency is involved. If we assume that the generic server is correct, the problem of proving properties of the system reduces to the problem of proving properties of sequential programs.
3. In a system with a large number of client-servers, all the servers can be written with the same generic server. This makes it easier for programmers to understand and maintain many servers.
4. The generic server and the application code can be tested separately and independently. If the interface between them stays unchanged over a long period, the two can be improved independently.
5. The application code can be "plugged into" many different generic servers, each with different non-functional characteristics. With an identical interface, one server can provide an enhanced debugging environment while another provides clustering, hot-swap and similar features. This has been done in a number of projects; for example, the Eddie server provides clustering, and the Bluetail mail robustifier provides a server with hot-swap capability.
4.2 Maintaining the SODBS view of the world
The SODBS view of the world is that everything is a process, and that processes can interact only by exchanging messages.
When an SODBS program needs to interact with external software, we usually write an interface program to carry out the interaction. The interface program preserves the principle that "everything is a process", and is very convenient.
As an example, consider how to implement the web server side of an electronic payment server. The electronic payment server communicates with its clients using the HTTP protocol defined in RFC 2616.
From the point of view of an SODBS programmer, the inner loop of the electronic payment server creates a process for each connection, accepts a request from the client, and responds appropriately. The code might look like this:
serve(Client)->
    receive
        {Client,Request}->
            Response=generate_response(Request),
            Client!{self(),Response}
    end.
Here Request and Response are SODBS terms that represent the HTTP request and the HTTP response.
The server above is very simple: it expects a single request, replies with a single response, and then terminates the connection.
A more sophisticated server would also support the persistent connections specified by HTTP/1.1. The code that supports persistent connections is also very simple:
serve(Client)->
    receive
        {Client,close}->
            true;
        {Client,Request}->
            Response=generate_response(Request),
            Client!{self(),Response},
            serve(Client)
    after 10000->
        Client!{self(),close}
    end.
These 11 lines essentially implement a simple web server that supports persistent connections.
The web server does not interact directly with the clients that issue the HTTP requests, because doing so would clutter the web server code with irrelevant detail and make the structure of the program hard to understand.
Instead we use a "middle-man" process. The middle-man process (an HTTP driver) converts between HTTP requests and responses and the SODBS terms that represent them.
The complete code of the HTTP driver is as follows:
relay(Socket,Server,State)->
    receive
        {tcp,Socket,Data}->
            case parse_request(State,Data) of
                {completed,Request,State1}->
                    Server!{self(),{request,Request}},
                    relay(Socket,Server,State1);
                {more,State1}->
                    relay(Socket,Server,State1)
            end;
        {tcp_closed,Socket}->
            Server!{self(),close};
        {Server,close}->
            sodbs_tcp:close(Socket);
        {Server,Response}->
            Data=format_response(Response),
            sodbs_tcp:send(Socket,Data),
            relay(Socket,Server,State);
        {'EXIT',Server,_}->
            sodbs_tcp:close(Socket)
    end.
If a packet arrives from the client over a TCP socket, it is parsed by calling parse_request/2. When the request is complete, an SODBS term representing the request is sent to the server. If a response is received from the server, it is reformatted and sent to the client. If either side terminates the connection, or an error occurs in the server, the connection is closed. If this process terminates for any reason, all the connections are closed automatically.
The variable State is a state variable that represents the state of a re-entrant parser, which parses the incoming HTTP requests.
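The re-entrant behaviour of parse_request/2 can be sketched as follows. This is a deliberately simplified illustration, not the parser used by the HTTP driver: it assumes that State is the binary received so far (initially <<>>), that Data arrives as a binary, and that a request is complete once the blank line that ends the HTTP header has been seen. The module name req_parser is hypothetical, and the sketch is written in Erlang, whose syntax the SODBS listings resemble (an assumption).

```erlang
-module(req_parser).
-export([parse_request/2]).

%% State: bytes buffered from earlier packets (assumed to start as <<>>).
%% Returns {completed, Request, State1} or {more, State1}, the two shapes
%% that relay/3 matches on.
parse_request(State, Data) ->
    Buf = <<State/binary, Data/binary>>,
    case binary:split(Buf, <<"\r\n\r\n">>) of
        [Head, Rest] -> {completed, Head, Rest};  % header complete
        [_]          -> {more, Buf}               % wait for more data
    end.
```

Because the unfinished input is returned as the new state, the caller can feed the parser one packet at a time, which is what relay/3 does with each {tcp,Socket,Data} message.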
4.3 Error handling philosophy
Error handling in SODBS is fundamentally different from error handling in most other programming languages.
The SODBS philosophy of error handling can be expressed in the following slogans:
Let some other process fix the error.
If a worker cannot do its job, it should die.
Let it crash.
Do not program defensively.
4.3.1 Let some other process fix the error
How do we handle errors in a distributed system? To handle hardware errors we need redundancy, and to handle the failure of an entire computer we need two computers.
If computer 1 fails, computer 2 notices the failure and corrects the error.
If the first computer crashes, the second computer detects the failure and tries to correct the error it caused. SODBS uses exactly this approach, except that we treat processes the way we treat computers.
If process 1 fails, process 2 notices the failure and corrects the error.
If Pid1 fails, Pid1 and Pid2 are linked together, and Pid2 is set to trap errors, then a message of the form {'EXIT',Pid1,Why} is sent to Pid2 when Pid1 fails.
Why describes the reason for the failure.
Note that if the computer on which Pid1 runs dies, an exit message {'EXIT',Pid1,machine_died} is also sent to Pid2. This message appears to come from Pid1, but it actually comes from the run-time system of the node on which Pid2 runs.
The reason for making a hardware error look like a software error is that we do not want two different methods for handling errors, one for software errors and another for hardware errors. For conceptual integrity we want a single uniform mechanism. Combined with the extreme case of hardware error, in which the entire processor fails, this leads to our idea of error handling: handle the error not where it occurs but somewhere else in the system.
Therefore, in every case, including hardware failure, it is Pid2 that corrects the error. This is why I say "let some other process fix the error".
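The relationship between Pid1 and Pid2 described above can be sketched in a few lines. The sketch is written in Erlang, whose syntax the SODBS listings resemble (an assumption); the module name link_demo, the function watch/1 and the 5000 ms timeout are hypothetical names and values chosen for the example. The calling process plays the role of Pid2.

```erlang
-module(link_demo).
-export([watch/1]).

%% Run Fun as Pid1 in a linked process and report why it terminated.
watch(Fun) ->
    Old = process_flag(trap_exit, true),   % Pid2 is set to trap exits
    Pid1 = spawn_link(Fun),                % Pid1 and Pid2 are linked
    Result = receive
                 {'EXIT', Pid1, Why} -> {exited, Why}
             after 5000 -> no_exit_seen
             end,
    process_flag(trap_exit, Old),          % restore the caller's setting
    Result.
```

The code that reacts to the failure runs in the watching process, never in the process that failed, which is the separation the slogan asks for.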
This philosophy is completely different from that of sequential programming languages, where there is no alternative but to try to handle every error in the thread of control in which it occurs. In a sequential language that provides exception handling, the programmer wraps any code that may fail in an exception handling construct and tries to contain every possible error within that construct.
Remote error handling has several benefits:
1. The error handling code and the code that fails run in different threads of control.
2. The code that solves the problem is not cluttered with code that handles exceptions.
3. The method also works in a distributed system, so porting code from a single-node system to a distributed system requires only minor changes to the error handling code.
4. The system can be built and tested on a single node and then deployed on a multi-node system without major changes.
4.3.2 Workers and overseers
To make the distinction between processes that perform normal work and processes that handle errors clearer, we often speak of workers and overseers (supervisors).
One process, the worker process, is responsible for doing the normal work. Another process, the overseer process, observes the worker. If an error occurs in the worker, the overseer takes action to correct it. The attractive features of this arrangement are:
1. The separation of responsibilities is clear. The processes that do the work (the workers) do not have to worry about error handling.
2. We can have special processes that are responsible only for error handling.
3. We can run the worker processes and the overseer processes on physically separate computers.
4. The error correcting code often turns out to be generic, that is, applicable to many applications, whereas the worker code is more application-specific.
The third point is crucial: provided that SODBS satisfies requirements R3 and R4, workers and overseers can run on physically separate computers, so we can build systems that tolerate hardware errors in which entire processes fail.
4.4 Let it crash
How does our error handling philosophy fit in with practical programming? What kind of code should the programmer write when an error is found? The philosophy is to let some other process fix the error, but what does this mean for the person writing the code? The answer is: let it crash. By this I mean that when an error occurs, the program should simply crash. But what counts as an error? For programming purposes, I mean by an error:
An exception that the run-time system does not know how to handle.
An error that the programmer does not know how to handle.
If an exception is raised by the run-time system, but the programmer foresaw it and knows how to correct the condition that caused it, then it is not an error. For example, opening a file that does not exist raises an exception, but the programmer may decide not to treat it as an error. The programmer can write code that reports the exception and takes the necessary corrective action.
Some errors occur that the programmer does not know how to handle. The programmer is supposed to follow the specification, but often the specification does not say what to do, so the programmer does not know what to do either. Here is an example:
Suppose we are writing a program that generates operation codes for a microprocessor, and the specification says that a load operation should produce operation code 1 and a store operation should produce operation code 2. The programmer turns this specification into the following code:
asm(load)->1;
asm(store)->2.
Now suppose the system tries to evaluate asm(jump). What should happen? Suppose you are the programmer and you are used to writing defensive code; you might write:
asm(load)->1;
asm(store)->2;
asm(X)->??????
But what should the ?????? be? What code can you write here? You are now in the same position as the run-time system when it meets a division by zero: you cannot write any meaningful code. All you can do is terminate the program, so you write:
asm(load)->1;
asm(store)->2;
asm(X)->exit({oops,i,did,it,again,in,asm,X}).
But when the SODBS compiler compiles
asm(load)->1;
asm(store)->2.
it behaves almost as if you had written:
asm(load)->1;
asm(store)->2;
asm(X)->exit({bad_arg,asm,X}).
Defensive coding spoils the purity of the code and confuses the reader. Moreover, the diagnostic information produced by defensive code is often no better than the diagnostic information the compiler supplies automatically.
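The effect of leaving out the defensive clause can be observed with the small sketch below, written in Erlang, whose syntax the SODBS listings resemble (an assumption). The module name asm_demo and the helper try_asm/1 are hypothetical; the helper exists only to make the automatically generated exception visible, and in a running system the exception would instead be left to reach the overseer.

```erlang
-module(asm_demo).
-export([asm/1, try_asm/1]).

%% The non-defensive version: only the cases named in the specification.
asm(load)  -> 1;
asm(store) -> 2.

%% Shows which exception the run-time system raises for an unknown argument.
try_asm(Op) ->
    try asm(Op) of
        Code -> {ok, Code}
    catch
        error:function_clause -> {error, function_clause}
    end.
```

For an argument such as jump, no clause of asm/1 matches, so the run-time system raises a function_clause error without any code having been written for that case.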
4.5 Intentional code
"Intentional code" is the name we give to a coding style in which the reader of the program can easily see what the programmer intended by a piece of code.
The intention of the code should be obvious from the names of the functions it calls, and should not have to be inferred by analysing the structure of the code. The following example illustrates this point:
In an early version of the SODBS library module dict, a function lookup/2 was exported with the following interface:
lookup(Key,Dict)->{ok,Value}|not_found
With this definition, lookup was used in three different contexts:
1. For data retrieval. The programmer might write:
{ok,Value}=lookup(Key,Dict)
Here lookup extracts an entry with a known key from the dictionary. The key should be in the dictionary, and it is a programming error if it is not, so an exception is raised if the key is not found.
2. For searching. The code fragment:
case lookup(Key,Dict) of
    {ok,Val}->
        ...do something with Val...
    not_found->
        ...do something else...
end.
searches the dictionary when we do not know whether Key is present; it is not a programming error if the key is not in the dictionary.
3. For testing the presence of a key. The code fragment:
case lookup(Key,Dict) of
    {ok,_}->
        ...do something...
    not_found->
        ...do something else...
end.
tests whether a specific key Key is in the dictionary.
After reading thousands of lines of code like this, we begin to worry about intention: we ask ourselves "what did the programmer intend by this line of code?", and after analysing the three usages above the answer is data retrieval, a search, or a test.
Key lookup in a dictionary is needed in many different contexts. In one case the programmer knows that a specific key should be in the dictionary; if it is not, this is a programming error and the program should terminate. In another case the programmer does not know whether the entry for the key is in the dictionary, and the program must handle both the case where the key is present and the case where it is not.
Instead of guessing the programmer's intention and analysing the code, a better set of library functions is:
dict:fetch(Key,Dict)=Val|'EXIT'
dict:search(Key,Dict)={found,Val}|not_found
dict:is_key(Key,Dict)=Boolean
These functions express the programmer's intention precisely; no guessing or program analysis is needed, because the intention of the program is plain to see.
Note that fetch can be implemented in terms of search, and search in terms of fetch.
If fetch is taken as the primitive, we can write:
search(Key,Dict)->
    case (catch fetch(Key,Dict)) of
        {'EXIT',_}->
            not_found;
        Value->
            {found,Value}
    end.
But this is not good code, because we first raise an exception (which should signal that the program is in error) and then correct the error.
A better approach is:
find(Key,Dict)->
    case search(Key,Dict) of
        {found,Value}->
            Value;
        not_found->
            exit({find,Key})
    end.
Now exactly one exception is raised, and it represents an error.
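The three intention-revealing functions can be sketched together as follows, with search as the primitive. The sketch is written in Erlang, whose syntax the SODBS listings resemble (an assumption). It assumes that a dictionary is simply a list of {Key,Value} pairs, and the module name idict is hypothetical; the real dict module may be implemented quite differently.

```erlang
-module(idict).
-export([fetch/2, search/2, is_key/2]).

%% Dict is assumed to be a list of {Key, Value} pairs.

%% Searching: the key may or may not be present; neither case is an error.
search(Key, Dict) ->
    case lists:keyfind(Key, 1, Dict) of
        {_, Val} -> {found, Val};
        false    -> not_found
    end.

%% Data retrieval: a missing key is a programming error, so exit.
fetch(Key, Dict) ->
    case search(Key, Dict) of
        {found, Val} -> Val;
        not_found    -> exit({fetch, Key})
    end.

%% Testing: only the presence of the key matters.
is_key(Key, Dict) ->
    search(Key, Dict) =/= not_found.
```

A caller that writes idict:fetch(K,D) states that the key must exist, whereas a caller that writes idict:search(K,D) states that absence is a normal outcome.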
4.6 Discussion
Designing software systems is a demanding activity. Writing code that is clearly structured and whose intention is obvious is difficult. Part of the difficulty lies in choosing the right abstractions. To cope with complexity we use the method of "divide and conquer": we break a complex problem into a number of simpler subproblems and then solve the subproblems.
This chapter has shown how a number of complex problems can be broken into simpler subproblems. For error handling, I have explained how to "abstract out" the errors, and argued that a program should separate the "pure" code from the code that fixes errors.
In writing a server, I have shown how to abstract out two non-functional properties of the server. I have shown how to write a server that does not crash when an error occurs in the characteristic function (the function that defines the behaviour of the server), and how to change the behaviour of the server without stopping it. Recovering from errors and changing code in a running system are two typical non-functional properties that many real systems need. Common programming languages and systems give strong support for writing code with well-defined functional behaviour, but poor support for the non-functional parts of the program.
In most programming languages it is easy to write a pure function (one whose value depends only on its inputs), but it is much harder, and sometimes impossible, to change the code of a running system, to handle errors in a generic way, or to protect our code from failures in other parts of the system. For this reason programmers use services provided by the operating system, which usually offers protection domains, concurrency and so on in the form of processes.
In a sense, the operating system provides "what the programming language designer forgot". In a software platform such as SODBS, however, the operating system is hardly needed. All the OS really provides to SODBS is a set of device drivers; the mechanisms the OS offers for processes, message passing, scheduling, memory management and so on are not needed. The problem with using OS mechanisms to make up for deficiencies in the programming language is that the low-level mechanisms of the operating system cannot easily be changed. For example, the operating system's notion of what a process is, and its policy for scheduling between processes, cannot be modified.
By providing the programmer with lightweight processes and basic mechanisms for detecting and handling errors, we allow the author of an application to design and implement an application-specific operating system, one designed for the characteristics of the specific problem. The SODBS system, a set of applications written in C, is one example of this.
5 Fault-tolerant systems
The designers of financial equipment devote half of their software design effort to error detection and correction.
What is a fault-tolerant system, and how do we write one? This question is the focus of this document and the key to understanding how fault-tolerant systems are built. In this chapter we define what we mean by "fault tolerance" and propose a specific method for writing fault-tolerant systems. We begin the chapter with two quotations:
If the programs of a system can still execute correctly when logic errors occur, we say that the system is fault-tolerant.
To design and build a fault-tolerant system, you must understand when the system should work, when it may fail, and what types of error may occur. Error detection is an essential component of a fault-tolerant system. That is, if you know that an error has occurred, you may be able to tolerate it by replacing the failed component, by using an alternative means of computation, or by raising an exception. However, you want to avoid adding unnecessary complexity to achieve fault tolerance, because that complexity may reduce the reliability of the system.
We explain what happens when an abnormal condition is detected, and build a software mechanism for detecting and correcting errors.
The rest of this chapter covers:
A strategy for fault-tolerant programming. In brief, the strategy is: when you cannot correct an error, give up and try to do something simpler that you can accomplish.
Supervision hierarchies, which are hierarchical organizations of tasks.
Well-behaved functions, which are functions that are supposed to work correctly. An exception raised by a well-behaved function is interpreted as a failure.
5.1 Overview of fault-tolerant systems
To make a system fault-tolerant, we organize the software into a hierarchy of tasks to be performed. The top-level task runs the application according to some specification. If this task cannot be performed, the system tries to perform a simpler task. If the simpler task cannot be performed either, the system tries an even simpler task, and so on. If the lowest-level task in the system cannot be performed, the system is considered to have failed. This method is intuitively attractive. It says that if we cannot do what we want to do, we should try to do something easier. We also try to organize the software so that simpler tasks are performed by simpler software, so that the simpler the task becomes, the more likely it is to succeed.
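The idea of falling back to progressively simpler tasks can be sketched sequentially as follows. This is only an illustration of the principle, written in Erlang, whose syntax the SODBS listings resemble (an assumption). The module name task_ladder and the representation of a task as a zero-argument function are assumptions; it also runs every task in a single process, whereas the mechanism described later in this chapter gives each task to a separate worker process watched by an overseer.

```erlang
-module(task_ladder).
-export([run/1]).

%% Tasks: a list of zero-argument funs, most ambitious first.
%% Returns {ok, Result} of the first task that completes without an
%% exception; if even the simplest task fails, the system has failed.
run([]) ->
    exit(all_tasks_failed);
run([Task | Simpler]) ->
    try Task() of
        Result -> {ok, Result}
    catch
        _:_ -> run(Simpler)     % this task failed, try a simpler one
    end.
```

For example, the first task might process a payment in full and the last might only record the request for later processing (both hypothetical).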
As the tasks become simpler, the emphasis of the operation changes: we become less concerned with providing a full service and more concerned with protecting the system from damage. Although we become more conservative as we move down the task levels, at every level our goal is still to provide an acceptable level of service.
When a failure occurs, we become more concerned with protecting the system and with recording the exact reason for the failure, so that something can be done about it later. This means that we need a persistent error log that survives a system crash. Under exceptional circumstances our system may fail, but when it does we should never lose the information about why it failed.
To implement our task hierarchy we need a precise understanding of the word "failure". In SODBS, evaluating a function may raise an exception. But an exception is not the same thing as an error, and not every error leads to a failure. We therefore need to discuss the differences between exceptions, errors and failures.
The main difference between exceptions, errors and failures lies in where in the system the abnormal event is detected, how it is handled, and how it is interpreted. We trace what happens when an abnormal condition occurs in our system; the description is "bottom-up", that is, it starts from the point where the error is first detected.
At the lowest level of the system, the SODBS virtual machine detects an internal error: a division by zero, a pattern matching error, or some other condition. The important point is that once such a condition is detected, further evaluation at the point where the error occurred is meaningless for the process. The virtual machine emulator therefore cannot continue, and it does the only thing it can do, which is to raise an exception.
At the next level, the exception may or may not be caught. The code that catches the exception may or may not be able to correct the error that caused it. If the error is corrected successfully, no damage is done and the process resumes normal operation. If the error is caught but not corrected, another exception may be raised, and the process that raised it may or may not catch it.
If an exception is raised and there is no catch handler for it, the process fails. The reason for the failure is propagated to all the processes that are currently linked to it.
Each process that receives such a failure signal may or may not intercept it and handle it like a normal inter-process message.
We have now seen how an abnormal condition detected in the virtual machine emulator propagates upwards through the system. At each point on the way up, an attempt may be made to correct the error.
The attempt may succeed or fail, so we are free to decide where and how the error is handled. An error that has been "corrected" is no longer considered a failure, but this requires that the error condition was foreseen and that the code written to correct it runs successfully.
So far we have seen how an abnormal condition arises, how it leads to an exception, how the exception may be caught, how an uncaught exception causes a process to fail, and how the failure of a process is detected by other processes in the system. These are the mechanisms on which we rely to implement our "task hierarchy".
5.2 Supervision hierarchies
Recall that at the beginning of this chapter we introduced the idea of a "task hierarchy". The basic idea is:
1. Try to perform a task.
2. If you cannot perform the task, try to perform a simpler task.
With each task we associate an overseer process (supervisor process). The overseer assigns a worker to try to achieve the goals of the task. If the worker process fails with an abnormal exit signal, the overseer assumes that the task has failed and starts an error recovery procedure. The error recovery procedure may restart the worker or, if the restart fails, try something simpler.
Overseers and workers are arranged into a hierarchical tree according to the following rules:
1. A supervision tree is a tree of overseers.
2. Overseers monitor workers and overseers.
3. Workers are instances of behaviours.
4. Behaviours are parameterized by well-behaved functions (WBFs).
5. Well-behaved functions raise exceptions when errors occur.
Here:
A supervision tree is a hierarchical tree of overseers. Each node in the tree is responsible for monitoring errors in its child nodes.
An overseer is a process that monitors other processes in the system. The monitored objects are either overseers or workers. An overseer must be able to detect exceptions raised by the objects it monitors, and must be able to start, stop and restart those objects.
A worker is a process that performs a task.
If a worker process terminates with an abnormal exit signal (see section 3.5.6), the overseer assumes that an error has occurred and takes action to recover from it.
In our model a worker is not an arbitrary process but an instance of one of a small number of generic processes (called behaviours).
A behaviour is a generic process whose operation is completely characterized by a small number of callback functions. These callback functions must be well-behaved functions (WBFs).
An example of a behaviour is sodbs_server, which is used to write distributed, fault-tolerant client-server programs. A behaviour is parameterized by a number of WBFs.
To write a fault-tolerant distributed client-server program, all the programmer has to understand is how to write a WBF. The sodbs_server behaviour provides a fault-tolerant framework for concurrency and distribution. The programmer only needs to be concerned with writing the WBFs that parameterize the behaviour.
For simplicity we consider two kinds of supervision hierarchy here: linear hierarchies and AND/OR hierarchy trees. The following sections describe them graphically.
5.2.1 Diagrammatic notation
Overseers and workers can be conveniently represented with the symbols shown in Figure 2.
Overseers are drawn as rectangles with square corners. A symbol T in the top right-hand corner of the rectangle indicates the type of the overseer.
The value of T is either "O", denoting "or" supervision, or "A", denoting "and" supervision. The types of supervision are described in detail later.
An overseer can supervise any number of workers or overseers. For each entity it supervises, the overseer must know how to start, stop and restart that entity. This information is kept in an SSRS, which stands for "Start Stop and Restart Specification". Every overseer (except the top-level overseer of the hierarchy) has exactly one overseer directly above it, which we call its parent. Conversely, the overseers directly below a given overseer in the hierarchy are called its children.
Workers are drawn as rectangles with rounded corners. Workers are parameterized by well-behaved functions.
5.2.2 Linear supervision
I begin with the linear hierarchy. Fig. 3 shows a linear hierarchy made up of three overseers.
Each overseer has an SSRS for each of its children and obeys the following rules:
If an overseer is stopped by its parent, it stops all of its children.
If any child of an overseer crashes, the overseer restarts that child.
The system is started by starting the top-level overseer. When the top-level overseer is first started, it uses SSRS1. The top-level overseer has two children, a worker and an overseer. It starts the worker (a behaviour parameterized by the well-behaved function WBF1) and, at the same time, a child overseer. The lower-level overseers in the hierarchy are started in a similar way, until the whole system is running.
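The two rules of linear supervision can be sketched as a small overseer process. The sketch is written in Erlang, whose syntax the SODBS listings resemble (an assumption), and it is not the overseer implementation of SODBS. The module name mini_sup, the stop message, and the reduction of the SSRS to a list of {Name,Fun} pairs (where Fun is the body of the child) are all simplifying assumptions.

```erlang
-module(mini_sup).
-export([start/1]).

%% Specs: list of {Name, Fun}; Fun is the zero-argument body of a child.
%% This stands in for the SSRS and is an assumption of the sketch.
start(Specs) ->
    spawn(fun() ->
                  process_flag(trap_exit, true),
                  loop([{spawn_link(F), Name, F} || {Name, F} <- Specs])
          end).

loop(Children) ->
    receive
        stop ->
            %% Rule 1: stopped by the parent, so stop all the children.
            lists:foreach(fun({Pid, _, _}) -> exit(Pid, kill) end, Children),
            ok;
        {'EXIT', Pid, Why} ->
            case lists:keytake(Pid, 1, Children) of
                {value, _, Rest} when Why =:= normal ->
                    loop(Rest);                 % finished normally: no restart
                {value, {_, Name, F}, Rest} ->
                    %% Rule 2: a child crashed, so restart that child.
                    loop([{spawn_link(F), Name, F} | Rest]);
                false ->
                    loop(Children)
            end
    end.
```

A child here may itself be another overseer started in the same way, which gives the linear chain of overseers described above.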
5.2.3 And/or supervision hierarchies
We can extend our simple supervision hierarchy into a tree containing "and" nodes and "or" nodes. The character "A" marks an "and" overseer, and the character "O" marks an "or" overseer. An overseer in an and/or tree obeys the following rules:
If an overseer is stopped by its parent, it stops all of its children.
If a child of an overseer crashes and the overseer is an "and" overseer, the overseer stops all of its children and then restarts all of them.
If a child of an overseer crashes and the overseer is an "or" overseer, the overseer restarts only that child.
"And" supervision is used for dependent or co-ordinated processes. In an "and" tree, successful operation of the system depends on the success of all the children; therefore, when any child crashes, all the children should be stopped and restarted.
"Or" supervision can be used to co-ordinate the activities of independent processes. In an "or" tree, the activities of the children are assumed to be independent, so the failure of one child does not affect the others; when a child fails, only that child's process needs to be restarted.
Concretely, our "task hierarchy" is represented by a "supervision hierarchy".
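The and/or restart rules of Section 5.2.3 can be sketched in a few lines. The following Python sketch is illustrative only and is not SODBS code; the function name on_child_crash and the (action, child) step tuples are assumptions introduced here.

```python
def on_child_crash(kind, children, crashed):
    """Return the steps an overseer takes when one child crashes.

    kind     -- "and" or "or" (the overseer type T)
    children -- names of all supervised children, in start order
    crashed  -- name of the child that crashed
    """
    if crashed not in children:
        raise ValueError(f"unknown child {crashed!r}")
    if kind == "or":
        # Children are independent: restart only the crashed child.
        return [("restart", crashed)]
    if kind == "and":
        # Children depend on each other: stop the survivors,
        # then start every child again.
        stops = [("stop", c) for c in children if c != crashed]
        starts = [("start", c) for c in children]
        return stops + starts
    raise ValueError(f"unknown overseer type {kind!r}")
```

With children ["a", "b", "c"] and a crash of "b", an "or" overseer yields a single restart step, while an "and" overseer stops "a" and "c" and then starts all three.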
In our system, we equate tasks with goals, and each goal has an invariant; if the invariant associated with a goal is not false, we say that the goal has been achieved. In most programs, checking the invariant corresponds to evaluating a designated function and observing whether it generates an exception.
Candea and Fox have done similar work; they built a system based on "recursively-restartable" Java components.
Note that we have divided errors into two classes: correctable errors and uncorrectable errors. Correctable errors are errors in a component that can be detected and corrected. Uncorrectable errors are errors that can be detected but for which no correction procedure has been specified.
The discussion above is rather vague, since we have not yet said what an error is, nor how, in practice, to distinguish correctable errors from uncorrectable ones.
The situation is complicated by the fact that most specifications only say what to do when every component in the system works as planned, and rarely say what to do when a particular error occurs. Indeed, if a specification did say exactly what to do when a particular error occurs, many would argue that this is not an error at all but an intended feature of the system. This makes the meaning of the word "error" even vaguer.
5.3 What is an error?
When our program runs, the run-time system has no idea what is to be considered an error; it simply executes the code. The only indication that something has gone wrong is the generation of an exception.
The run-time system generates an exception automatically when it cannot decide what to do. For example, when performing a division, the run-time system may detect a "divide by zero" condition and generate an exception, because it does not know how to handle it. An exception does not always correspond to an error, however. For example, if the programmer has written code that correctly handles the "divide by zero" exception, the occurrence of that exception need no longer be regarded as an error.
Whether an exception corresponds to an error is entirely up to the programmer; in our system, the programmer must state explicitly which functions in the system must never generate an exception.
A component is said to have failed once its behaviour is no longer consistent with its specification.
For our purposes, we define an error as a deviation between the observed behaviour of the system and its desired behaviour. Here the desired behaviour is "the behaviour that the specification says the system should have".
The programmer must ensure that, once the behaviour of the system deviates from the specification, some error recovery procedure is started, and that a record of the event is written to a persistent error log so that it can be corrected later.
When building real systems, matters become more complicated because we usually do not have a complete specification. In that case the programmer has some general notion of what should and should not be treated as an error. In the absence of an explicit specification we need an implicit mechanism that matches our intuitive idea that an error is "something that causes the program to crash".
In the SODBS system we write well-behaved functions (WBFs), which are used to parameterize the SODBS behaviours. These functions are called by the code in the SODBS behaviours. If calling a parameterizing function generates an exception, this is defined to be an error, and an error diagnostic is added to the error log.
loop(Name, F, State) ->
    receive
        {From, Query} ->
            case (catch F(Query, State)) of
                {'EXIT', Why} ->
                    log_error(Name, Query, Why),
                    From ! {Name, crash},
                    loop(Name, F, State);
                {Reply, State1} ->
                    From ! {Name, ok, Reply},
                    loop(Name, F, State1)
            end
    end.
The callback function F is called inside a catch statement. If an exception Why is generated, it is treated as an error and an error message is added to the error log.
This is a very simple example, but it illustrates the basic principle of error handling in the SODBS behaviours. For example, for the SODBS behaviour sodbs_server, we write a callback module M that parameterizes the server. Among other things, the module M must export the callback function handle_call/3.
5.3.1 Well-behaved functions
A well-behaved function (WBF) is a function that should not normally generate an exception. If a WBF does generate an exception, the exception is interpreted as an error.
If an exception is generated while evaluating a WBF, the WBF should try to correct the condition that caused the exception. If an uncorrectable error occurs inside a WBF, the programmer should terminate the function with an explicit exit statement.
Well-behaved functions are written according to the following rules:
Rule 1: the program should be isomorphic to the specification.
The program should follow the specification faithfully. If the specification says to do something silly, the program should do something silly. The program must faithfully reproduce any errors in the specification.
Rule 2: if the specification does not say what to do, generate an exception.
This is very important. Specifications often say what to do when one thing happens but omit what to do when something else happens. The answer is then "generate an exception". Unfortunately, many programmers take this as an opportunity for creative guess-work and try to guess what the designer must have intended.
If systems are written in this way, the observed exceptions will reflect errors in the specification.
Rule 3: if the generated exception does not contain enough information to isolate the error, add extra helpful information to the exception.
When programmers write code, they should ask themselves what information should be written to the error log when an error occurs. If the error information is insufficient for debugging, they should add enough information to the exception for the program to be debugged later.
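Rules 2 and 3 can be illustrated with a small sketch. The Python code below is not SODBS code; the fee table, the names SpecError, shipping_fee and fee_for_order, and the "specification" in the comments are all hypothetical and exist only to show the pattern.

```python
class SpecError(Exception):
    """Raised when the specification does not say what to do (rule 2)."""


def shipping_fee(region):
    # Hypothetical specification: "domestic" costs 5, "overseas" costs 20.
    # Nothing is specified for any other region.
    fees = {"domestic": 5, "overseas": 20}
    if region not in fees:
        # Rule 2: do not guess what the designer intended.
        raise SpecError(f"no fee specified for region {region!r}")
    return fees[region]


def fee_for_order(order_id, region):
    try:
        return shipping_fee(region)
    except SpecError as exc:
        # Rule 3: add the context needed to isolate the error later.
        raise SpecError(f"order {order_id}: {exc}") from exc
```

The caller never receives an invented fee for an unknown region; it receives an exception whose message names both the order and the offending region.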
Rule 4: turn non-functional requirements into assertions (invariants) that can be checked at run time. If an assertion fails, generate an exception.
An example of this concerns loop termination: a programming error might cause a function to enter an infinite loop, so that the function never returns. Such an error can be detected by requiring that certain functions terminate within a specified time. A timer checks this; if a function does not terminate within the specified time, an exception is generated and the function is terminated.
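The time-limit check of rule 4 can be sketched as follows. This is a Python illustration under stated assumptions, not the SODBS mechanism: the names DeadlineExceeded, call_with_deadline and slow_echo are hypothetical, and, unlike the text above, a Python worker thread cannot be forcibly terminated, so the sketch only raises the exception in the caller while the late call finishes in the background.

```python
import concurrent.futures
import time


class DeadlineExceeded(Exception):
    """Raised when a call does not finish within its time budget."""


def call_with_deadline(func, timeout_s, *args):
    """Run func(*args); raise DeadlineExceeded if it takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise DeadlineExceeded(
            f"{func.__name__} did not finish within {timeout_s}s"
        ) from None
    finally:
        # Do not block the caller; the worker thread is left to finish.
        pool.shutdown(wait=False)


def slow_echo(value, delay_s):
    """Hypothetical worker used to exercise the deadline."""
    time.sleep(delay_s)
    return value
```

Calling call_with_deadline(slow_echo, 0.05, "late", 0.3) raises DeadlineExceeded, which the enclosing behaviour can then treat as an error in the usual way.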
6 Building applications
The preceding chapters introduced a general model for writing fault-tolerant systems and the idea of a "supervision tree" that monitors the behaviour of the system. This chapter moves from general theory to the specific implementation of overseers in the SODBS system.
To illustrate the supervision principles, I have built a simple SODBS application. The application consists of an overseer process that manages three worker processes, which are instances of the three behaviours sodbs_server, sodbs_event and sodbs_fsm.
6.1 The behaviour library
Applications that use the SODBS platform software are built from a number of "behaviours". A behaviour is an abstraction of a common programming pattern, and can be used as a building block when implementing a system in the SODBS language. The behaviours discussed in the rest of this chapter are listed below:
sodbs_server: this behaviour is used to build the servers used in client-server models.
sodbs_event: this behaviour is used to build event handlers. Event handlers are programs such as error loggers. An event handler is a program that responds to a stream of events; it does not necessarily reply to the processes that send events to it.
sodbs_fsm: this behaviour is used to implement finite state machines.
supervisor: this behaviour is used to implement supervision trees.
application: this behaviour is a container for packaging a complete application.
For each behaviour I present the general principles involved and some specific details of its programming API, and give a complete example of how to create an instance of the behaviour.
Systems built on the SODBS platform are organized in the following hierarchical manner:
Releases: releases are at the top of the hierarchy. A release contains all the information necessary to build and run a system. A release consists of a software archive (packaged in some form) and a set of procedures for installing the release. Because a release upgrade must be installed without stopping the target system, installing a release can be very complicated. An SODBS release packages this complexity into a single abstract unit. Internally, a release consists of zero or more applications.
Applications: an application is simpler than a release. It contains all the code and operating procedures needed to run a single application, but not the entire system. When a release contains several applications, the system should be organized so that either the applications are largely independent of one another, or the applications have strictly hierarchical dependencies.
Overseers: SODBS applications are generally built from a number of overseer instances.
Workers: SODBS overseers supervise worker nodes. Worker nodes are usually instances of behaviours such as sodbs_server, sodbs_event or sodbs_fsm.
We explain applications in particular. An application is built bottom-up, starting from the worker nodes. I create three worker nodes (one instance each of sodbs_server, sodbs_event and sodbs_fsm). The worker nodes are managed by a simple supervision tree, and the supervision tree is packaged into an application.
I start with the worker nodes.
6.1.1 How the behaviour library is written
The SODBS behaviours are written in a programming style similar to that of the examples in Section 4.1. There is one main difference: instead of parameterizing a behaviour by an arbitrary function, we parameterize it by the name of a module. This module must export a specified set of pre-defined functions. Exactly which functions must be exported depends on the behaviour.
The complete API of each behaviour is documented in detail in its reference manual.
For example, if xyz is an instance of the sodbs_server behaviour, then xyz.sodbs must contain the following code:
-module(xyz).
-behaviour(sodbs_server).
-export([init/1,handle_call/3,handle_cast/2,
handle_info/2,terminate/2,change_code/3]).
...
xyz.sodbs must export the six functions init/1 and so on, as shown above. To create an instance of a sodbs_server, we call:
sodbs_server:start(ServerName,Mod,Args,Options)
Here ServerName names the server, Mod is the atom xyz, Args is the argument passed to xyz:init/1, and Options is an argument that controls the behaviour of the server itself. Options is not passed as an argument to the module xyz.
The method of parameterizing behaviours in the examples given in Chapter 4 is in some ways more general than the method SODBS adopts. The difference is mainly historical: the first behaviours were written before funs were added to SODBS.
6.2 Principles of the private server
We introduced the idea of a private server in Chapter 4. A private server provides an "empty" server, that is, a framework that can be instantiated as a server. Chapter 4 illustrated the principles involved in making a private server. In the SODBS system, the module sodbs_server is used to build the server side of client-server models.
sodbs_server can be parameterized in many different ways to produce many different kinds of server.
6.2.1 The private server API
To understand the sodbs_server API, we look at the flow of control between the server program and the application. I describe the subset of the sodbs_server API that is used in the examples in this chapter.
sodbs_server:start(Name1,Mod,Arg,Options)->Result
where:
Name1 = the name of the server (see note 1).
Mod = the name of the callback module (see note 3).
Arg = an argument passed to Mod:init/1 (see note 4).
Options = a set of options controlling how the server works.
Result = a value obtained by evaluating Mod:init/1 (see note 4).
sodbs_server:call(Name2,Term)->Result
where:
Name2 = the name of the server (see note 2).
Term = an argument passed to Mod:handle_call/3 (see note 4).
Result = a value obtained by evaluating Mod:handle_call/3 (see note 4).
sodbs_server:cast(Name2,Term)->ok
where:
Name2 = the name of the server (see note 2).
Term = an argument passed to Mod:handle_cast/2 (see note 4).
Notes:
1. Name1 should be a term of the form {local,Name2} or {global,Name2}. Starting a local server creates a server on a single node. Starting a global server creates a server on one node that can be accessed transparently from any other node in the distributed SODBS system.
2. Name2 is an atom.
3. Mod should export some or all of the following functions: init/1, handle_call/3, handle_cast/2, terminate/2. These functions are called by sodbs_server.
4. Some of the arguments to the sodbs_server functions are passed unchanged as arguments to functions in Mod. Similarly, some of the terms in the return values of the functions in Mod appear in the return values of sodbs_server functions. The callback functions provided by Mod should follow the specifications below:
Mod:init(Arg)->{ok,State}|{stop,Reason}
This function attempts to start the server:
Arg is the third argument provided to sodbs_server:start/4.
{ok,State} means that the server has started successfully. The internal state of the server becomes State, and the original call to sodbs_server:start returns {ok,Pid}, where Pid is the identifier of the server. {stop,Reason} means that the server failed to start; in this case the call to sodbs_server:start returns {error,Reason}.
Mod:handle_call(Term,From,State)->{reply,R,S1}
This function is called when the user calls sodbs_server:call(Name,Term):
Term is an arbitrary term (translator's note: this term is user-defined and identifies the specific request).
From identifies the client.
State is the current state of the server.
{reply,R,S1} makes R the return value of sodbs_server:call/2, and the new state of the server becomes S1.
Mod:handle_cast(Term,State)->{noreply,S1}|{stop,R,S1}
This function is called when the user calls sodbs_server:cast(Name,Term):
Term is an arbitrary term.
State is the current state of the server.
{noreply,S1} changes the state of the server to S1.
{stop,R,S1} stops the server. When the server stops, Mod:terminate(R,S1) is called.
Mod:terminate(R,S)->void
This function is called when the server stops; its return value is ignored:
R is the reason the server stopped.
S is the current state of the server.
6.2.2 Example of a private server
Here is an example that implements a simple key-value server with sodbs_server. The key-value server is implemented by a callback module named kv.
Line 2 of kv tells the compiler that this module is a callback module for the sodbs_server behaviour. If the module does not export the correct set of callback functions required by sodbs_server, the compiler issues a warning.
The client functions can be called from anywhere in the system. The callback functions are called only from within the sodbs_server module.
kv:start() starts the server by calling sodbs_server:start_link/4. The first argument to sodbs_server:start_link/4 is the location of the server. In our example the location is {local,kv}, which means that the server is a locally registered process named kv. The location argument can take many other values, including {global,Name}, which registers the server under a global name rather than a local one, so that the server can be accessed under that name from any node in the distributed SODBS system. The remaining arguments to sodbs_server:start_link/4 are the callback module name (kv), the initial argument (arg1) and a set of control and debugging options ([]). Setting the options to [{debug,[trace,log]}] turns on debugging and writes debugging information to a log file.
When sodbs_server:start_link/4 is called, sodbs_server calls kv:init(Arg) to initialize its internal data structures, where Arg is the third argument supplied to sodbs_server:start_link/4.
In general, init/1 should return a tuple of the form {ok,State}.
The client functions exported by kv, store/2 and lookup/1, are implemented by calling sodbs_server:call/2.
Internally, the remote procedure call is implemented by calling the callback function handle_call/3. Lines 23-29 implement the callback functions needed on the server side of the remote procedure call. The first argument of handle_call is a pattern that must match the second argument used in the call to sodbs_server:call/2. The third argument is the state of the server. In general, handle_call should return {reply,R,State1}, where R is the return value of the remote procedure call (this value becomes the return value of sodbs_server:call/2 and is finally returned to the client) and State1 becomes the new state of the server. The call sodbs_server:cast(kv,stop) in stop/0 on line 12 is used to stop the server. The second argument, stop, of sodbs_server:cast(kv,stop) becomes the first argument of handle_cast/2 on line 31, and the second argument of handle_cast/2 is the state of the server. When handle_cast returns {stop,Reason,State}, the private server calls kv:terminate(Reason,State). This gives the server a chance to perform any final operations it wishes to carry out before exiting. When terminate/2 returns, the private server stops and all its registered names are removed.
In this example we have shown only a simple use of the private server. The sodbs_server manual lists all the values accepted by the arguments to the sodbs_server callback and control functions. The private server can be parameterized in many different ways, which makes it easy to run it as a local server or as a global server on a network of distributed SODBS nodes.
The private server also has a number of built-in debugging aids that are readily available to the programmer. When an error occurs inside a server built with sodbs_server, a complete trace of what went wrong is automatically added to the system error log. This information is usually very useful for post-mortem debugging of the server.
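The division of labour between the framework and the callback module can be sketched as below. This is a synchronous Python illustration, not SODBS code and not the kv listing referred to above: there are no processes or registered names, the classes PrivateServer and KV are hypothetical, and the reply values "ack" and "not_found" are assumptions.

```python
class KV:
    """Hypothetical callback module: a key-value store."""

    @staticmethod
    def init(_arg):
        return ("ok", {})

    @staticmethod
    def handle_call(term, _from, state):
        if term[0] == "store":
            _, key, value = term
            new_state = dict(state)
            new_state[key] = value
            return ("reply", "ack", new_state)
        if term[0] == "lookup":
            _, key = term
            if key in state:
                return ("reply", ("ok", state[key]), state)
            return ("reply", "not_found", state)
        raise ValueError(f"unexpected request {term!r}")

    @staticmethod
    def handle_cast(term, state):
        if term == "stop":
            return ("stop", "normal", state)
        return ("noreply", state)

    @staticmethod
    def terminate(_reason, _state):
        pass  # last chance to clean up; the return value is ignored


class PrivateServer:
    """Framework part: owns the state and dispatches to the callbacks."""

    def __init__(self, mod, arg):
        self.mod = mod
        tag, value = mod.init(arg)
        if tag != "ok":
            raise RuntimeError(f"server failed to start: {value!r}")
        self.state = value
        self.running = True

    def call(self, term):
        _tag, reply, self.state = self.mod.handle_call(term, None, self.state)
        return reply

    def cast(self, term):
        result = self.mod.handle_cast(term, self.state)
        if result[0] == "stop":
            _, reason, self.state = result
            self.mod.terminate(reason, self.state)
            self.running = False
        else:
            _, self.state = result
        return "ok"
```

The callback module contains only sequential code; everything to do with owning the state and routing requests lives in the framework class.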
6.3 Principles of the specific event manager
The event manager behaviour sodbs_event provides a framework for building application-specific event handling routines. Event managers can perform tasks such as:
Error handling.
Alarm management.
Debugging.
Equipment management.
An event manager provides a named object to which events can be sent. Zero or more event handlers can be installed in an event manager.
When an event arrives at an event manager, it is processed by all the event handlers installed in that event manager. Event managers can be manipulated at run time; in particular, we can install an event handler, remove an event handler, or replace one handler with another at run time.
We start with some definitions:
Event: something that happens.
Event manager: a program that co-ordinates the processing of events of the same class. The event manager provides a named object to which events can be sent.
Notification: the act of sending an event to an event manager.
Event handler: a function that can process events. Event handlers must be functions of the following type:
State x Event -> State'
The event manager maintains a list of {M,S} 2-tuples of the form "module x state". We call such a list a module-state (MS) list.
Suppose the internal state of the event manager is represented by the following MS list:
[{M1,S1},{M2,S2},…]
When the event manager receives an event E, this list becomes:
[{M1,S1New},{M2,S2New},…]
where {ok,SiNew}=Mi:handle_event(E,Si).
The event manager can be thought of as a generalization of a conventional finite state machine: instead of maintaining a single state, we maintain a set of states and a set of state transition functions.
As might be expected, the sodbs_event API contains a number of interface functions for manipulating the {Module,State} pairs in the server. sodbs_event is much more powerful than this brief introduction suggests. Full details can be found in the event-handling manual in the SODBS documentation.
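The module-state list and its update on each event can be sketched as follows. This is a Python illustration, not SODBS code; the class EventManager and the two handlers Counter and LastSeen are hypothetical, and Python classes stand in for callback modules.

```python
class Counter:
    """Hypothetical handler: counts the events it has seen."""

    @staticmethod
    def init(args):
        return ("ok", args)

    @staticmethod
    def handle_event(_event, state):
        return ("ok", state + 1)


class LastSeen:
    """Hypothetical handler: remembers the most recent event."""

    @staticmethod
    def init(_args):
        return ("ok", None)

    @staticmethod
    def handle_event(event, _state):
        return ("ok", event)


class EventManager:
    def __init__(self):
        self.ms_list = []  # the module-state (MS) list

    def add_handler(self, mod, args):
        _ok, state = mod.init(args)
        self.ms_list.insert(0, (mod, state))  # like [{Mod,S}|L]
        return "ok"

    def notify(self, event):
        # Every installed handler sees the event and yields a new state.
        self.ms_list = [
            (mod, mod.handle_event(event, state)[1])
            for mod, state in self.ms_list
        ]
        return "ok"
```

After installing Counter with initial state 0 and LastSeen, two notifications leave the MS list as [(LastSeen, second event), (Counter, 2)].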
6.3.1 The specific event manager API
The event manager (sodbs_event) exports the following functions:
sodbs_event:start(Name1)->{ok,Pid}|{error,Why}
Creates an event manager.
Name1 is the name of the event manager (see note 1).
{ok,Pid} means that the event manager started successfully. Pid is the process identifier of the event manager.
{error,Why} is returned when the event manager fails to start.
sodbs_event:add_handler(Name2,Mod,Args)->ok|Error
Adds a new handler to the event manager. If the state of the event manager was L, then when this operation succeeds the state of the event manager becomes [{Mod,S}|L], where S is the value obtained by calling Mod:init(Args).
Name2 is the name of the event manager (see note 1).
Mod is the name of a callback module (see note 2).
Args is an argument passed to Mod:init/1.
sodbs_event:notify(Name2,E)->ok
Sends an event E to the event manager. If the state of the event manager is a set of {Mi,Si} pairs and an event E is received, the state of the event manager becomes the set of {Mi,SiNew} pairs, where
{ok,SiNew}=Mi:handle_event(E,Si).
sodbs_event:call(Name2,Mod,Args)->Reply
Performs an operation on one of the event handlers in the event manager. If the state list of the event manager contains a tuple {Mod,S}, then Mod:handle_call(Args,S) is called. Reply is derived from the return value of this call.
sodbs_event:stop(Name2)->ok
Stops the event manager.
Notes:
1. Event managers follow the same naming conventions as private servers.
2. An event handler must export some or all of the following functions: init/1, handle_event/2, handle_call/2, terminate/2.
An event handler module should have the following API:
Mod:init(Args)->{ok,State}
where:
Args comes from the third argument to sodbs_event:add_handler/3.
State is the initial state of this event handler.
Mod:handle_event(E,S)->{ok,S1}
where:
E comes from the second argument to sodbs_event:notify/2.
S is the previous state of this event handler.
S1 is the new state of this event handler.
Mod:handle_call(Args,State)->{ok,Reply,State1}
where:
Args comes from the third argument to sodbs_event:call/3.
State is the previous state of this event handler.
Reply becomes the return value of sodbs_event:call/3.
State1 is the new state of this event handler.
Mod:terminate(Reason,State)->void
where:
Reason says why the event manager is being stopped.
State is the current state of this event handler.
6.3.2 Example of a specific event manager
We build a simple error logger with sodbs_event. The error logger keeps track of the five most recent error messages, and can display them when it receives a report event.
Note that the code in simple_logger.sodbs is purely sequential. Note also the similarity between the form of the arguments passed to sodbs_server and those passed to sodbs_event. In general, the arguments to functions such as start, stop and handle_call in the different behaviour modules are designed to be as similar as possible.
A simple error logger
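An event handler that keeps only the five most recent error messages can be sketched as follows. This is a Python illustration and not the contents of simple_logger.sodbs; the class name SimpleLogger, the event shape ("log", message) and the "report" request are assumptions made for the sketch.

```python
from collections import deque

MAX_MESSAGES = 5  # keep the five most recent messages


class SimpleLogger:
    """Hypothetical event handler in the init/handle_event/handle_call style."""

    @staticmethod
    def init(_args):
        return ("ok", deque(maxlen=MAX_MESSAGES))

    @staticmethod
    def handle_event(event, state):
        if isinstance(event, tuple) and len(event) == 2 and event[0] == "log":
            # Copy, then append: the oldest entry drops out when full.
            new_state = deque(state, maxlen=MAX_MESSAGES)
            new_state.append(event[1])
            return ("ok", new_state)
        return ("ok", state)  # ignore events this handler does not know

    @staticmethod
    def handle_call(args, state):
        if args == "report":
            return ("ok", list(state), state)
        raise ValueError(f"unexpected request {args!r}")
```

The handler is purely sequential: it never deals with processes or message passing, only with the mapping from an old state and an event to a new state.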
6.4 Principles of the special-purpose finite state machine
Many applications (for example, protocol stacks) can be modelled as finite state machines (FSMs). FSMs can be written with the finite state machine behaviour, sodbs_fsm.
An FSM can be described by a set of rules of the form:
State(S) x Event(E) -> Actions(A) x State(S')
...
This rule means:
If we are in state S and an event E occurs, we should perform the actions A and move to state S'.
If we choose to write an FSM with the sodbs_fsm behaviour, the state transition rules above should be written as SODBS functions that follow this convention:
StateName(Event,StateData)->
..code for actions here...
{next_state,StateName',StateData'}
6.4.1 The special-purpose finite state machine API
The finite state machine behaviour (sodbs_fsm) exports the following functions:
sodbs_fsm:start(Name1,Mod,Arg,Options)->Result
This function works in the same way as sodbs_server:start/4, discussed earlier.
sodbs_fsm:send_event(Name1,Event)->ok
Sends an event to the FSM identified by Name1.
The callback module Mod must export the following functions:
Mod:init(Arg)->{ok,StateName,StateData}
When an FSM starts, it calls init/1; Mod:init/1 should return an initial state StateName and some data StateData associated with that state. The next time sodbs_fsm:send_event(...,Event) is called, the FSM calls Mod:StateName(Event,StateData).
Mod:StateName(Event,SData)->{next_state,SName1,SData1}
While the FSM is running, StateName, Event and SData represent the current state of the FSM. The next state of the FSM should be SName1, and the data associated with the next state should be SData1.
6.4.2 Example of a special-purpose finite state machine
To illustrate a typical FSM application, we use sodbs_fsm to write a simple packet assembler. The packet assembler has two states, waiting and collecting. In the waiting state it expects to receive a message containing the packet length, upon which it enters the collecting state. In the collecting state it expects to receive a number of small data packets, which are to be assembled. When the total length of the small packets equals the total packet length, the FSM prints the assembled packet and re-enters the waiting state.
We can see how the packet assembler is used by entering a few commands in the SODBS shell:
>packet_assembler:start().
{ok,<0.44.0>}
>packet_assembler:send_header(9).
ok
>packet_assembler:send_data("Hello").
ok
>packet_assembler:send_data(" ").
ok
>packet_assembler:send_data("Joe").
Got data:Hello Joe
ok
We stress once more that sodbs_fsm is much more useful than this description suggests.
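The two-state packet assembler can be sketched as follows. This is a Python illustration, not the packet_assembler module itself; the class layout, the event shapes ("header", Length) and ("data", Chunk), and the assembled list (used here instead of printing) are assumptions made for the sketch.

```python
class PacketAssembler:
    """Two-state FSM: waiting -> collecting -> waiting."""

    def __init__(self):
        self.state_name = "waiting"
        self.state_data = None
        self.assembled = []  # completed packets, recorded instead of printed

    def send_event(self, event):
        # Dispatch on the current state name, as Mod:StateName(Event, Data).
        handler = getattr(self, "_" + self.state_name)
        _tag, self.state_name, self.state_data = handler(event, self.state_data)
        return "ok"

    def _waiting(self, event, _data):
        kind, length = event
        if kind != "header":
            raise ValueError(f"unexpected event {event!r} in state waiting")
        return ("next_state", "collecting", (length, 0, []))

    def _collecting(self, event, data):
        kind, chunk = event
        if kind != "data":
            raise ValueError(f"unexpected event {event!r} in state collecting")
        need, have, parts = data
        parts = parts + [chunk]
        have += len(chunk)
        if have > need:
            # Not covered by the description above, so raise an exception.
            raise ValueError("received more data than the header announced")
        if have == need:
            self.assembled.append("".join(parts))
            return ("next_state", "waiting", None)
        return ("next_state", "collecting", (need, have, parts))
```

Sending a header of 9 followed by the chunks "Hello", " " and "Joe" leaves "Hello Joe" in assembled and returns the machine to the waiting state.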
6.5 special-purpose overseer's (Supervisor) principle
Up to the present, we talked about emphatically all is in order to solve some basic behaviour of typical application problem, and most of problem also can solve with behaviour such as basic client-server, event handling and FSM in the application and write.Here this behaviour of the sodbs_sup that will say is first yuan behavior (meta-behaviour), promptly is used for basic behaviour is bonded into the behaviour of a supervision system.
6.5.1 special-purpose overseer's API
Special-purpose overseer's API is extremely simple:
supervisor:start_link(Namel,Mod,Arg)->Result
This function is opened an overseer, calls Mod:init (Arg) function therebetween.
Callback module Mod must derive the init/1 function, specification as:
Mod:init(Arg)->SupStrategy
SupStrategy is an item formula of describing the supervision tree.
SupStrategy is an item the formula how workers that describe in the supervision tree are activated, stop and restart.I here do not describe in detail, and the example of an ensuing simple supervision tree has more detailed description.Full details about special-purpose overseer can be referring to the relevant portion of user manual.
6.5.2 Example of a special-purpose overseer
Our example is an overseer that supervises the three workers introduced in the preceding sections. We now look at what happens when these workers fail at run time.
The simple_sup.sodbs module defines the behaviour of this overseer. It begins by calling supervisor:start_link/3 at line 7, which is consistent with the calling convention of the other behaviours in the system. MODULE is a macro that expands to the name of the current module, simple_sup. The last parameter is set to nil. When the overseer starts, it passes the third parameter of start_link/3 to the init/1 function of the specified callback module.
init/1 returns a data structure that defines the shape of the supervision tree and the strategy to be used. The term {one_for_one,5,1000} (line 11) tells the overseer to build an "or"-type supervision tree (see section 5.2.3), because the three workers it supervises are unrelated to one another. The numbers 5 and 1000 specify a restart frequency: if the overseer restarts its supervisees more than 5 times within 1000 seconds, the overseer itself fails.
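The restart frequency can be illustrated with a minimal sketch, written in Python rather than SODBS. The class name RestartLimiter and its method are hypothetical and are not part of the SODBS API; the sketch assumes a sliding time window, which is one plausible reading of "more than 5 times within 1000 seconds".

```python
from collections import deque


class RestartLimiter:
    """Sliding-window model of a restart frequency such as {one_for_one,5,1000}."""

    def __init__(self, max_restarts, period):
        self.max_restarts = max_restarts  # e.g. 5
        self.period = period              # window length, same unit as the timestamps
        self._restarts = deque()          # timestamps of recent restarts

    def allow_restart(self, now):
        """Record a restart at time `now`; return False once the limit is exceeded."""
        # Forget restarts that have fallen out of the window.
        while self._restarts and now - self._restarts[0] > self.period:
            self._restarts.popleft()
        self._restarts.append(now)
        return len(self._restarts) <= self.max_restarts


limiter = RestartLimiter(max_restarts=5, period=1000)
burst = [limiter.allow_restart(t) for t in range(6)]  # six restarts in six seconds
```

In this model the sixth restart of the burst is refused, which corresponds to the overseer itself failing and leaving the decision to its own overseer.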
A simple overseer
Our supervision tree contains three supervised objects, but only the way the packet assembler is added to the tree is described here. The other two workers are added in the same way.
Lines 13-15 specify the packet assembler worker.
Line 13 begins the description of how the packet assembler is supervised. The first element of the tuple, the atom packet, is an arbitrary name (it must be unique within this overseer instance) that can be used to refer to the node in the supervision tree.
Because the supervisees are themselves instances of SODBS behaviours, adding them to the supervision tree is easy. The next parameter (line 14) is a 3-tuple {M,F,A} that the overseer uses to start the specified process: when the overseer needs to start a supervised process, it calls apply(M,F,A).
The first parameter on line 15, permanent, says that the supervised process is a so-called "permanent" process. A permanent process is restarted automatically by its overseer when it fails.
A supervised process must not only specify how it is started; it must also be written in a certain way. For example, it must terminate in an orderly manner when the overseer asks it to stop. To do so, the supervised process must observe the so-called "shutdown protocol".
The overseer stops a worker process by calling shutdown(P,How), where P is the worker's Pid and How determines how the worker is stopped. shutdown is defined as follows:
shutdown(Pid, brutal_kill) ->
    exit(Pid, kill);
shutdown(Pid, infinity) ->
    exit(Pid, shutdown),
    receive
        {'EXIT', Pid, shutdown} -> true
    end;
shutdown(Pid, Time) ->
    exit(Pid, shutdown),
    receive
        {'EXIT', Pid, shutdown} ->
            true
    after Time ->
        exit(Pid, kill)
    end.
If How is brutal_kill, the worker process is killed unconditionally (see section 3.5.6).
If How is infinity, a shutdown signal is sent to the worker process, which should respond with an {'EXIT',Pid,shutdown} message.
If How is an integer T, the worker process must stop within T milliseconds; if no {'EXIT',Pid,shutdown} message is received within T milliseconds, the process is killed unconditionally.
The integer 500 on line 15 is the "shutdown time" required by the shutdown protocol. It means that when the overseer wants to stop a supervised process, the process is allowed at most 500 milliseconds to finish what it is currently doing.
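The three cases of the shutdown protocol can be modelled with the following sketch. It is written in Python with threads and queues standing in for SODBS processes and messages; all names are hypothetical. Python threads cannot be forcibly killed, so the result "killed" only records that the overseer gave up waiting.

```python
import queue
import threading


def shutdown(worker_inbox, worker_replies, how):
    """Ask a worker to stop; return "shutdown" if it acknowledged, else "killed"."""
    if how == "brutal_kill":
        return "killed"                      # no request, no waiting
    worker_inbox.put("shutdown")             # polite request to stop
    timeout = None if how == "infinity" else how / 1000.0  # milliseconds -> seconds
    try:
        worker_replies.get(timeout=timeout)  # wait for the acknowledgement
        return "shutdown"
    except queue.Empty:
        return "killed"                      # deadline passed without a reply


def cooperative_worker(inbox, replies):
    """Worker that follows the protocol: stop when asked, then acknowledge."""
    while inbox.get() != "shutdown":
        pass
    replies.put("exit")


def start_worker(target):
    """Start a worker thread and return its inbox and reply queue."""
    inbox, replies = queue.Queue(), queue.Queue()
    threading.Thread(target=target, args=(inbox, replies), daemon=True).start()
    return inbox, replies
```

A worker started with start_worker(cooperative_worker) acknowledges within the 500 millisecond allowance, whereas a worker that never answers is reported as killed once the allowance has elapsed.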
The parameter worker indicates that the supervised process is a worker process (as noted in section 5.2, a supervised process can be either a worker or an overseer), and [packet_assembler] is the list of all modules used by this behaviour (this parameter is needed when synchronizing code change operations).
Once everything has been defined, we can compile and run the overseer. In the demonstration session that follows, an overseer is started and several errors are triggered in its supervisees. The supervisees die and are restarted automatically by the overseer.
The first example shows what happens when an error occurs in the packet assembler. We start the overseer and check the Pid of the packet assembler.
1> simple_sup:start().
Packet assembler starting
Key-Value server starting
Logger starting
{ok,<0.30.0>}
2> whereis(my_simple_packet_assembler).
<0.31.0>
The printout shows that all the servers have started.
We now send a command specifying a packet length of 3 bytes, followed by 4 bytes of data:
3> packet_assembler:send_header(3).
ok
4> packet_assembler:send_data("oops").
packet assembler terminated:
{if_clause,
 [{packet_assembler,collecting,2},
  {sodbs_fsm,handle_msg,7},
  {proc_lib,init_p,5}]}
ok
Packet assembler starting
=ERROR REPORT==== 3-Jun-2007::12:38:07 ===
** State machine my_simple_packet_assembler terminating
** Last event in was "oops"
** When State == collecting
**      Data  == {3,0,[]}
** Reason for termination =
** {if_clause,[{packet_assembler,collecting,2},
               {sodbs_fsm,handle_msg,7},
               {proc_lib,init_p,5}]}
This error produces a good deal of output. First, the packet assembler crashes, as the first error printout shows. Next, the overseer detects that the packet assembler has crashed and restarts it; on restart, the process prints the message "Packet assembler starting". Finally, there is a long error message containing the useful information we want.
This error message contains the state of the FSM at the moment of the crash. It tells us that the FSM was in the state collecting, that the data associated with this state was the 3-tuple {3,0,[]}, and that the event that caused the FSM to crash was "oops". This information is very useful for debugging the FSM.
Here the error log is sent directly to standard output. In a production system, however, the error log is configured to be written to persistent storage and to trigger the alarm devices. The alarm devices comprise a commercially available JCJ601 intelligent audible-visual alarm with an RS485 interface and a standard serial-port GSM modem (with a built-in China Mobile SIM card). The monitoring program checks all devices and processes at regular intervals (every second). As soon as any problem occurs, it sends the alarm information through the serial port; the audible-visual alarm and the GSM modem then issue an acousto-optic alarm and an SMS alarm (including the necessary diagnostic information), and the corresponding error message is also shown on the display in the control room. For faults that the system can repair by itself, the alarm is raised only once. For faults that the system cannot repair by itself (for example, faulty hardware that cannot be restarted), the alarm is repeated once per minute until a technician clears the fault.
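The alarm repetition rule described above can be expressed as a small sketch. It is written in Python for illustration; the function name alarm_times and its parameters are hypothetical, and times are given in seconds.

```python
def alarm_times(self_repairable, fault_at, fixed_at, interval=60):
    """Return the times (in seconds) at which an alarm is raised for one fault.

    A fault the system repairs by itself is reported once. Any other fault is
    reported immediately and then once per `interval` until it is fixed.
    """
    times = [fault_at]
    if self_repairable:
        return times
    t = fault_at + interval
    while t < fixed_at:
        times.append(t)
        t += interval
    return times


once = alarm_times(True, fault_at=0, fixed_at=600)       # restarted automatically
repeated = alarm_times(False, fault_at=0, fixed_at=180)  # technician arrives after 3 minutes
```

The first call models a process that the overseer restarts automatically; the second models faulty hardware that stays broken until a technician intervenes.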
We can check that the overseer has correctly restarted the packet assembler: evaluating whereis(my_simple_packet_assembler) returns the Pid of the newly started packet assembler.
6> whereis(my_simple_packet_assembler).
<0.40.0>
7> packet_assembler:send_header(6).
ok
8> packet_assembler:send_data("Ok now").
Got data:Ok now
ok
In a similar way, we can trigger the error that was deliberately left in the Key-Value server:
12> kv:store(a,1).
ack
13> kv:lookup(a).
{ok,1}
14> spawn(fun() -> kv:lookup(crash) end).
<0.49.0>
K-V server terminating
Key-Value server starting
15>
=ERROR REPORT==== 3-Jun-2007::12:54:10 ===
** Generic server kv terminating
** Last message in was {lookup,crash}
** When Server state == {dict,1,
16,
16,
...many lines removed...
** Reason for termination ==
** {badarith,[{kv,handle_call,3},{proc_lib,init_p,5}]}
15> kv:lookup(a).
error
Note that kv:lookup(crash) must be called from a temporary process that is not linked to the shell process (query shell). This is because the overseer was started by calling supervisor:start_link/3 and is therefore linked to the shell process. Calling kv:lookup(crash) directly in the shell would also crash the overseer process, which is probably not what we want. Note also how the special-purpose overseer and the predefined behaviours work together: they are not designed to be used in isolation, but to complement one another.
In addition, the default behaviour is to record as much useful information as possible in the error log and to try to leave the system in a safe state.
6.6 Principle of the dedicated application (Application)
So far we have built three basic behaviours and placed them in a supervision tree; what remains is to put everything into an application.
An application is a container for everything needed to deliver an application program.
Applications are written differently from the behaviours discussed earlier. Those behaviours use a callback module that exports a number of predefined functions.
An application does not use callback functions; instead it takes the form of a particular organization of files, directories, and subdirectories in the file system. The most important part of an application is the application descriptor file, a file with the extension .app that describes all the resources the application needs.
6.6.1 API of the dedicated application
An application is described by an application descriptor file, whose extension is .app. The user manual defines the structure of an application's .app file as follows:
{application,Application,
 [{description,Description},
  {vsn,Vsn},
  {id,Id},
  {modules,[Module1,..,ModuleN]},
  {maxT,MaxT},
  {registered,[Name1,..,NameN]},
  {applications,[Appl1,..,ApplN]},
  {included_applications,[Appl1,..,ApplN]},
  {env,[{Par1,Val1},..,{ParN,ValN}]},
  {mod,{Module,StartArgs}},
  {start_phases,[{Phase1,PhaseArgs1},..,
                 {PhaseN,PhaseArgsN}]}]}.
All keys in the application association list are optional; if a key is omitted, a reasonable default value is used.
6.6.2 Example of a dedicated application
We start the application and test one of its servers:
1> application:start(simple,temporary).
Packet assembler starting
Key-Value server starting
Logger starting
ok
2> packet_assembler:send_header(2).
ok
3> packet_assembler:send_data("hi").
ok
Got data:hi
We can stop this application now:
4> application:stop(simple).
=INFO REPORT==== 3-Jun-2007::14:33:26 ===
    application: simple
    exited: stopped
    type: temporary
ok
After the application is stopped, all of its running processes are shut down in turn.
6.7 Systems and releases
This chapter has been presented "bottom-up". It began with simple things and combined them into larger and more complex units. We started with several basic behaviours such as sodbs_server, sodbs_event, and sodbs_fsm, organized these basic behaviours into a supervision hierarchy, and then built the supervision hierarchy into an application package.
The final step is to build application packages into a release. A release packages several different applications into a single conceptual unit. The result is a small number of files that can be ported to the target environment.
Building a complete release is a complex process: a release must describe not only the current state of the system but also its previous versions.
A release must contain information about the current version of the software and about its earlier releases. In particular, it should contain the procedure for upgrading the system from the previous version of the software to the current version. This upgrade usually has to be performed without stopping the system. A release must also be able to handle the case where installation of the new software fails for some reason. If a new release fails, the system should be able to revert to an earlier stable state. All of this is handled by the release management component of the SODBS system.
6.7.1 Software upgrade
We have built our own release management system, which extends the release package facilities of the SODBS system. It is inherently a distributed system. We want software upgrades of the distributed system to have transactional semantics: either every node in the system is upgraded, or, if the upgrade fails, the software on every node is left unchanged.
In this system, two versions of the software for the whole system coexist: an old version and a new version. When a new version is added, the current version becomes the old version and the added version becomes the new version.
To make this possible, all BMR software upgrade packages are written in a reversible way. That is, the old version can be upgraded dynamically to the new version, and the new version can also be changed back to the old version.
The software on all nodes is upgraded in four stages.
1. In the first stage, the new version of the software is distributed to every node; this usually succeeds.
2. In the second stage, the software on all nodes is switched from the old version to the new version. If the switch fails on any node, all nodes already running the new version revert to the old version.
3. In the third stage, all nodes in the system run the new version; if any error occurs, all nodes roll back to the old version. At this point the system has not yet committed to the new version.
4. In the fourth stage, after the new system has run successfully for a sufficiently long time, the operator can "commit" the software change. Committing changes the behaviour of the system: if an error occurs after the commit, the system restarts with the new version instead of rolling back to the old version.
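The four stages can be summarized in the following sketch, written in Python purely for illustration. The Node class and the functions upgrade, abort, and commit are hypothetical names; they model the all-or-nothing semantics described above and are not the release management code of the SODBS system.

```python
class Node:
    """Hypothetical node that keeps an old and a new software version side by side."""

    def __init__(self, name, version, fail_on_switch=False):
        self.name = name
        self.running = version        # version currently executing
        self.staged = None            # new version distributed but not yet active
        self.previous = None          # old version kept for rollback
        self.fail_on_switch = fail_on_switch

    def switch(self):
        if self.fail_on_switch:
            raise RuntimeError(f"{self.name}: switch failed")
        self.previous, self.running = self.running, self.staged

    def rollback(self):
        if self.previous is not None:  # nothing to undo after a commit
            self.running, self.previous = self.previous, None


def upgrade(nodes, new_version):
    """Stages 1 and 2: distribute, then switch every node or none at all."""
    for node in nodes:                 # stage 1: distribute the new version
        node.staged = new_version
    switched = []
    try:
        for node in nodes:             # stage 2: switch old -> new
            node.switch()
            switched.append(node)
    except RuntimeError:
        for node in switched:          # one failure reverts every switched node
            node.rollback()
        return False
    return True                        # stage 3 begins: new version, uncommitted


def abort(nodes):
    """Stage 3: an error before the commit rolls every node back."""
    for node in nodes:
        node.rollback()


def commit(nodes):
    """Stage 4: after the commit the old version is no longer a fallback."""
    for node in nodes:
        node.previous = None
```

Because commit discards the stored old version, a later call to abort leaves the nodes on the new version, which mirrors the statement that the system restarts with the new software after a commit.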
This mechanism is similar to the one used in the X2000 deep-space application system developed by NASA, whose software must also be upgraded without halting the system.
In addition, this system takes into account the case where a node of the distributed system is "out of service" while the software is being upgraded. In that case, when the node rejoins the system, it learns of the changes made to the system while it was offline and then performs any necessary software upgrades.
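The catch-up step for a rejoining node can be sketched as follows. The sketch is in Python and assumes, hypothetically, that the system keeps an ordered log of committed versions; the function name catch_up is not taken from the SODBS system.

```python
def catch_up(node_version, upgrade_log):
    """Return the versions a rejoining node still has to install, oldest first.

    `upgrade_log` is the ordered list of versions the system has moved through;
    `node_version` is the version the node was running when it went offline.
    """
    if node_version not in upgrade_log:
        raise ValueError(f"unknown version: {node_version}")
    position = upgrade_log.index(node_version)
    return upgrade_log[position + 1:]


missed = catch_up("v2", ["v1", "v2", "v3", "v4"])
```

A node that was offline during two upgrades receives both in order, while a node that is already current receives an empty list.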
6.7.2 Online replacement of hardware
As with system software upgrades, hardware nodes are added and removed online, using the software upgrade procedure. When the operator adds a hardware node to the system or removes one, a new software configuration is proposed, and the system distributes this new configuration in the form of a software upgrade. All software nodes learn of the newly available hardware resources and begin to use them, and no software accesses the resources of a removed hardware node any longer. Once the new software configuration takes effect, the system sends a notification that the software upgrade is complete; on that basis the operator can confirm that the added hardware is in service, or power off and take away the hardware node that has been logically removed.
6.8 Discussion
The special modules that implement behaviours in the SODBS system are written by experts. These modules are based on many years of experience in the financial industry and in engineering, and represent "best practice" for writing code that solves particular problems.
Systems built with the SODBS behaviours have a very regular structure; for example, all client-servers and all supervision trees have the same structure. Using behaviours forces a common structure onto the solution of a given problem. The application programmer only has to provide the code that defines the semantics of the specific problem; all the infrastructure is provided automatically by the behaviour.
A new programmer who joins an existing team finds the behaviour-based way of solving problems easier to understand. Once familiar with the behaviours, the programmer can readily recognize which behaviour to use in which situation.
Most of the "tricky" parts of systems programming are hidden in the implementation of the behaviours (these parts are in fact more complex than described here). Looking back at the client-server and event handler behaviours, you will find that all the code dealing with concurrency, message passing, and similar matters is isolated in the "generic" part of the behaviour, while the "problem-specific" code consists of purely sequential functions with well-defined types.
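The separation between the generic part and the problem-specific part can be illustrated with a minimal Python sketch. The names GenericServer and kv_handle_call are hypothetical; the sketch imitates the division of labour of a client-server behaviour and is not the implementation of sodbs_server.

```python
import queue
import threading


def kv_handle_call(request, state):
    """Problem-specific part: a pure function (request, state) -> (reply, new state)."""
    op, *args = request
    if op == "store":
        key, value = args
        return "ack", {**state, key: value}
    if op == "lookup":
        (key,) = args
        return (("ok", state[key]) if key in state else "error"), state
    return "error", state


class GenericServer:
    """Generic part: owns the thread, the mailbox, and the message passing."""

    def __init__(self, handle_call, state):
        self._handle_call = handle_call
        self._state = state
        self._mailbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            request, reply_box = self._mailbox.get()
            reply, self._state = self._handle_call(request, self._state)
            reply_box.put(reply)

    def call(self, request):
        """Send a request and block until the server replies."""
        reply_box = queue.Queue(maxsize=1)
        self._mailbox.put((request, reply_box))
        return reply_box.get()


server = GenericServer(kv_handle_call, {})
```

The callback kv_handle_call contains no threads, queues, or locks and can be tested on its own as an ordinary sequential function, while GenericServer can be reused unchanged for any other callback of the same shape.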
This is exactly the situation one wants in programming: the concurrent, "difficult" part of a program is isolated in a few small, well-defined parts of the system, and most of the code in the system can be written as sequential programs with well-defined types.
In our system the behaviours solve orthogonal problems; for example, client-server has nothing to do with worker-overseer. When building a real system we can pick and mix behaviours, combining them in different ways to solve problems.
Providing software developers with a small, fixed set of behaviours has many benefits:
It focuses attention on a group of well-proven techniques. We know in advance that each technique works well in practice. If the design were left unrestricted and completely free, designers might be tempted to produce things of unnecessary complexity, or things that cannot be implemented at all.
It allows designers to construct and discuss designs in a precise way. It provides a common vocabulary for discussion.
It closes the feedback loop between design and implementation. All the behaviours described here are of practical use.
7 Introduction to SODBS
The self-organized distributed financial system (Self-Organized & Distributed Banking System) is a software system designed for building and running secure, robust financial systems. The SODBS system is designed as a so-called "middleware platform" that runs on top of a conventional operating system.
An SODBS release contains the following parts:
1. The SODBS compiler and development tools.
2. SODBS runtime systems for a variety of target environments.
3. Libraries covering a wide range of common applications.
4. A set of design patterns that implement common behaviour patterns.
5. Educational material for learning how to use the system.
6. Extensive documentation.
SODBS has been ported to many different operating systems, including all Unix-like systems (SCO OpenServer, Linux, FreeBSD, Solaris, OS-X ...) and most Windows operating systems (Windows 2000, Windows 2003 ...).
The SODBS runtime system is a virtual machine, written in C, that runs the intermediate code produced by the BEAM compiler. It also provides run-time support for the native code produced by the SODBS compiler.
The SODBS runtime system provides many services that are traditionally provided by an operating system, so it does far more than provide run-time support for a purely sequential language and is considerably more complex. All SODBS processes are managed by the SODBS runtime system: even when a runtime system controls tens of thousands of SODBS processes, the host operating system sees only one running process, namely the SODBS runtime system itself.
On the other hand, compared with other languages the SODBS compiler is quite simple. Compilation is usually just a simple translation from SODBS code to suitable virtual machine primitives. For example, the spawn primitive in SODBS is translated into a single opcode in the virtual machine (the implementation of the spawn primitive), and great effort is then spent on making the implementation of that opcode as efficient as possible.
7.1 Libraries
The SODBS release package contains a very large set of libraries; for the purposes of a release, all of these libraries are treated as instances of SODBS applications. For example, release package V1.0 contains the following applications:
appmon --- a graphical tool for monitoring and manipulating supervision trees.
asn1 --- a compiler and run-time support package for encoding and decoding according to ASN.1 definitions.
compiler --- the SODBS compiler.
crypto --- a set of functions for encrypting and decrypting data and for computing message digests.
debugger --- the SODBS source code debugger.
sodbs_interface --- a set of libraries for communicating with distributed SODBS nodes.
erts --- the SODBS runtime system.
et --- the event tracer, with tools for recording event data and presenting it graphically.
eva --- an application that handles "events and alarms".
gs --- the graphics system, a set of graphics functions for building GUIs.
ic --- the SODBS IDL compiler.
inets --- an HTTP server and an FTP client.
jinterface --- a tool for writing interfaces between Java and SODBS.
kernel --- one of the two basic libraries the system needs in order to run (the other is stdlib). It includes implementations of the file server and the code server.
megaco --- a set of libraries supporting the Megaco2/H248 protocol.
mnemosyne --- a database query language for use with sodbs_db.
sodbs_db --- the SODBS DBMS (database management system), which has soft real-time characteristics.
observer --- a set of tools for tracing and observing the behaviour of a distributed system.
odbc --- an ODBC interface through which SODBS accesses SQL databases.
orber --- an SODBS implementation of a CORBA Object Request Broker. Note: there are also separate applications that provide access to various CORBA services (such as events, notification, and file transfer).
os_mon --- a tool for monitoring resource usage in the external operating system.
parsetool --- tools for parsing SODBS, including yecc, an LALR(1) parser generator.
pman --- a graphical tool for inspecting the state of the system. pman can be used to inspect local or remote SODBS nodes.
runtime_tools --- miscellaneous small routines needed by the runtime system.
sasl --- short for "System Architecture Support Libraries". This application includes support for alarm handling and for managing releases.
snmp --- an SODBS implementation of the Simple Network Management Protocol. This application includes a MIB compiler and tools for writing MIBs.
ssl --- an SODBS interface to the secure sockets layer.
stdlib --- the set of "essential" SODBS libraries the system needs in order to run. The other essential library set is kernel.
toolbar --- a graphical toolbar from which applications can be launched.
tools --- a package of stand-alone applications for analysing and monitoring SODBS programs. These include tools for profiling, coverage analysis, and cross-reference analysis.
tv --- the "table viewer", a graphical application for browsing the tables in the sodbs_db database.
webtool --- a system for managing web-based tools (such as inets).
postool --- a tool system for managing POS machines.
e-bank --- a system for electronic payment.
The SODBS library set provides a highly mature tool set, although the library set is also rather large and heterogeneous.
8 Case study
This part of the document presents a case study of a system deployed on the SODBS platform: the SODBS-1 system of Beijing Property & Credit Guarantee Co Ltd, a large-capacity financial payment system. SODBS-1 makes extensive use of the SODBS library set, so it provides good evidence of the functionality of that library set.
8.1 Methodology
In the case study, I focus on the following aspects:
Problem domain --- What is the problem domain? Is this the kind of problem that SODBS was designed to solve?
Quantitative properties of the code --- How many lines of code were written? How many modules are there in total? How is the code organized? Did the programmers follow the design rules? Were the design rules useful? What was good? What was bad?
Evidence of fault tolerance --- Is the system fault-tolerant? SODBS was designed precisely for building fault-tolerant systems. Is there evidence that run-time errors occurred and were successfully corrected? Is the information produced when a programming error occurs sufficient to correct the program afterwards?
Understandability of the system --- How understandable is the system? Is it easy to maintain?
I ask some general questions about the properties of the system, and I also look for concrete evidence that the system really behaves in the way we expect. In particular:
1. Is there evidence that the system has crashed because of a programming error, that the error was repaired, that the system recovered from the error, and that it ran satisfactorily after the error was repaired?
2. Is there evidence that the system has run for a long time, that software errors occurred during that time, and that the system nevertheless remained stable?
3. Is there evidence that the code of the system has been upgraded "on the fly"?
4. Is there evidence that mechanisms such as garbage collection work (that is, that we have run a garbage-collected system for a long time without garbage collection errors)?
5. Is there evidence that the information in the error log is useful for locating an error after it has occurred?
6. Is there evidence that all the code of the system is organized in such a way that most programmers need not care about the details of the concurrency patterns used in the system?
7. Is there evidence that the supervision trees work as intended?
8. Is the code organized according to the "clean/dirty" style?
Items 1, 2, and 5 above are included because we want to check whether our ideas about writing fault-tolerant systems work in practice.
Item 4 checks that garbage collection really works in a system that has to run for a long time.
Item 6 is a measure of the abstraction power of the SODBS behaviours. There are many reasons for wanting to "abstract away" the details of concurrency, which is pervasive in many situations. The SODBS behaviour set attempts to do exactly this. The extent to which the programmers who write our programs can ignore concurrency is an important indicator of whether the SODBS behaviours are suitable for developing system software.
We can assess the degree to which concurrency can be ignored by observing how often programmers use explicit message passing and process manipulation primitives in their code.
Item 7 checks whether the overseer strategy works as expected.
Item 8 checks whether we can program according to the programming rules given in Appendix B. In particular, those guidelines stress the importance of organizing the system in a "clean/dirty" manner. By "clean" code we mean code without side effects; such code is easier to understand than "dirty" code, which is code with side effects.
Our whole system is closely tied to hardware operation, and hardware operation introduces side effects. What concerns us is therefore not avoiding side effects altogether, but the extent to which we can confine them to as few modules as possible. Rather than letting code with side effects spread evenly through the whole system, we would like most side effects to be confined to a small number of "dirty" modules, with most modules written without side effects and combined with the "dirty" modules to form the whole system. An analysis of the code will show whether this form of organization is feasible.
Of course, "counter-examples" are also important. We want to know all the cases in which our approach does not apply, and whether this inapplicability is a serious problem.
8.2 SODBS-1
The SODBS-1 system is an application instance on trial operation at Beijing Property & Credit Guarantee Co Ltd. The whole system is built from a number of scalable modules: each module provides payment capacity for 10000 online users, so 16 modules connected together form an online financial payment system supporting 160000 users.
SODBS-1 is designed for "carrier-class" non-stop operation. The system provides hardware redundancy through duplicated hardware, and hardware can be added to or removed from the system without interrupting service. The software must cope with both hardware and software faults. Because the system is designed for non-stop operation, it must be possible to modify the software without disturbing the traffic in the system.
8.3 Quantitative properties of the software
The results of a simple statistical analysis of the SODBS software are shown below. The statistics reflect the state of the system on June 5th, 2008.
This analysis is concerned only with some quantitative properties of the SODBS code of the system. The overall quantitative properties of the system are as follows:
Number of SODBS modules: 2248
Number of "clean" modules: 1472
Number of "dirty" modules: 776
Lines of code: 1136150
Number of SODBS functions: 57412
Number of "clean" functions: 53322
Number of "dirty" functions: 4090
Ratio of "dirty" functions to lines of code: 0.359%
The table above is based on a simple analysis that classifies each module or function as "clean" or "dirty". A module is considered "dirty" if any function in it is "dirty"; otherwise it is considered "clean". To simplify matters, a function is said to be dirty if it sends or receives data, or if it calls one of the following SODBS BIFs:
apply,cancel_timer,check_process_code,delete_module,
demonitor,disconnect_node,erase,group_leader,halt,link,
load_module,monitor_node,open_port,port_close,port_command,
port_control,process_flag,processes,purge_module,put,register,
registered,resume_process,send_nosuspend,spawn,spawn_link,
spawn_opt,suspend_process,system_flag,trace,trace_info,
trace_pattern,unlink,unregister,yield.
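The classification rule can be sketched in a few lines of Python. The sketch uses only an excerpt of the BIF list above, and the two example modules and their contents are hypothetical.

```python
# Excerpt of the BIF list above; a real analysis would use the full list.
DIRTY_BIFS = {"apply", "erase", "link", "open_port", "put",
              "register", "spawn", "spawn_link", "unlink"}


def is_dirty_function(called_names, sends_or_receives=False):
    """A function is dirty if it sends or receives data or calls a listed BIF."""
    return sends_or_receives or any(name in DIRTY_BIFS for name in called_names)


def is_dirty_module(functions):
    """A module is dirty if any one of its functions is dirty.

    `functions` maps a function name to (called_names, sends_or_receives).
    """
    return any(is_dirty_function(calls, msg) for calls, msg in functions.values())


# Hypothetical modules, for illustration only.
kv = {"lookup": (["dict_find"], False), "start": (["register", "spawn_link"], False)}
lists_util = {"reverse": ([], False), "sum": (["foldl"], False)}
```

Here kv counts as a dirty module because its start function calls register and spawn_link, whereas lists_util is clean.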
Why distinguishing like this, is all potential danger can be arranged because called the code segment of these BIF.Please note that we have descended a special definition of simplifying for " dirty module " here.As if should give " dirty module " recursively definition instinctively, if " dirty function " in " dangerous " BIF that promptly any function call arranged in module or another module, then this module is " a dirty module ".Unfortunately, judge that nearly all module all will be judged to " dirty module " in the system so if define according to this.
The reason is that if you compute the transitive closure of all function calls made from the functions that a module exports, you find that it includes virtually every module in the system. The closure is so large because of the "leakage" that occurs through many of the modules in the C-language libraries.
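A small sketch shows why the recursive definition degenerates. The call graph below is invented for illustration: a single shared library module (here called c_lib, a hypothetical name) that calls back into the rest of the system is enough to make every module reach a dirty one.

```python
def reachable(call_graph, start):
    """Return every module reachable from `start` by following calls (transitive closure)."""
    seen, stack = set(), [start]
    while stack:
        mod = stack.pop()
        for callee in call_graph.get(mod, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen

# Made-up module-level call graph.
call_graph = {
    "billing": ["codec", "log"],
    "codec": ["c_lib"],
    "log": ["c_lib"],
    "c_lib": ["billing", "codec", "log", "net"],  # shared library that "leaks" everywhere
    "net": [],
}
directly_dirty = {"net"}

# Recursive definition: dirty if the module itself, or anything it can reach, is dirty.
recursively_dirty = {
    m for m in call_graph
    if m in directly_dirty or reachable(call_graph, m) & directly_dirty
}
print(sorted(recursively_dirty))
```

Only one module is dirty by the direct definition, yet all five are dirty by the recursive one.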
We take the simplifying view that all modules are well written and tested, and that if a module does contain side effects, it was written with care so that those side effects do not leak out of the module and affect the code that calls it.
Under this definition, 65% of the modules are clean. Since a module counts as dirty as soon as it contains a single dirty function, the ratio of clean to dirty functions is perhaps more interesting. A function counts as dirty if it makes even one call to an impure BIF. At the function level, 92% of all functions are written in a side-effect-free manner.
Note also that the code contains 4090 dirty functions in total, that is, fewer than 4 dirty functions per 1000 lines of code.
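The percentages quoted here can be recomputed from the figures in the table above. The following Python lines are only a worked check of that arithmetic; the variable names are chosen for the example.

```python
# Figures taken from the statistics table above.
modules_total, modules_clean = 2248, 1472
lines_of_code = 1136150
functions_total, functions_clean, functions_dirty = 57412, 53322, 4090

clean_module_share = modules_clean / modules_total
clean_function_share = functions_clean / functions_total
dirty_per_kloc = functions_dirty / lines_of_code * 1000

print(f"clean modules:   {clean_module_share:.1%}")
print(f"clean functions: {clean_function_share:.1%}")
print(f"dirty functions per 1000 lines: {dirty_per_kloc:.2f}")
```

The clean and dirty function counts add up to the total number of functions, and the dirty-function density stays below 4 per 1000 lines.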
The distribution of dirty functions is both encouraging and in need of improvement. The good news is that 95% of the dirty functions are found in just 1% of the modules. The bad news is that a large number of modules each contain a very small number of dirty functions: for example, 200 modules contain exactly 1 dirty function, 156 modules contain 2, and so on.
An interesting point about these data is that no systematic effort was made to keep the code clean. The "natural" way of programming therefore seems to favour a coding style in which a few modules contain many side effects and a large number of modules contain very few.
Distribution of dirty functions
The SODBS programming rules actively encourage this coding style, the intention being that more experienced programmers write and test the code that contains side effects. Based on our observation of the SODBS code, one idea is to define explicitly which modules are allowed to contain side effects, and to enforce this as a mandatory quality-control requirement.
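Such a quality-control rule could be checked mechanically. The sketch below assumes a whitelist of modules that are allowed to contain side effects and a per-module count of dirty functions; both the whitelist and the counts are hypothetical.

```python
# Hypothetical whitelist of modules that may contain side effects.
ALLOWED_DIRTY_MODULES = {"session_sup", "port_driver"}

def policy_violations(dirty_counts):
    """Return the modules that contain dirty functions but are not on the whitelist.

    `dirty_counts` maps module name -> number of dirty functions found in it.
    """
    return sorted(
        m for m, n in dirty_counts.items()
        if n > 0 and m not in ALLOWED_DIRTY_MODULES
    )

# Made-up analysis result.
dirty_counts = {"session_sup": 41, "port_driver": 17, "codec": 0, "report": 1}
violations = policy_violations(dirty_counts)
print(violations)
```

A check like this targets exactly the case described above: the many modules that contain only one or two dirty functions.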
Looking a little deeper, the primitives that can introduce side effects are used the following number of times:
put(1743),apply(1532),send(1345),receive(634),erase(235),
process_flag(262),spawn(304),unlink(132),register(172),
spawn_link(134),link(121),unregister(27),open_port(16),
demonitor(13),processes(11),yield(10),halt(8),registered(9),
spawn_opt(5),port_command(4),trace(3),cancel_timer(3),
monitor_node(2).
The most widely used primitive is put, which is used 1743 times.
These usage counts show that many of the primitives on our SODBS "blacklist" were never used at all. The most commonly used primitive that can cause side effects is put; whether it really has a side effect depends on how it is used. One common use of put is to assert a global property of a process for debugging purposes. This use is essentially safe, although no automated analysis program could prove it.
The truly dangerous side effects come from the primitives that change the concurrent structure of the application, so modules that use primitives such as link, unlink, spawn and spawn_link need to be checked carefully.
Even more dangerous is code that calls the halt or processes primitives; I am sure that such code has been checked very carefully.
8.3.1 System architecture
The SODBS code is organized into SODBS supervision trees, so the overall structure of the code can be inferred from the shape of these trees. The internal nodes of a supervision tree are supervisor nodes; the leaf nodes are instances of SODBS behaviours or application-specific processes.
The supervision tree of the SODBS system has 141 nodes and uses 191 instances of the SODBS behaviours. The number of instances of each behaviour is as follows:
sodbs_server(162),sodbs_event(56),supervisor(30),sodbs_fsm(17),
application(8).
sodbs_server is used most, with 162 instances of generic servers, followed by sodbs_event. More significantly, very few generic behaviours are actually needed.
The client-server abstraction (sodbs_server) is so useful that 63% of all generic objects in the system are instances of the client-server behaviour.
In the SODBS libraries, a supervisor starts an application by calling a function given in the so-called child_spec (child specification) of the process it supervises. The child specification contains a good deal of information, including a {Mod, Func, Args} tuple that specifies how to start the supervised process. This way of starting a supervised process is completely general, since the supervisor can start a child with any function. In the case of SODBS this generality is not fully exploited: only three different ways of starting a supervised process are used across all supervision hierarchies. Of these three, one is completely dominant and is used in all but 3 of the supervision trees.
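The role of the {Mod, Func, Args} tuple can be illustrated with a rough Python analogue. This is only a sketch: Python has no supervised processes here, so the "child" is simply whatever the named function returns, and the ChildSpec class and start_child function are names invented for the example.

```python
import importlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ChildSpec:
    """Hypothetical stand-in for the {Mod, Func, Args} part of a child specification."""
    mod: str
    func: str
    args: tuple = ()

def start_child(spec):
    """Resolve Mod and Func by name and call Func(*Args)."""
    module = importlib.import_module(spec.mod)
    return getattr(module, spec.func)(*spec.args)

# Any importable function can serve as the start function.
spec = ChildSpec(mod="collections", func="deque", args=([1, 2, 3],))
child = start_child(spec)
print(child)
```

Because the start function is named by data rather than hard-coded, one supervisor implementation can start any kind of child, which is the generality the text refers to.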
The SODBS architects defined a master supervisor that can be parameterized in a number of standard ways. All SODBS supervisors are packaged as regular SODBS applications, and their behaviour is described in a so-called .app file. Analysing all the .app files of SODBS gives a good overview of the static structure of the SODBS software.
The SODBS software has 172 .app files in total. These 172 files represent 11 independent supervision trees. Most of these supervision trees are very flat and do not have a complex structure.
A simple way to show this structure is to draw the tree in ASCII. For example, here is the tree that handles the "Standby" service, one of the 11 top-level trees mentioned above.
|-- chStandby Standby top application
| |-- stbAts Standby parts of ATS Subsystem
| | |-- aini_sb Protocol termination of AINI, ...
| | |-- cc_sb Call Control. Standby role
| | |-- iisp_sb Protocol termination of IISP ...
| | |-- mdisp_sb Message Dispatcher, standby role
| | |-- pch_sb Permanent Connection Handling ...
| | |-- pnni_sb Protocol termination of PNNI ...
| | |-- reh_sb Standby application for REH
| | |-- saal_sb SAAL, standby application.
| | |-- sbm_sb Standby Manager, start standby role
| | |-- spvc_sb Soft Permanent Connection ...
| | |-- uni_sb Protocol termination of UNI, ...
| |-- stbSws Standby applications - SWS
| | |-- cecpSb Circuit Emulation for SODBS, ...
| | |-- cnh_sb Connection Handling on standby node
| | |-- tecSb Tone & Echo for SODBS, SWS, ...
As we can see, the structure of this tree is simple, flat and shallow. It has only 2 first-level child nodes, and the supervisor structure below them is completely flat.
Note that this presentation shows only the organization of the supervisor nodes. The actual processes that are the leaf nodes of the tree are not shown, and neither is the type of supervision ("and" or "or" supervision).
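A drawing of this kind is easy to generate from a nested description of the supervisors. The following Python sketch assumes the tree is available as nested dictionaries; the data below is a shortened, illustrative subset of the Standby tree, and the render function is a name invented for the example.

```python
def render(name, children, depth=0):
    """Yield one line per supervisor node, indented with '| ' per level."""
    yield "| " * depth + "|-- " + name
    for child_name, grandchildren in children.items():
        yield from render(child_name, grandchildren, depth + 1)

# Shortened, illustrative subset of the Standby tree.
tree = {"stbAts": {"cc_sb": {}, "saal_sb": {}}, "stbSws": {"cnh_sb": {}}}
lines = list(render("chStandby", tree))
print("\n".join(lines))
```

The maximum indentation depth in the output gives a direct impression of how flat or deep a supervision tree is.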
Why are the trees flat rather than deep? The reason reflects an experience gained in practice, namely that "cascading restarts often fail".
Guan Yu, chief software architect of SODBS, found that restarting a failed process with the same arguments often works, but that when this simple restart fails, a cascading restart (restarting the level above it) often does not help.
Interestingly, we have observed that most hardware faults are transient and can be corrected by reinitializing the state of the hardware and retrying the operation. We therefore conjecture that the same holds for software:
I conjecture that there is a similar phenomenon in software: many production faults are subtle. If the program state is reinitialized and the failed operation is retried, the operation usually does not fail the second time.
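The restart policy described here, which retries with the same arguments and does not cascade, can be sketched as follows. This is an illustrative Python analogue and not the SODBS supervisor; the supervise function, the restart limit and the flaky worker are assumptions made for the example.

```python
def supervise(start, args, max_restarts=3):
    """Run start(*args); on failure, start again from scratch with the same arguments.

    Returns (result, restarts). After max_restarts failed retries the error is
    re-raised instead of escalating to a restart of the level above.
    """
    restarts = 0
    while True:
        try:
            return start(*args), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1

# Hypothetical worker with a transient fault: it fails on the first call only.
calls = {"n": 0}

def flaky_worker(x):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient fault")
    return x * 2

result, restarts = supervise(flaky_worker, (21,))
print(result, restarts)
```

Each retry begins with no state carried over from the failed attempt, which is the software counterpart of reinitializing hardware before retrying.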
The generalization of the error handling model proposed in this document, namely trying a simpler strategy when a failure occurs, was only partially used. This partial use was accidental rather than by design: the SODBS libraries themselves, which provide interfaces to the file system and to system-level services such as sockets, were written from the outset to protect the integrity of the system in the event of a failure. So, for example, if the controlling process of a file or socket terminates for any reason, the file or socket is closed automatically.
The level of protection provided by the SODBS library services automatically provides the "simpler level of service" that is the goal of our fault-tolerant processing model.
8.4 Discussion
In a distributed system, even our notion of a failure needs to be revised. It no longer makes sense to talk about the failure of the whole system (since this happens so rarely); what we should talk about instead is a measure of the degradation of the quality of service.
The software systems in our case study are so reliable that the people operating them tend to think they are error-free. This is not the case: software errors did occur at run time, but they were corrected so quickly that nobody noticed them. To obtain accurate statistics about long-term stability, we would have to record the number of times the system starts and stops and use these as a measure of the "health" of the system. Such statistics were not collected, because at the system level the system appeared to be completely healthy.
The vast majority of the code is clean, but the distribution of dirty code is not a true step function (that is, there is no clear dividing line between "this code is bad, treat it with care" and "this code is good") but a spread-out distribution: a small number of modules have many side effects (these do not worry us), while what is more worrying is the large number of modules that contain only one or two side-effecting primitives.
Without a deeper understanding of the code, I cannot say whether this is the root of a problem, nor whether these calls with potential side effects cause problems for the system or are harmless.
In any case, the message is clear. Programming rules alone are not sufficient to make the whole system be written in a particular way. If one wishes to enforce a division of the code into clean and dirty code, tool support must be added and some mandatory policies must be enforced.
Whether this is desirable is debatable. It may be that the best approach is to allow programmers to break the programming rules, in the hope that they know what they are doing when they do so.
9 APIs and protocols
When we write a software module, we need to describe how it is to be used. One way is to define a programming-language API for all the functions that the module exports. To do this we can use the type system described in section 3.9.
This method of defining APIs is very common. The details of the type notation vary from language to language, and the degree to which the type system is enforced by the underlying language implementation varies from system to system. If the type system is strictly enforced, the language is called "strongly typed"; otherwise it is called "untyped". This often causes confusion, since many languages that require type declarations have type systems that are easily violated.
SODBS requires no type declarations but is "type safe", meaning that the underlying type system cannot be violated in a way that would damage the system.
Even though our language is not strongly typed, type declarations serve as valuable documentation, and they can be used as input to a dynamic type checker that performs run-time type checking.
Unfortunately, an API written in the conventional manner is not sufficient to understand the behaviour of a program.
For example, consider the following code fragment:
silly() ->
    {ok, H} = file:open("foo.dat", read),
    file:close(H),
    %% The handle H is already closed when it is read from here.
    file:read_line(H).
According to the type system and the API given in the examples of section 3.9, this program is perfectly legal.
But it is obviously nonsense, since we cannot expect to read from a file that has already been closed.
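The same situation is easy to reproduce in other languages. The Python sketch below is an analogue of silly() and not part of the SODBS code; it uses an in-memory io.StringIO object in place of foo.dat so that it runs without any file on disk. The sequence of calls is well-formed, and the mistake only shows up at run time.

```python
import io

def silly():
    handle = io.StringIO("first line\n")  # stands in for opening foo.dat
    handle.close()
    return handle.readline()              # well-formed call, but the handle is closed

try:
    silly()
    outcome = "read succeeded"
except ValueError as err:
    outcome = f"rejected at run time: {err}"
print(outcome)
```

Nothing in the signatures of close and readline says that one must not follow the other; that ordering information is exactly what a conventional API omits.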
To remedy this problem we can add an extra state parameter. With a fairly obvious notation, the API for file operations can be written as follows:
+type start x file:open(fileName(), read | write) ->
    {ok, fileHandle()} x ready
    | {error, string()} x stop.
+type ready x file:read_line(fileHandle()) ->
    {ok, string()} x ready
    | eof x atEof.
+type atEof | ready x file:close(fileHandle()) ->
    true x stop.
+type atEof | ready x file:rewind(fileHandle()) ->
    true x ready.
This API model has four states: start, ready, atEof and stop. The state start means that the file has not yet been opened. The state ready means that the file is ready to be read, and atEof means that the end of the file has been reached. File operations always begin in the start state and end in the stop state.
The API can now be interpreted as follows: when the file is in the ready state, for example, it is legal to call file:read_line. The call either returns a string, in which case the file remains in the ready state, or it returns eof, in which case the file moves to the atEof state.
In the atEof state we can close or rewind the file; all other operations are illegal. If we rewind the file, it returns to the ready state, and the read_line operation becomes legal again.
Adding state information to the API gives us a way to determine whether a sequence of operations is consistent with the design of the module.
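Checking a sequence of operations against such a stateful API amounts to walking a transition table. The Python sketch below encodes the four states and the four file operations described above; the table layout and the check_trace function are assumptions made for the illustration.

```python
# (state, operation) -> set of states the operation may lead to
API = {
    ("start", "open"): {"ready", "stop"},
    ("ready", "read_line"): {"ready", "atEof"},
    ("ready", "close"): {"stop"},
    ("atEof", "close"): {"stop"},
    ("ready", "rewind"): {"ready"},
    ("atEof", "rewind"): {"ready"},
}

def check_trace(trace, state="start"):
    """Validate a list of (operation, resulting_state) pairs against API.

    Returns the final state, or raises ValueError at the first illegal step.
    """
    for op, new_state in trace:
        allowed = API.get((state, op))
        if allowed is None:
            raise ValueError(f"{op} is not allowed in state {state}")
        if new_state not in allowed:
            raise ValueError(f"{op} cannot move from {state} to {new_state}")
        state = new_state
    return state

good = [("open", "ready"), ("read_line", "atEof"), ("rewind", "ready"), ("close", "stop")]
bad = [("open", "ready"), ("close", "stop"), ("read_line", "ready")]  # reads after close

print(check_trace(good))
```

The second trace, which reads after closing, is rejected at its third step because no operation is defined for the stop state.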
9.1 Protocols
We have seen that we can specify the order in which the operations of an API may be used; the same idea can be applied to the definition of protocols.
Suppose two components communicate by pure message passing. We would like to be able to describe, at some level of abstraction, the protocol of the messages that flow between these two components.
The protocol P between two components A and B can be described by a non-deterministic finite state machine.
Suppose that process B is a file server and that A is a client program that uses this file server. Suppose further that sessions are connection-oriented. The protocol that the file server should obey can then be specified as follows:
+state start x {open, fileName(), read | write} ->
    {ok, fileHandle()} x ready
    | {error, string()} x stop.
+state ready x {read_line, fileHandle()} ->
    {ok, string()} x ready
    | eof x atEof.
+state ready | atEof x {close, fileHandle()} ->
    true x stop.
+state ready | atEof x {rewind, fileHandle()} ->
    true x ready.
This protocol description means that if the file server is in the start state, it can receive a message of type {open, fileName(), read | write}. It then responds either with a message of type {ok, fileHandle()} and moves to the ready state, or with a message of type {error, string()} and moves to the stop state.
If a protocol is described in this way, it is possible to write a simple "protocol checker" program that is placed between two processes that communicate according to the protocol. Fig. 5 shows a protocol checker C placed between processes X and Y.
Fig. 5: Two processes and a protocol checker
When X sends a message Q (a query) to Y, Y responds with a response R and a new state S. The value pair {R, S} can be type-checked against the rules in the protocol description. The protocol checker C sits between X and Y and checks all messages passing between them against the protocol description.
To check the protocol rules, the checker needs access to the state of the server, because the protocol description may contain entries such as:
+state Sn x T1 -> T2 x S2 | T2 x S3
In this case, observing only the type of the reply message T2 is not sufficient to determine whether the next state of the server is S2 or S3.
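A checker of this kind can be sketched in a few lines. In the Python illustration below, the rule table mirrors the file server protocol above, and the checker is handed the new state reported by the server together with the reply, precisely because the reply alone may be ambiguous. The function names and the tuple encoding of messages are assumptions made for the example.

```python
def is_ok(reply):
    return isinstance(reply, tuple) and len(reply) == 2 and reply[0] == "ok"

def is_error(reply):
    return isinstance(reply, tuple) and len(reply) == 2 and reply[0] == "error"

def is_true(reply):
    return reply is True

def is_eof(reply):
    return reply == "eof"

# (state, query tag) -> permitted (reply matcher, next state) alternatives
RULES = {
    ("start", "open"): [(is_ok, "ready"), (is_error, "stop")],
    ("ready", "read_line"): [(is_ok, "ready"), (is_eof, "atEof")],
    ("ready", "close"): [(is_true, "stop")],
    ("atEof", "close"): [(is_true, "stop")],
    ("ready", "rewind"): [(is_true, "ready")],
    ("atEof", "rewind"): [(is_true, "ready")],
}

def check(state, query, reply, new_state):
    """True if (reply, new_state) is a permitted response to `query` in `state`."""
    alternatives = RULES.get((state, query[0]), [])
    return any(matches(reply) and new_state == nxt for matches, nxt in alternatives)

accepted = check("start", ("open", "foo.dat", "read"), ("ok", 7), "ready")
rejected = check("ready", ("read_line", 7), "eof", "ready")  # eof must lead to atEof
print(accepted, rejected)
```

The second call is rejected even though eof is a legal reply to read_line, because the reported next state does not match the rule.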
If we recall the simple generic server example given earlier, the main loop of our program looked like this:
loop(State, Fun) ->
    receive
        {ReplyTo, ReplyAs, Q} ->
            {Reply, State1} = Fun(State, Q),
            %% Send the reply back to the requesting process.
            ReplyTo ! {ReplyAs, Reply},
            loop(State1, Fun)
    end.
This main loop can easily be changed to:
loop(State, S, Fun) ->
    receive
        {ReplyTo, ReplyAs, Q} ->
            {Reply, State1, S1} = Fun(State, S, Q),
            %% Send the reply, tagged with the new interface state S1.
            ReplyTo ! {ReplyAs, S1, Reply},
            loop(State1, S1, Fun)
    end.
Here S and S1 represent the state variables of the protocol description. Note that the state of the interface (the value of the state variable used in the protocol description) is not the same as the state State of the server.
With this change, the generic server becomes one that allows a dynamic protocol checker to be inserted between the client and the server.
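For readers unfamiliar with the notation, the modified loop can be approximated in Python with a thread and two queues standing in for the server process and its mailboxes. This is a loose analogue under those assumptions and not the SODBS implementation; the counter handler and the None stop signal are invented for the example.

```python
import queue
import threading

def loop(state, s, fun, inbox):
    """Repeatedly take (reply_to, reply_as, q) and answer with (reply_as, s1, reply)."""
    while True:
        msg = inbox.get()
        if msg is None:  # stop signal, added for the example only
            return
        reply_to, reply_as, q = msg
        reply, state, s = fun(state, s, q)
        reply_to.put((reply_as, s, reply))

def counter(state, s, q):
    """Hypothetical handler: 'inc' bumps the counter; the interface state stays 'ready'."""
    if q == "inc":
        return state + 1, state + 1, "ready"
    return ("error", "unknown request"), state, s

inbox, outbox = queue.Queue(), queue.Queue()
server = threading.Thread(target=loop, args=(0, "ready", counter, inbox))
server.start()
inbox.put((outbox, "req-1", "inc"))
answer = outbox.get(timeout=5)
inbox.put(None)
server.join(timeout=5)
print(answer)
```

Every reply carries the interface state alongside the value, so a checker placed on the reply path sees both without inspecting the server's internal state.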
9.2 APIs or protocols?
We have shown how to do essentially the same thing in two ways: we can impose a type system on our programming language, or we can impose a contract-checking mechanism between any two components that communicate by message passing. Of these two methods, I prefer the contract checker.
The first reason has to do with the way our systems are organized. In our programming model we use isolated components and pure message passing. Each component is treated as a "black box": from outside the black box it is impossible to see how the computation inside is performed. The only thing that matters is whether the behaviour of the black box conforms to its protocol description.
Inside the black box it may be desirable, for reasons of efficiency or for other programming reasons, to use obscure programming methods, or even to break every rule of common sense and good programming practice. This does not matter in the least, as long as the external behaviour of the system conforms to the protocol description.
With a simple extension, the protocol specification can be extended to cover the non-functional properties of the system. For example, we can add a notion of time to our protocol description language and then write:
+type Si x {operation1, Arg1} ->
    value1() x Sj within T1
    | value2() x Sk after T2
This means that operation1 should return a data structure of type value1() within time T1, or a data structure of type value2() after time T2.
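A checker could enforce the "within" part of such a rule simply by timing the call. The following Python sketch shows one possible way of doing so under that assumption; timed_call is a hypothetical helper, and the operations used are ordinary standard-library functions chosen only to make the example run.

```python
import time

def timed_call(operation, arg, within):
    """Call operation(arg); return the reply and whether it arrived within `within` seconds."""
    started = time.monotonic()
    reply = operation(arg)
    elapsed = time.monotonic() - started
    return reply, elapsed <= within

reply, in_time = timed_call(len, "abc", within=1.0)
_, slow_in_time = timed_call(time.sleep, 0.05, within=0.01)
print(reply, in_time, slow_in_time)
```

This measures the deadline after the reply has arrived; a checker that must act as soon as the deadline passes would need a timeout instead.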
The second reason has to do with where things are done in the system. Placing a contract checker outside a component in no way interferes with the construction of the component itself. It also gives us a flexible way of adding self-testing mechanisms to the system or removing them, so that the system can be checked at run time and configured in a number of different ways.
9.3 Interacting component systems
How does the SODBS system communicate with the outside world? This question becomes interesting when we want to build a distributed system in which SODBS is just one of many interacting components. We see that any PLITS system is built on two basic building blocks, modules and messages. A module is a self-contained entity, like a Simula or Smalltalk class, a SAIL process, or a CLU module. The programming language in which a module is written is unimportant; we want different modules on different machines to be written in entirely different languages.
To build a system of interacting components, the different components must agree on a number of things. We need:
A transport format, and a mapping from language entities to entities in the transport format.
A type system built on the entities of the transport format.
A view description language based on the type system.
For the transport format we use a scheme called UBF (short for Universal Binary Format), which was designed for fast parsing.
9.4 Discussion
I want to return to the central question of this document: how do we build a reliable system in the presence of errors? Stepping outside the system and viewing it as a set of communicating black boxes is a helpful way of thinking.
If we can formally describe the protocol to be obeyed on the communication channel between two black boxes, we can use it as a means of detecting and identifying errors. We can also say precisely which component has failed.
This architecture satisfies requirements R1-R4 of section 2.5, and so, by our reasoning, it can be used to write fault-tolerant systems.
This also agrees with the idea in section 5.1 of "trying to do something that is easier to accomplish". If an implementation of a black-box function fails, we can switch to another, simpler implementation of that function. The protocol-checking mechanism can be used at the meta level to decide which implementation to use, choosing a simpler one when errors occur. When the components reside on physically separate machines, the requirement of strong isolation is met almost automatically.
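The idea of falling back to a simpler implementation when the checker reports an error can be sketched as follows. The Python code is an illustration only: the contract is a plain predicate over the request and reply, and both sort implementations are invented for the example, the first being deliberately broken.

```python
def call_with_fallback(implementations, contract, request):
    """Try implementations from most to least capable and return (name, reply)
    for the first one whose reply satisfies `contract`."""
    for name, impl in implementations:
        try:
            reply = impl(request)
        except Exception:
            continue  # a crash counts as a contract violation
        if contract(request, reply):
            return name, reply
    raise RuntimeError("no implementation satisfied the contract")

def fast_sort(xs):
    """Hypothetical 'full' implementation with a bug: it forgets to sort."""
    return list(xs)

def simple_sort(xs):
    """Simpler fallback implementation."""
    return sorted(xs)

def is_sorted_permutation(request, reply):
    """Contract: the reply must be the request's elements in ascending order."""
    return list(reply) == sorted(request)

used, reply = call_with_fallback(
    [("fast", fast_sort), ("simple", simple_sort)], is_sorted_permutation, [3, 1, 2]
)
print(used, reply)
```

The caller never inspects how either implementation works; the decision to fall back is made purely from observed behaviour at the interface.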
10 Summary
The work described in this document, together with related work done by the developers, has demonstrated a number of things, namely that SODBS and its associated technology work. This is in itself an instructive result. It was once argued that a system like SODBS could not be used for full-scale enterprise financial applications. Yet a number of products of Beijing Property & Credit Guarantee Co Ltd have shown that SODBS is well suited to this kind of product. Not only are these products successful, they have also become leaders in their respective, fiercely competitive markets, which is significant.
A system built on a modified Beowulf-style server cluster (with no fixed control node), combined with strong isolation, lightweight processes and no shared memory, is workable in practice. Such a system has strong self-detection and self-repair capabilities and can be used to design complex, high-load enterprise financial systems. The goal of building a system that behaves reasonably in the presence of software and hardware errors has been achieved.

Claims (5)

1. A self-organizing distributed financial system for continuously processing POS payment and online payment services, characterized in that it adopts a distributed-processing, self-organizing architecture, so that the system remains highly stable and continuously online.
2. The self-organizing distributed financial system according to claim 1, characterized in that the server hardware nodes are connected in a mesh with no specific key node, so that a failure of part of the server hardware or software does not paralyse the system as a whole.
3. The self-organizing distributed financial system according to claim 1, characterized in that software and hardware faults of the system are detected automatically by the system itself and handled automatically and intelligently (for example, by restarting or initializing the faulty node), while alarms are raised by sound and light, SMS and similar means.
4. The self-organizing distributed financial system according to claim 1, characterized in that the processes of the software system are strongly isolated from one another and share no memory, so that errors in individual software modules do not spread through the system, preserving the robustness and stability of the system as a whole.
5. The self-organizing distributed financial system according to claim 1, characterized in that hardware nodes can be added or removed, and the system software upgraded or reconfigured, online, without restarting or interrupting the system.
CN200810223932A 2008-10-10 2008-10-10 Self-organization distribution business system Pending CN101727629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810223932A CN101727629A (en) 2008-10-10 2008-10-10 Self-organization distribution business system

Publications (1)

Publication Number Publication Date
CN101727629A true CN101727629A (en) 2010-06-09

Family

ID=42448493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810223932A Pending CN101727629A (en) 2008-10-10 2008-10-10 Self-organization distribution business system

Country Status (1)

Country Link
CN (1) CN101727629A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581289A (en) * 2012-08-09 2014-02-12 国际商业机器公司 Method and system conducive to service provision and coordination of distributed computing system
US9678801B2 (en) 2012-08-09 2017-06-13 International Business Machines Corporation Service management modes of operation in distributed node service management
CN107634949A (en) * 2017-09-21 2018-01-26 明阳智慧能源集团股份公司 Electric power networks framework Prevention-Security module and its physical node, network defense method
CN108337911A (en) * 2015-04-01 2018-07-27 起元技术有限责任公司 Db transaction is handled in distributed computing system
CN111026091A (en) * 2019-12-27 2020-04-17 中国科学技术大学 Distributed telescope equipment remote control and observation system
CN111311892A (en) * 2020-03-23 2020-06-19 中国建设银行股份有限公司 Bank branch alarm processing method based on Internet of things and branch management center system
CN112015469A (en) * 2020-07-14 2020-12-01 北京淇瑀信息科技有限公司 System reconfiguration method and device and electronic equipment
CN112749042A (en) * 2019-10-31 2021-05-04 北京沃东天骏信息技术有限公司 Application running method and device
CN114817908A (en) * 2022-04-18 2022-07-29 北京凝思软件股份有限公司 Self-isolation method, system, terminal and medium for dual-computer hot standby software

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10389824B2 (en) 2012-08-09 2019-08-20 International Business Machines Corporation Service management modes of operation in distributed node service management
US11223690B2 (en) 2012-08-09 2022-01-11 International Business Machines Corporation Service management modes of operation in distributed node service management
CN103581289A (en) * 2012-08-09 2014-02-12 国际商业机器公司 Method and system conducive to service provision and coordination of distributed computing system
US9678802B2 (en) 2012-08-09 2017-06-13 International Business Machines Corporation Service management modes of operation in distributed node service management
US9749415B2 (en) 2012-08-09 2017-08-29 International Business Machines Corporation Service management roles of processor nodes in distributed node service management
US9762669B2 (en) 2012-08-09 2017-09-12 International Business Machines Corporation Service management roles of processor nodes in distributed node service management
US9678801B2 (en) 2012-08-09 2017-06-13 International Business Machines Corporation Service management modes of operation in distributed node service management
CN103581289B (en) * 2012-08-09 2017-04-12 国际商业机器公司 Method and system conducive to service provision and coordination of distributed computing system
CN108337911A (en) * 2015-04-01 2018-07-27 起元技术有限责任公司 Db transaction is handled in distributed computing system
CN107634949A (en) * 2017-09-21 2018-01-26 明阳智慧能源集团股份公司 Electric power networks framework Prevention-Security module and its physical node, network defense method
CN107634949B (en) * 2017-09-21 2020-02-07 明阳智慧能源集团股份公司 Power network architecture security defense module, physical node thereof and network defense method
CN112749042B (en) * 2019-10-31 2024-03-01 北京沃东天骏信息技术有限公司 Application running method and device
CN112749042A (en) * 2019-10-31 2021-05-04 北京沃东天骏信息技术有限公司 Application running method and device
CN111026091A (en) * 2019-12-27 2020-04-17 中国科学技术大学 Distributed telescope equipment remote control and observation system
CN111026091B (en) * 2019-12-27 2022-09-30 中国科学技术大学 Distributed telescope equipment remote control and observation system
CN111311892A (en) * 2020-03-23 2020-06-19 中国建设银行股份有限公司 Bank branch alarm processing method based on Internet of things and branch management center system
CN112015469A (en) * 2020-07-14 2020-12-01 北京淇瑀信息科技有限公司 System reconfiguration method and device and electronic equipment
CN112015469B (en) * 2020-07-14 2023-11-14 北京淇瑀信息科技有限公司 System reconstruction method and device and electronic equipment
CN114817908A (en) * 2022-04-18 2022-07-29 北京凝思软件股份有限公司 Self-isolation method, system, terminal and medium for dual-computer hot standby software

Similar Documents

Publication Publication Date Title
CN101727629A (en) Self-organization distribution business system
CN101553769B (en) Method and system for tracking and monitoring computer applications
US8407723B2 (en) JAVA virtual machine having integrated transaction management system and facility to query managed objects
Nguyen-Tuong Integrating fault-tolerance techniques in grid applications
Afonso et al. Applying aspects to a real-time embedded operating system
Kienzle Open Multithreaded Transactions: A Transaction Model for Concurrent Object-Oriented Programming
Perkusich et al. Embedding fault-tolerant properties in the design of complex software systems
Garcia et al. An architectural-based reflective approach to incorporating exception handling into dependable software
Garcia et al. A unified meta-level software architecture for sequential and concurrent exception handling
Collet The limits of network transparency in a distributed programming language.
Sharma Modular verification of distributed systems with Grove
Lisbôa A new trend on the development of fault-tolerant applications: software meta-level architectures
Yong Replay and distributed breakpoints in an OSF DCE environment
Romanovsky On structuring cooperative and competitive concurrent systems
Romanovsky et al. Coordinated exception handling in real-time distributed object systems
Xu et al. Supporting and controlling complex concurrency in fault-tolerant distributed systems
Guo Verifying Erlang/OTP components in μ CRL
Navarro et al. Detecting and coordinating complex patterns of distributed events with KETAL
Dabaghchian Static and Dynamic Verification of Distributed Systems
Correia et al. Practical database replication
Edwards An ANSA Analysis of Open Dependable Distributed Computing
Tröger Dependable Systems Software Dependability
Diaz et al. DROL: A distributed and real-time object-oriented logic environment
Hmida et al. Dynamically Adapting Clients to Web Services Changing.
Mancini Reliability issues in the design of distributed object-based architectures

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100609