CN101673354A - Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning - Google Patents


Info

Publication number
CN101673354A
CN101673354A (application CN200910086990A)
Authority
CN
China
Prior art keywords
state
ocm
time
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910086990A
Other languages
Chinese (zh)
Inventor
阮晓钢
郜园园
蔡建羡
陈静
戴丽珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN200910086990A priority Critical patent/CN101673354A/en
Publication of CN101673354A publication Critical patent/CN101673354A/en
Pending legal-status Critical Current


Abstract

The invention provides an operant conditioning automaton model and, based on this model, designs a biomimetic autonomous learning control method. For the problem of controlling systems in the natural world, a biomimetic self-organizing learning approach is used to design an operant conditioning automaton model that can describe, simulate, and design systems with self-organizing functions (including autonomous learning and self-adaptation), so that bionics and psychology are applied effectively to system control. With the operant conditioning automaton model (OCM), an operation (control quantity) is first selected at random according to the current input and state of the system; operations with higher probability values, i.e., those with a better operant tendency, are more likely to be chosen. After the control is applied, the resulting state is observed and the control result is output to the outside. An orientation unit then evaluates the post-control state and modifies the probability values in the rule set, so that behaviors with a better operant tendency are continually acquired and a better behavior is more easily selected the next time. In this way autonomous control is finally realized.

Description

Operant conditioning automaton and its application in biomimetic autonomous learning control
Technical field
The present invention relates to a biomimetic machine based on the operant conditioning principle (Operant Conditioning Automaton, hereinafter abbreviated OCM). It uses computer technology, automatic control technology, bionics, psychology, and biology to realize biomimetic autonomous learning control.
Background art
The present invention is based on Skinner's theory of operant conditioning, which differs from Pavlov's classical conditioning. Classical conditioning is the process in which a response is elicited by a conditioned stimulus; its formula is S → R, the response is innate, and the stimulus acts as a reinforcement presented before the behavior. Operant conditioning, in contrast, first requires a certain operant response, which is then reinforced; its formula is R → S, the response is acquired, the reinforcement appears after the behavior, and its purpose is to make the subject learn the specific behavior desired by the experimenter. Accordingly, Skinner further distinguished two kinds of learning: classical-conditioning learning and operant-conditioning learning. The two kinds are equally important, but the reinforcing stimulus of operant conditioning has a clear purpose and is better suited to teaching the subject a specific behavior.
The automaton model of the present invention is built on the basis of the finite-state automaton. A general finite state machine is a five-tuple FSM = {A, Z, S, f, g}, where (1) A is the finite set of input symbols; (2) S is the finite set of (internal) state symbols, with s(0) ∈ S the initial state; (3) Z is the finite set of output (accepting) symbols; (4) f: S × A → S is the state transition function; and (5) g: S → Z is the output function.
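As an illustration only, a minimal Python sketch of this five-tuple follows; the two-state toggle example and all names in it are illustrative assumptions, not part of the patent.

class FSM:
    def __init__(self, A, S, Z, f, g, s0):
        self.A, self.S, self.Z = A, S, Z   # input, state, and output alphabets
        self.f, self.g = f, g              # f: S x A -> S, g: S -> Z
        self.s = s0                        # initial state s(0)

    def step(self, a):
        """Consume one input symbol, update the state, and return the output."""
        self.s = self.f[(self.s, a)]
        return self.g[self.s]

# Two-state toggle: every "tick" flips the state and outputs it.
fsm = FSM(A={"tick"}, S={"off", "on"}, Z={0, 1},
          f={("off", "tick"): "on", ("on", "tick"): "off"},
          g={"off": 0, "on": 1}, s0="off")
print([fsm.step("tick") for _ in range(4)])   # [1, 0, 1, 0]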
At present, similar invention patents are mainly based on finite-state automata or cellular automata. The cellular automata employed are mainly used to study general phenomena such as information transmission, computation, construction, growth, replication, and competition, but they have not been applied well to simulating the perceptual and cognitive behavior of animals. Examples are application (patent) No. 200610119136.X, entitled "Edge detection algorithm based on cellular automata", and application (patent) No. 200810031543.4, entitled "Multi-agent dynamic multi-target collaborative tracking method based on finite-state automata". No patent concerning an operant conditioning automaton and its applications has yet been found.
The present invention proposes an operant conditioning automaton model and, based on this model, designs a biomimetic autonomous learning control method. The objective of the invention is to show experimentally, with the Skinner pigeon experiment, that the method reproduces the operant conditioning learning mechanism, and to confirm, with the inverted pendulum control problem, the feasibility of using the method for model-free control of certain continuous-state control systems.
Summary of the invention
The present invention differs from traditional control methods in that it is based on the operant conditioning learning mechanism. Following the principle of the automaton, and aimed at the Skinner pigeon experiment and the inverted pendulum balance-control problem, it uses a biomimetic self-organizing (including self-learning and self-adaptive) learning method to design an operant conditioning automaton model that can be used to describe, simulate, and design systems with self-organizing (including self-learning and self-adaptive) functions, thereby effectively applying bionics, psychology, and biology to system control and realizing biomimetic autonomous learning control.
The operant conditioning automaton of the present invention is an eight-tuple
OCM = <A, S, O, Z, R, f, ψ, δ>,
where
(1) the input symbol set of the OCM: A = {a_j | j = 0, 1, 2, ..., n_A}, a_j being the j-th input symbol of the OCM;
(2) the internal state set of the OCM: S = {s_i | i = 0, 1, 2, ..., n_S}, s_i being the i-th state symbol of the OCM;
(3) the internal operation set of the OCM: O = {o_k | k = 1, 2, ..., n_O}, o_k being the k-th operation symbol of the OCM;
(4) the output symbol set of the OCM: Z = {z_m | m = 0, 1, 2, ..., n_Z}, z_m being the m-th output symbol of the OCM;
(5) the rule set of the OCM: R = {r_ijk | i ∈ {0, 1, 2, ..., n_S}; j ∈ {0, 1, 2, ..., n_A}; k ∈ {1, 2, ..., n_O}}, each element r_ijk ∈ R representing a random "condition-operation" rule
r_ijk: s_i × a_j → o_k (p_ijk)
i.e., when the OCM is in state s_i (∈ S) with input a_j (∈ A), it executes operation o_k (∈ O) with probability p_ijk, where p_ijk = p(o_k | s_i ∩ a_j) is the probability that the OCM executes operation o_k in state s_i with input a_j, also called the excitation probability of rule r_ijk.
(6) the state space equations of the OCM:
f:  f_S: S(t) × A(t) × O(t) → S(t+1)
    f_Z: S(t) × A(t) × O(t) → Z(t+1)
where f_S is the state transition equation of the OCM: the state s(t+1) (∈ S) of the OCM at time t+1 is determined by the state s(t) (∈ S), the input a(t) (∈ A), and the operation o(t) (∈ O) at time t, and is independent of the states, inputs, and operations before time t; moreover, f_S may be unknown, but the result of a state transition is observable by the OCM itself. f_Z is the output equation of the OCM: the output z(t+1) (∈ Z) at time t+1 is determined by the state s(t), the input a(t), and the operation o(t) at time t, and is independent of the states, inputs, and operations before time t; the output of the OCM is observable by the outside world.
(7) the state orientation function of the OCM: ψ: S × A → [h, q], where h is defined as the worst orientation function value and q as the best orientation function value (orientation is defined here in the biological sense: the direction in which the environment drives biological evolution, i.e., the orientation of the organism). The values of h and q can be chosen according to the concrete object being handled. For any state s_i (∈ S) and input a_j (∈ A), ψ_ij = ψ(s_i, a_j) is the expectation value of the OCM for state s_i and input a_j; if ψ_ij < 0, s_i is called a negatively oriented state of the OCM under input a_j; if ψ_ij = 0, a zero-oriented state; if ψ_ij > 0, a positively oriented state.
(8) the operant conditioning learning rule δ of the OCM:
If the state of the OCM at time t is s(t) = s_a ∈ S and the input is a(t) = a_b ∈ A, an operation o(t) = o_c ∈ O is chosen according to the random "condition-operation" rules in the set R, and after the operation is executed the state at time t+1 is observed to be s(t+1) = s_d ∈ S. Then, based on the operant conditioning principle, the excitation probabilities p_abk (k = 1, 2, ..., n_O) of the random "condition-operation" rules in R are adjusted according to
δ:  p_abk(t+1) = p_abk(t) - ξ(Δψ_abc)·p_abk(t)   for all k ≠ c
    p_abk(t+1) = maxmin(p_abk(t+1), 0, 1)        for all k ≠ c
    p_abc(t+1) = 1 - Σ_{k≠c} p_abk(t+1)
where Δψ_abc = ψ(s_d, a_b) - ψ(s_a, a_b) is the change of the orientation function value after the OCM, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O) and the state changes to s_d (∈ S); this change is used to judge the quality of the operation. ξ(·) is a monotonically increasing function with ξ(x) = 0 if and only if x = 0; it is parameterized by r, the total number of operation rules, and λ, the learning rate, i.e., the speed of each learning iteration. p_abc(t) (a ∈ {0, 1, 2, ..., n_S}; b ∈ {0, 1, 2, ..., n_A}; c ∈ {1, 2, ..., n_O}) is the value at time t of the probability p(o_c | s_a ∩ a_b) that the OCM, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O). When Δψ_abc < 0, the orientation function value after executing o_c (∈ O) and transferring to s_d (∈ S) has become smaller, i.e., the orientation has worsened; then p_abc(t+1) < p_abc(t), meaning that the probability of choosing o_c (∈ O) at the next moment decreases. When Δψ_abc = 0, the orientation function value is unchanged, i.e., the orientation is unchanged; then p_abc(t+1) = p_abc(t), and the probability of choosing o_c (∈ O) at the next moment is unchanged. When Δψ_abc > 0, the orientation function value has become larger, i.e., the orientation has improved; then p_abc(t+1) > p_abc(t), and the probability of choosing o_c (∈ O) at the next moment increases. Here maxmin(p_abk(t+1), 0, 1) sets p_abk(t+1) = 1 when p_abk(t+1) > 1 and p_abk(t+1) = 0 when p_abk(t+1) < 0, which guarantees p_abk(t+1) ∈ [0, 1], and Σ_{k=1}^{n_O} p_abk(t+1) = 1, i.e., the probabilities of the different operations under the same input and the same state sum to 1. When t → ∞, if p_abc(t) → 1, then operation o_c (∈ O) is the optimal behavior in state s_a (∈ S) with input a_b (∈ A). In general a learning iteration count Tf or an optimal-behavior probability threshold p_ε is given, and learning stops when the iteration count is reached or when, in some state s_a (∈ S) with input a_b (∈ A), the probability p_abc(t) of executing operation o_c (∈ O) satisfies p_abc(t) ≥ p_ε; p_ε ∈ [0.7, 1] is set according to the actual system environment and is usually taken as p_ε = 0.9.
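As an illustration of the learning rule δ, a minimal Python sketch follows. The exact form of ξ is only characterized above (monotonically increasing, zero only at zero); the linear form λ·x/r used here, and all function and variable names, are assumptions for illustration rather than the patent's reference implementation.

import numpy as np

def xi(dpsi, lam=0.05, r=3):
    """Assumed form of xi: monotonically increasing and xi(0) = 0."""
    return lam * dpsi / r

def delta_update(p_ab, c, dpsi, lam=0.05, r=3):
    """One application of the rule delta to the excitation probabilities.

    p_ab : array of p(o_k | s_a, a_b) over the operations k
    c    : index of the operation that was actually executed
    dpsi : change of the orientation function value caused by the transition
    """
    p = p_ab.astype(float)
    for k in range(len(p)):
        if k != c:
            # unselected operations move opposite to the orientation change,
            # clipped to [0, 1] as maxmin(., 0, 1) prescribes
            p[k] = np.clip(p[k] - xi(dpsi, lam, r) * p[k], 0.0, 1.0)
    p[c] = np.clip(1.0 - (p.sum() - p[c]), 0.0, 1.0)   # renormalize the executed rule
    return p / p.sum()                                 # guard so the row stays a distribution

# If the orientation improved (dpsi > 0), the executed operation gains probability.
print(delta_update(np.array([1/3, 1/3, 1/3]), c=0, dpsi=1.0))

With dpsi = +1 and three rules, the executed operation's probability rises from 1/3 to about 0.344 while the other two fall slightly, matching the qualitative behavior described above.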
An important feature of the present invention is that it simulates the biological operant conditioning mechanism and therefore has biomimetic self-organizing functions, including self-learning and self-adaptation, and can be used to describe, simulate, and design various self-organizing systems.
The technical scheme of the present invention is shown in Fig. 1 and Fig. 2.
The steps of the method of the present invention are as follows:
(1) Set the initial conditions of the experiment. Give the initial state s(0) of the OCM, the initial input a(0) of the OCM, the learning rate λ, and the initial excitation probability p_ijk(0) = 1/r of each random "condition-operation" rule r_ijk (i ∈ {0, 1, 2, ..., n_S}; j ∈ {0, 1, 2, ..., n_A}; k ∈ {1, 2, ..., n_O}) in R; give the number of learning iterations Tf or the optimal-behavior probability threshold p_ε. λ, Tf, and p_ε are determined by the experimental requirements and the environment; typically λ = 0.05, Tf = 1000, p_ε = 0.9.
(2) Randomly select and execute an operation. According to the state s(t) ∈ S and input a(t) ∈ A of the OCM at time t and the values p_ijk(t) at time t of the excitation probabilities of the random "condition-operation" rules r_ijk (i ∈ {0, 1, 2, ..., n_S}; j ∈ {0, 1, 2, ..., n_A}; k ∈ {1, 2, ..., n_O}) in R, randomly select an operation o(t) ∈ O at time t following the probability distribution p_ijk(t) of the operations under the current state. If the state of the OCM at time t is s(t) = s_a, the input is a(t) = a_b, and the chosen operation at time t is o(t) = o_c, the state of the OCM makes a transition according to the state transition equation f_S: S(t) × A(t) × O(t) → S(t+1).
(3) Operant conditioning. If the state s(t+1) = s_d ∈ S is observed at time t+1, the operant conditioning unit δ adjusts the excitation probability of the random "condition-operation" rule r_abc; its value at time t+1 is
δ:  p_abk(t+1) = p_abk(t) - ξ(Δψ_abc)·p_abk(t)   for all k ≠ c
    p_abk(t+1) = maxmin(p_abk(t+1), 0, 1)        for all k ≠ c
    p_abc(t+1) = 1 - Σ_{k≠c} p_abk(t+1)
where maxmin(p_abk(t+1), 0, 1) sets p_abk(t+1) = 1 when p_abk(t+1) > 1 and p_abk(t+1) = 0 when p_abk(t+1) < 0, guaranteeing p_abk(t+1) ∈ [0, 1], and Σ_{k=1}^{n_O} p_abk(t) = 1.
(4) Output Z(t+1) to the outside world through the output equation f_Z: S(t) × A(t) × O(t) → Z(t+1) of the system.
(5) Repeat steps (2)-(4) until the number of learning iterations Tf is reached or p_abc(t+1) > p_ε, at which point the experiment stops; a sketch of this loop is given below.
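A minimal Python sketch of steps (1)-(5) as a generic loop follows, reusing the delta_update sketch above; the env_step callback (which plays the role of f_S and also supplies the next input) and all names are assumptions for illustration only.

import numpy as np

def ocm_learn(env_step, orientation, n_states, n_inputs, n_ops,
              s0=0, a0=0, lam=0.05, Tf=1000, p_eps=0.9, seed=0):
    """Generic OCM learning loop, a sketch of steps (1)-(5).

    env_step(s, a, c) -> (s_next, a_next): plays the role of f_S and also
        supplies the next input, which the patent leaves to the environment.
    orientation(s, a): the state orientation function psi.
    """
    rng = np.random.default_rng(seed)
    P = np.full((n_states, n_inputs, n_ops), 1.0 / n_ops)   # step (1): p_ijk(0) = 1/r
    s, a = s0, a0
    for t in range(Tf):
        c = rng.choice(n_ops, p=P[s, a])                    # step (2): draw an operation
        s_next, a_next = env_step(s, a, c)
        dpsi = orientation(s_next, a) - orientation(s, a)
        P[s, a] = delta_update(P[s, a], c, dpsi, lam=lam, r=n_ops)   # step (3)
        if P[s, a].max() >= p_eps:                          # step (5): a behavior dominates
            break
        s, a = s_next, a_next                               # step (4): output / advance
    return P

The two embodiments below differ only in the env_step, orientation, and parameter choices that would be passed to such a loop.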
The flow chart of the method of the present invention is shown in Fig. 3.
The advantage of the present invention is that it can simulate and imitate the adaptability of natural life to changing circumstances, giving machine life thinking, memory, and learning capabilities. A machine life form with cognitive behavior or cognitive ability can thus not only change its own behavior but also improve it, making the behavior of machine life more biomimetic and more intelligent.
Description of drawings
Fig. 1 is a structural diagram of the present invention.
In Fig. 1: 1 input symbol set, 2 internal state set, 3 internal operation set, 4 output symbol set, 5 random "condition-operation" rule set, 6 state space unit, 7 state orientation function, 8 operant conditioning learning rule.
Fig. 2 is the application structure block diagram of the present patent.
Fig. 3 is the flow chart of the method of the present patent.
Fig. 4 shows the counts of the three behaviors in the Skinner pigeon experiment: 1 red button, 2 yellow button, 3 blue button.
Fig. 5 shows the iterative-learning simulation of the Skinner pigeon experiment over 1000 training iterations: (a) the behavior counts and (b) the evolution of the operation probabilities. 1 red button, 2 yellow button, 3 blue button.
Fig. 6 (a) is the deflection-angle curve and Fig. 6 (b) the deflection angular-velocity curve of the inverted pendulum balance-control experiment with the deterministic model.
Fig. 7 (a) is the deflection-angle curve and Fig. 7 (b) the deflection angular-velocity curve of the inverted pendulum balance-control experiment with the stochastic model.
Embodiments
Embodiment one: the Skinner operant conditioning pigeon experiment, shown in Figs. 4 and 5.
The training goal of the Skinner pigeon experiment is to make the pigeon learn the operant behavior of pecking the red button. When it pecks the red button it obtains food (a positive reinforcement stimulus); when it pecks the yellow button there is no stimulus; when it pecks the blue button it receives an electric shock (a negative reinforcement stimulus). The experiment is carried out with the Skinner operant conditioning automaton model, as shown in Figs. 1, 2, and 3.
First, a simplified discrete mathematical model of the pigeon experiment is given. Suppose the pigeon has three states: the hungry state, the half-hungry state, and the not-hungry state. When the pigeon is hungry, giving it food moves it to the half-hungry state, while giving it no food or giving it an electric shock leaves it hungry; the output is the pigeon's state at that moment. When the pigeon is half-hungry, giving it food moves it to the not-hungry state, while giving it no food or an electric shock moves it back to the hungry state; the output is the pigeon's state at that moment. When the pigeon is not hungry, giving it food keeps it not hungry, giving it no food moves it to the half-hungry state, and giving it an electric shock moves it to the hungry state; the output is the pigeon's state at that moment. The state transition equation f_S: S(t) × A(t) × O(t) → S(t+1) of the model is therefore:
f(s_0, a_0, o_1) = s_1    f(s_1, a_0, o_1) = s_2    f(s_2, a_0, o_1) = s_2
f(s_0, a_1, o_2) = s_0    f(s_1, a_1, o_2) = s_0    f(s_2, a_1, o_2) = s_1
f(s_0, a_2, o_3) = s_0    f(s_1, a_2, o_3) = s_0    f(s_2, a_2, o_3) = s_0
The input symbol set of the pigeon experiment is A = {a_0, a_1, a_2}, where a_0 means giving the pigeon food when it pecks the red button, a_1 means giving no food when it pecks the yellow button, and a_2 means an electric-shock stimulus when it pecks the blue button. The state set is S = {s_0, s_1, s_2}, where s_0 is the hungry state, s_1 the half-hungry state, and s_2 the not-hungry state. The operation set is O = {o_1, o_2, o_3}, where o_1 means the pigeon pecks the red button, o_2 the yellow button, and o_3 the blue button; at the beginning the pigeon pecks the red, yellow, and blue buttons at random. The rule set R is r_ijk: s_i × a_j → o_k (p_ijk), i.e., when the pigeon is in state s_i (∈ S) with input a_j (∈ A) it executes operation o_k (∈ O) with probability p_ijk, where p_ijk = p(o_k | s_i ∩ a_j) is the probability of executing o_k in state s_i with input a_j, also called the excitation probability of rule r_ijk. The discrete state orientation function of the pigeon is set to ψ: S × A → {-1, 0, 1, 2, 3}, with the concrete values:
ψ_00(s_0, a_0) = 1    ψ_10(s_1, a_0) = 2    ψ_20(s_2, a_0) = 3
ψ_01(s_0, a_1) = 0    ψ_11(s_1, a_1) = 1    ψ_21(s_2, a_1) = 2
ψ_02(s_0, a_2) = -1   ψ_12(s_1, a_2) = 0    ψ_22(s_2, a_2) = 1
ψ_00(s_0, a_0) = 1 means that when the pigeon is hungry and is given food, its state orientation function value is relatively large, namely 1;
ψ_01(s_0, a_1) = 0 means that when the pigeon is hungry and is given no food, its state orientation function value is 0;
ψ_02(s_0, a_2) = -1 means that when the pigeon is hungry and is given an electric shock, its state orientation function value is relatively small, namely -1.
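For illustration, this simplified pigeon model can be written as two lookup tables; the Python sketch below repeats the values above, and the dict representation itself is an assumption.

# f_S(s, a, o) -> next state; states: 0 hungry, 1 half-hungry, 2 not hungry;
# inputs: 0 food, 1 nothing, 2 shock; operations: 1 red, 2 yellow, 3 blue.
F = {(0, 0, 1): 1, (1, 0, 1): 2, (2, 0, 1): 2,   # peck red    -> food
     (0, 1, 2): 0, (1, 1, 2): 0, (2, 1, 2): 1,   # peck yellow -> nothing
     (0, 2, 3): 0, (1, 2, 3): 0, (2, 2, 3): 0}   # peck blue   -> shock

# state orientation function psi(s_i, a_j)
PSI = {(0, 0): 1, (1, 0): 2, (2, 0): 3,
       (0, 1): 0, (1, 1): 1, (2, 1): 2,
       (0, 2): -1, (1, 2): 0, (2, 2): 1}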
The basic steps of the pigeon experiment are as follows:
(1) Set the initial conditions of the experiment. The initial input is set to giving the pigeon food, a_0, and the initial state is set to the hungry state s_0. The initial probability of the pigeon pecking each of the three buttons is set to 1/3, i.e., at the beginning the pigeon has an equal chance of pecking each button. The learning rate is λ = 0.05 and the optimal-behavior probability threshold is p_ε = 0.97.
(2) Randomly select and execute an operation. Suppose at time t the observed state of the pigeon is s_a ∈ S, the input is a_b ∈ A, and the state orientation function value is ψ_ab ∈ ψ. According to the random "condition-operation" rules in R, an operation o_c ∈ O is chosen following the probability distribution p_ijk(t) of the operations at time t. After o_c ∈ O is executed, the state of the pigeon makes a transition according to the state transition function f_S: S(t) × A(t) × O(t) → S(t+1), i.e., according to
f(s_0, a_0, o_1) = s_1    f(s_1, a_0, o_1) = s_2    f(s_2, a_0, o_1) = s_2
f(s_0, a_1, o_2) = s_0    f(s_1, a_1, o_2) = s_0    f(s_2, a_1, o_2) = s_1
f(s_0, a_2, o_3) = s_0    f(s_1, a_2, o_3) = s_0    f(s_2, a_2, o_3) = s_0
Here f(s_0, a_0, o_1) = s_1 means that when the pigeon is hungry, chooses to peck the red button, and is given food, its state changes to half-hungry; f(s_1, a_1, o_2) = s_0 means that when the pigeon is half-hungry, chooses the yellow button, and is given no food, its state changes to hungry; f(s_2, a_2, o_3) = s_0 means that when the pigeon is not hungry, chooses the blue button, and is given an electric shock, its state changes to hungry.
The output function is defined as f_Z: z_m = s_i, m = i, i ∈ {0, 1, 2}, with output set Z = {z_0, z_1, z_2}, z_0 = s_0, z_1 = s_1, z_2 = s_2. After the pigeon makes its transition at time t, its state at time t+1 is s_d ∈ S, and the state orientation function value of the pigeon at time t+1 is ψ_db ∈ ψ.
(3) Operant conditioning. According to the change of the pigeon's state orientation function value ψ, i.e., Δψ_abc = ψ(s_d, a_b) - ψ(s_a, a_b), the operant conditioning unit δ adjusts the excitation probability of the random "condition-operation" rule r_abc. Here δ is
δ:  p_abk(t+1) = p_abk(t) - ξ(Δψ_abc)·p_abk(t)   for all k ≠ c
    p_abk(t+1) = maxmin(p_abk(t+1), 0, 1)        for all k ≠ c
    p_abc(t+1) = 1 - Σ_{k≠c} p_abk(t+1)
where Δψ_abc is the change of the orientation function value after the pigeon, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O) and the state changes to s_d (∈ S); this change is used to judge the quality of the operation. ξ(·) is a monotonically increasing function with ξ(x) = 0 if and only if x = 0; r is the total number of operation rules and λ is the learning rate, i.e., the speed of each learning iteration; here r = 3 and λ = 0.05. p_abc(t) (a ∈ {0, 1, 2, ..., n_S}; b ∈ {0, 1, 2, ..., n_A}; c ∈ {1, 2, ..., n_O}) is the value at time t of the probability p(o_c | s_a ∩ a_b) that the pigeon, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O). When Δψ_abc < 0, the orientation function value after executing o_c (∈ O) and transferring to s_d (∈ S) has become smaller, i.e., the orientation has worsened; then p_abc(t+1) < p_abc(t), meaning that the probability of choosing o_c (∈ O) at the next moment decreases, i.e., this behavior becomes less likely to be selected under this input and state. When Δψ_abc = 0, the orientation function value is unchanged, i.e., the orientation is unchanged; then p_abc(t+1) = p_abc(t), and the probability of choosing o_c (∈ O) at the next moment is unchanged. When Δψ_abc > 0, the orientation function value has become larger, i.e., the orientation has improved; then p_abc(t+1) > p_abc(t), and the probability of choosing o_c (∈ O) at the next moment increases.
Specifically, if at time t the pigeon is in the hungry state s_0 and, following the excitation probability p_001(t) = p(o_1 | s_0 ∩ a_0) = 0.55 in the rule set R, it chooses the operation o_1 of pecking the red button and is given food a_0, then by the pigeon's state transition equation f(s_0, a_0, o_1) = s_1 its state at the next moment becomes the half-hungry state s_1. The orientation function value at this moment, ψ_10(s_1, a_0) = 2, is larger than the previous hungry-state value ψ_00(s_0, a_0) = 1, so p_001(t+1) > p_001(t), and the probability of choosing to peck the red button increases at the next learning step.
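This step can be checked numerically with the sketches above; the two unselected probabilities (0.25 and 0.20) are assumed here, since only p_001(t) = 0.55 is given in the text.

import numpy as np

dpsi = PSI[(1, 0)] - PSI[(0, 0)]                   # psi_10 - psi_00 = 2 - 1 = +1
p_new = delta_update(np.array([0.55, 0.25, 0.20]), c=0, dpsi=dpsi)
print(p_new[0] > 0.55)                             # True: pecking red is reinforced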
Here maxmin(p_abk(t+1), 0, 1) sets p_abk(t+1) = 1 when p_abk(t+1) > 1 and p_abk(t+1) = 0 when p_abk(t+1) < 0, which guarantees p_abk(t+1) ∈ [0, 1], and Σ_{k=1}^{n_O} p_abk(t+1) = 1, i.e., the probabilities of the different operations under the same input and the same state sum to 1. When t → ∞, if p_abc(t) → 1, then operation o_c (∈ O) is the optimal behavior in state s_a (∈ S) with input a_b (∈ A).
(4) Output to the outside world. The output function is defined as f_Z: z_m = s_i, i = 0, 1, 2, m = i. The state at time t+1 is output externally according to the output set Z = {z_0, z_1, z_2}, z_0 = s_0, z_1 = s_1, z_2 = s_2.
(5) Judge whether the stop condition of the experiment is reached. When p_abc = p(o_c | s_a ∩ a_b) > p_ε, the pigeon is considered to have learned an optimal operant behavior, and from then on it keeps selecting this optimal behavior under this state and input until the iteration count Tf is reached. Otherwise steps (2)-(4) are repeated until the condition is satisfied; a sketch of running this procedure on the simplified pigeon model is given below.
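A Python sketch of the run follows, reusing the F and PSI tables and the ocm_learn and delta_update sketches above; mapping the next input to the consequence of the pecked button (red gives food a_0, yellow gives a_1, blue gives the shock a_2) is an assumption made here to close the loop.

def pigeon_step(s, a, c):
    """One environment step: peck button c, receive its consequence as the next input."""
    a_next = c                      # consequence index matches the button index
    o = c + 1                       # operations are numbered o_1..o_3 in the tables
    return F[(s, a_next, o)], a_next

P = ocm_learn(pigeon_step, lambda s, a: PSI[(s, a)],
              n_states=3, n_inputs=3, n_ops=3, lam=0.05, Tf=1000, p_eps=0.97)
print(P[0, 0])                      # the probability of pecking red should come to dominate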
The results show that, with the above operant conditioning automaton model, after some time the number of times the pigeon pecks the red button is clearly higher than the number of times it pecks the other two buttons; see Fig. 4. Fig. 5 shows the iterative-learning simulation of the pigeon experiment, from which the formation of the pigeon's operant conditioning learning can be seen.
Embodiment two: the balance-control experiment of a single inverted pendulum, shown in Figs. 6 and 7.
The goal of inverted pendulum control is to apply a force u (the control quantity) to the cart base, i.e., an element of the operation symbol set O with u = o_k, k = 1, 2, ..., n_O, so that the pole does not fall over, that is, so that it never exceeds a pre-defined range of deviation from the vertical. The control experiment uses the Skinner operant conditioning automaton model, as shown in Figs. 1, 2, and 3.
The inverted pendulum can be described by the following equation of motion:
θ̈ = [m(m+M)gl / ((M+m)I + Mml²)]·θ - [ml / ((M+m)I + Mml²)]·u
where I = (1/12)mL² and l = (1/2)L. Substituting u = o_k, k = 1, 2, ..., n_O, into this equation gives
θ̈ = [m(m+M)gl / ((M+m)I + Mml²)]·θ - [ml / ((M+m)I + Mml²)]·o_k,  k = 1, 2, ..., n_O.
Using the Euler method for numerical approximation, the inverted pendulum system can be simulated with the difference equations
θ(t+1) = θ(t) + τ·θ̇(t)
θ̇(t+1) = θ̇(t) + τ·θ̈(t)
where the time step τ is generally set to 0.02 seconds; the inverted pendulum system given above is obviously a deterministic system. To show that the method based on the operant conditioning automaton model is equally applicable to model-free control of continuous stochastic systems, i.e., that f_S may be unknown, a noise signal is introduced into the above deterministic model to form a stochastic inverted pendulum model; in the simulation the second difference equation above is replaced by
θ̇(t+1) = θ̇(t) + τ·θ̈(t) + d
where d is a random noise, here uniformly distributed on [-1.5, 1.5].
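For illustration, a Python sketch of these dynamics follows. The parameter values repeat those given later in this embodiment (g = 9.8 m/s², cart mass 1.0 kg, pole mass 0.1 kg, pole half-length 0.5 m); treating L in I = mL²/12 as the full pole length of 1 m, and all function and variable names, are assumptions.

import numpy as np

g, M, m = 9.8, 1.0, 0.1        # gravity, cart mass, pole mass
L = 1.0                        # full pole length, taking the stated half-length 0.5 m
l, I = L / 2, m * L**2 / 12    # as defined above: l = L/2, I = (1/12) m L^2
tau = 0.02                     # integration step in seconds

def theta_ddot(theta, u):
    """Linearized angular acceleration of the pole under control force u."""
    denom = (M + m) * I + M * m * l**2
    return (m * (m + M) * g * l / denom) * theta - (m * l / denom) * u

def pendulum_step(theta, theta_dot, u, noise=0.0, rng=None):
    """One Euler step; with noise > 0 this is the stochastic model above."""
    d = rng.uniform(-noise, noise) if (rng is not None and noise > 0) else 0.0
    theta_next = theta + tau * theta_dot
    theta_dot_next = theta_dot + tau * theta_ddot(theta, u) + d
    return theta_next, theta_dot_next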
The output function is defined as f_Z: z_m = s_i, m = i, i ∈ {0, 1, 2}, with output set Z = {z_0, z_1, z_2}, z_0 = s_0, z_1 = s_1, z_2 = s_2.
The input symbol set of the inverted pendulum experiment is A = {a_0}, where the input consists of θ, the angle by which the pole deviates from the vertical, and θ̇, the angular velocity of that deviation. The state set is S = {s_0, s_1, s_2}, where s_0 means the inverted pendulum is poorly controlled, s_1 moderately controlled, and s_2 well controlled. The output set is Z = {z_0, z_1, z_2}, where z_0 means the control effect of the inverted pendulum is poor, z_1 moderate, and z_2 good, i.e., the control requirement is met. The operation set is O = {o_1, o_2, o_3}, where o_1 means applying a force to the right on the cart base, o_2 applying a force close to zero, and o_3 applying a force to the left. The rule set R is r_ijk: s_i × a_j → o_k (p_ijk), i.e., when the inverted pendulum is in state s_i (∈ S) with input a_j (∈ A) it executes operation o_k (∈ O) with probability p_ijk, where p_ijk = p(o_k | s_i ∩ a_j) is the probability of executing o_k in state s_i with input a_j, also called the excitation probability of rule r_ijk. The state orientation function is ψ: S × A → {0, 1, 2}, with ψ_00(s_0, a_0) = 0, ψ_10(s_1, a_0) = 1, ψ_20(s_2, a_0) = 2.
The basic steps of the inverted pendulum control experiment are as follows:
(1) Set the initial conditions of the experiment. The gravitational acceleration is g = 9.8 m/s², the cart mass is M = 1.0 kg, the pole mass is m = 0.1 kg, and the pole half-length is L = 0.5 m. The deflection-angle range is set to θ ∈ [-0.1, +0.1], together with a corresponding range for the angular velocity θ̇. The deflection angle is defined here to be positive when the pendulum leans to the left and negative when it leans to the right; likewise, the angular velocity is positive when directed to the left and negative when directed to the right. The initial input is a_0(0), taking θ(0) = 5° = 0.087, where the angle value is converted to radians, together with the corresponding initial angular velocity. The initial state is s(0) = s_0: the state is s_0 (poorly controlled) when θ ∈ [-0.1, -0.03] or θ ∈ [+0.03, +0.1], s_1 (moderately controlled) when θ ∈ (-0.03, -0.005) or θ ∈ (+0.005, +0.03), and s_2 (well controlled) when θ ∈ [-0.005, +0.005]; this discretisation is sketched in code below. The three operating forces of the inverted pendulum, i.e., the control quantities, are O = {o_1, o_2, o_3} = {-5, 0.1, 5}; the initial probability of selecting each of the three forces is 1/3; the number of learning iterations is Tf = 1000; the learning rate is λ = 0.02; the total number of operation rules is r = 3; and the optimal-behavior probability threshold is p_ε = 0.95.
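The state discretisation and the operation set of step (1) can be written as a small helper; the bin edges and forces repeat those above, while the function itself is an assumption for illustration.

def pendulum_state(theta):
    """Map the deflection angle (rad) to the OCM state index."""
    if abs(theta) <= 0.005:
        return 2           # s_2: well controlled
    if abs(theta) < 0.03:
        return 1           # s_1: moderately controlled
    return 0               # s_0: poorly controlled

FORCES = [-5.0, 0.1, 5.0]  # operation set O = {o_1, o_2, o_3} as control forces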
(2) Randomly select and execute an operation. Suppose at time t the observed state of the inverted pendulum is s_a ∈ S, the input is a_b ∈ A, and the state orientation function value is ψ_ab ∈ ψ. An operation o_c ∈ O is chosen according to the random "condition-operation" rules in R, and after the operation is executed the state makes a transition according to the state transition equation f_S: S(t) × A(t) × O(t) → S(t+1).
The inverted pendulum state transition can be described by the equation of motion
θ̈ = [m(m+M)gl / ((M+m)I + Mml²)]·θ - [ml / ((M+m)I + Mml²)]·o_k,  k = 1, 2, ..., n_O,
and the inverted pendulum system is simulated with the difference equations
θ(t+1) = θ(t) + τ·θ̇(t)
θ̇(t+1) = θ̇(t) + τ·θ̈(t)
with the time step τ generally set to 0.02 seconds; the system given above is obviously deterministic. The stochastic inverted pendulum model can also be used for the simulation, in which case the second difference equation is replaced by
θ̇(t+1) = θ̇(t) + τ·θ̈(t) + d
where d is a random noise, here uniformly distributed on [-1.5, 1.5].
The output function is defined as f_Z: z_i = s_i, i = 0, 1, 2, with output set Z = {z_0, z_1, z_2}, z_0 = s_0, z_1 = s_1, z_2 = s_2. After the inverted pendulum moves according to its equation of motion at time t, its state at time t+1 is s_d ∈ S, and the state orientation function value of the inverted pendulum at time t+1 is ψ_db ∈ ψ.
(3) Operant conditioning. According to the change of the inverted pendulum's state orientation function value ψ, i.e., Δψ_abc = ψ(s_d, a_b) - ψ(s_a, a_b), the operant conditioning unit δ adjusts the excitation probability of the random "condition-operation" rule r_abc. Here δ is
δ:  p_abk(t+1) = p_abk(t) - ξ(Δψ_abc)·p_abk(t)   for all k ≠ c
    p_abk(t+1) = maxmin(p_abk(t+1), 0, 1)        for all k ≠ c
    p_abc(t+1) = 1 - Σ_{k≠c} p_abk(t+1)
where Δψ_abc is the change of the orientation function value before and after the inverted pendulum, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O) and the state changes to s_d (∈ S); this change is used to judge the quality of the operation. ξ(·) is a monotonically increasing function with ξ(x) = 0 if and only if x = 0; r is the total number of operation rules and λ is the learning rate, i.e., the speed of each learning iteration; here r = 3 and λ = 0.02. p_abc(t) (a ∈ {0, 1, 2, ..., n_S}; b ∈ {0, 1, 2, ..., n_A}; c ∈ {1, 2, ..., n_O}) is the value at time t of the probability p(o_c | s_a ∩ a_b) that the inverted pendulum, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O). When Δψ_abc < 0, the orientation function value after executing o_c (∈ O) and transferring to s_d (∈ S) has become smaller, i.e., the orientation has worsened; then p_abc(t+1) < p_abc(t), meaning that the probability of choosing o_c (∈ O) at the next moment decreases. When Δψ_abc = 0, the orientation function value is unchanged, i.e., the orientation is unchanged; then p_abc(t+1) = p_abc(t), and the probability of choosing o_c (∈ O) at the next moment is unchanged. When Δψ_abc > 0, the orientation function value has become larger, i.e., the orientation has improved; then p_abc(t+1) > p_abc(t), meaning that the probability of choosing o_c (∈ O) at time t+1 increases.
Specifically, using the deterministic inverted pendulum model for the simulation: suppose at time t the input is θ(t) = 0.046, with the pendulum leaning to the left and the deflection angular acceleration directed to the right, so that the state of the inverted pendulum is the well-controlled state s_2. If, following the excitation probability p_203(t) = p(o_3 | s_2 ∩ a_0) = 0.335 in the rule set, the chosen operation is o_3, i.e., a push to the left, then from the inverted pendulum equation of motion
θ̈ = [m(m+M)gl / ((M+m)I + Mml²)]·θ - [ml / ((M+m)I + Mml²)]·o_k,  k = 1, 2, ..., n_O,
and the difference equations θ(t+1) = θ(t) + τ·θ̇(t), θ̇(t+1) = θ̇(t) + τ·θ̈(t), one obtains θ(t+1) = 0.031 at time t+1, i.e., the state of the inverted pendulum at time t+1 becomes the moderately-controlled state s_1. The orientation function value at this moment, ψ_10(s_1, a_0) = 1, is smaller than the value ψ_20(s_2, a_0) = 2 at time t, so p_203(t+1) < p_203(t); at the next learning iteration the probability of selecting the left-push operation o_3 under input a_0 and state s_2 therefore decreases, and correspondingly the probabilities of selecting the other two operations increase.
Here maxmin(p_abk(t+1), 0, 1) sets p_abk(t+1) = 1 when p_abk(t+1) > 1 and p_abk(t+1) = 0 when p_abk(t+1) < 0, which guarantees p_abk(t+1) ∈ [0, 1], and Σ_{k=1}^{n_O} p_abk(t+1) = 1, i.e., the probabilities of the different operations under the same input and the same state sum to 1. When t → ∞, if p_abc(t) → 1, then operation o_c (∈ O) is the optimal behavior in state s_a (∈ S) with input a_b (∈ A).
(4) Output to the outside world. The output function is defined as f_Z: z_m = s_i, i = 0, 1, 2, m = i. The state at time t+1 is output externally according to the output set Z = {z_0, z_1, z_2}, z_0 = s_0, z_1 = s_1, z_2 = s_2.
(5) Judge whether the stop condition of the experiment is reached. When, at time t+1, |θ| ≤ 0.005, the corresponding angular-velocity condition is satisfied, and p_abc(t+1) = p(o_c | s_a ∩ a_b) > 0.95, the inverted pendulum is considered able to realize its autonomous balance control through learning; from then on it keeps selecting operation o_c under this state and input until the iteration count Tf is reached. Otherwise steps (2)-(4) are repeated until the condition is satisfied; a sketch combining the earlier code fragments into such a control loop is given below.
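Tying the sketches above together, a minimal control loop follows. Holding the input at a_0, using ψ(s_i, a_0) = i from above, starting from θ(0) = 0.087 with zero initial angular velocity, and omitting any episode reset are simplifying assumptions; this is not the patent's reference implementation.

import numpy as np

rng = np.random.default_rng(0)
P = np.full((3, 1, 3), 1.0 / 3)              # p_ijk(0) = 1/3, single input a_0
theta, theta_dot = 0.087, 0.0                # initial 5-degree deflection
for t in range(1000):                        # Tf = 1000 learning iterations
    s = pendulum_state(theta)
    c = rng.choice(3, p=P[s, 0])             # step (2): draw a control force
    theta, theta_dot = pendulum_step(theta, theta_dot, FORCES[c],
                                     noise=1.5, rng=rng)        # stochastic model
    dpsi = pendulum_state(theta) - s         # psi(s_i, a_0) = i, so dpsi is the index change
    P[s, 0] = delta_update(P[s, 0], c, dpsi, lam=0.02, r=3)      # step (3)
print(P)                                     # learned operation preferences per state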
Fig. 6 and Fig. 7 show that, under the same conditions, the method based on the operant conditioning automaton model successfully controls the balance of the inverted pendulum with both the deterministic model and the stochastic model. Clearly, the introduction of random noise increases the learning difficulty of the stochastic model: on average, each trial achieves autonomous balance control of the inverted pendulum only after about 8800 iterations.

Claims (2)

1. An operant conditioning automaton (hereinafter abbreviated OCM), characterized in that it is an eight-tuple
OCM = <A, S, O, Z, R, f, ψ, δ>,
where
(1) the input symbol set of the OCM: A = {a_j | j = 0, 1, 2, ..., n_A}, a_j being the j-th input symbol of the OCM;
(2) the internal state set of the OCM: S = {s_i | i = 0, 1, 2, ..., n_S}, s_i being the i-th state symbol of the OCM;
(3) the internal operation set of the OCM: O = {o_k | k = 1, 2, ..., n_O}, o_k being the k-th operation symbol of the OCM;
(4) the output symbol set of the OCM: Z = {z_m | m = 0, 1, 2, ..., n_Z}, z_m being the m-th output symbol of the OCM;
(5) the rule set of the OCM: R = {r_ijk | i ∈ {0, 1, 2, ..., n_S}; j ∈ {0, 1, 2, ..., n_A}; k ∈ {1, 2, ..., n_O}}, each element r_ijk ∈ R representing a random "condition-operation" rule
r_ijk: s_i × a_j → o_k (p_ijk)
i.e., when the OCM is in state s_i (∈ S) with input a_j (∈ A), it executes operation o_k (∈ O) with probability p_ijk, where p_ijk = p(o_k | s_i ∩ a_j) is the probability that the OCM executes operation o_k in state s_i with input a_j, also called the excitation probability of rule r_ijk;
(6) the state space equations of the OCM:
f:  f_S: S(t) × A(t) × O(t) → S(t+1)
    f_Z: S(t) × A(t) × O(t) → Z(t+1)
where f_S is the state transition equation of the OCM: the state s(t+1) (∈ S) of the OCM at time t+1 is determined by the state s(t) (∈ S), the input a(t) (∈ A), and the operation o(t) (∈ O) at time t, and is independent of the states, inputs, and operations before time t; f_S is unknown, but the result of a state transition is observable by the OCM itself; f_Z is the output equation of the OCM: the output z(t+1) (∈ Z) at time t+1 is determined by the state s(t), the input a(t), and the operation o(t) at time t, and is independent of the states, inputs, and operations before time t; the output of the OCM is observable by the outside world;
(7) the state orientation function of the OCM: ψ: S × A → [h, q], where h is defined as the worst orientation function value and q as the best orientation function value; for any state s_i (∈ S) and input a_j (∈ A), ψ_ij = ψ(s_i, a_j) is the expectation value of the OCM for state s_i and input a_j; if ψ_ij < 0, s_i is called a negatively oriented state of the OCM under input a_j; if ψ_ij = 0, a zero-oriented state; if ψ_ij > 0, a positively oriented state;
(8) the operant conditioning learning rule δ of the OCM:
if the state of the OCM at time t is s(t) = s_a ∈ S and the input is a(t) = a_b ∈ A, an operation o(t) = o_c ∈ O is chosen according to the random "condition-operation" rules in the set R, and after the operation is executed the state at time t+1 is observed to be s(t+1) = s_d ∈ S, then, based on the operant conditioning principle, the excitation probabilities p_abk (k = 1, 2, ..., n_O) of the random "condition-operation" rules in R are adjusted according to
δ:  p_abk(t+1) = p_abk(t) - ξ(Δψ_abc)·p_abk(t)   for all k ≠ c
    p_abk(t+1) = maxmin(p_abk(t+1), 0, 1)        for all k ≠ c
    p_abc(t+1) = 1 - Σ_{k≠c} p_abk(t+1)
where Δψ_abc is the change of the orientation function value after the OCM, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O) and the state changes to s_d (∈ S), and this change is used to judge the quality of the operation; ξ(·) is a monotonically increasing function with ξ(x) = 0 if and only if x = 0; r is the total number of operation rules and λ is the learning rate, i.e., the speed of each learning iteration; p_abc(t) (a ∈ {0, 1, 2, ..., n_S}; b ∈ {0, 1, 2, ..., n_A}; c ∈ {1, 2, ..., n_O}) is the value at time t of the probability p(o_c | s_a ∩ a_b) that the OCM, in state s_a (∈ S) with input a_b (∈ A), executes operation o_c (∈ O); when Δψ_abc < 0, the orientation function value after executing o_c (∈ O) and transferring to s_d (∈ S) has become smaller, i.e., the orientation has worsened, so p_abc(t+1) < p_abc(t) and the probability of choosing o_c (∈ O) at the next moment decreases; when Δψ_abc = 0, the orientation function value is unchanged, i.e., the orientation is unchanged, so p_abc(t+1) = p_abc(t) and the probability of choosing o_c (∈ O) at the next moment is unchanged; when Δψ_abc > 0, the orientation function value has become larger, i.e., the orientation has improved, so p_abc(t+1) > p_abc(t) and the probability of choosing o_c (∈ O) at the next moment increases; maxmin(p_abk(t+1), 0, 1) sets p_abk(t+1) = 1 when p_abk(t+1) > 1 and p_abk(t+1) = 0 when p_abk(t+1) < 0, guaranteeing p_abk(t+1) ∈ [0, 1], and Σ_{k=1}^{n_O} p_abk(t+1) = 1, i.e., the probabilities of the different operations under the same input and the same state sum to 1; when t → ∞, if p_abc(t) → 1, operation o_c (∈ O) is the optimal behavior in state s_a (∈ S) with input a_b (∈ A); learning stops when the number of iterations is reached or when, in some state s_a (∈ S) with input a_b (∈ A), the probability p_abc(t) of executing operation o_c (∈ O) satisfies p_abc(t) ≥ p_ε, with p_ε ∈ [0.7, 1].
2. The application of the operant conditioning automaton according to claim 1 in biomimetic autonomous learning control, characterized in that it comprises the following steps:
(1) setting the initial conditions of the experiment: giving the initial state s(0) of the OCM, the initial input a(0) of the OCM, the learning rate λ, and the initial excitation probability p_ijk(0) = 1/r of each random "condition-operation" rule r_ijk (i ∈ {0, 1, 2, ..., n_S}; j ∈ {0, 1, 2, ..., n_A}; k ∈ {1, 2, ..., n_O}) in R; giving the number of learning iterations Tf or the optimal-behavior probability threshold p_ε;
(2) randomly selecting and executing an operation: according to the state s(t) ∈ S and input a(t) ∈ A of the OCM at time t and the values p_ijk(t) at time t of the excitation probabilities of the random "condition-operation" rules r_ijk (i ∈ {0, 1, 2, ..., n_S}; j ∈ {0, 1, 2, ..., n_A}; k ∈ {1, 2, ..., n_O}) in R, randomly selecting an operation o(t) ∈ O at time t following the probability distribution p_ijk(t) of the operations under the current state; if the state of the OCM at time t is s(t) = s_a, the input is a(t) = a_b, and the chosen operation at time t is o(t) = o_c, the state of the OCM makes a transition according to the state transition equation f_S: S(t) × A(t) × O(t) → S(t+1);
(3) operant conditioning: if the state s(t+1) = s_d ∈ S is observed at time t+1, the operant conditioning unit δ adjusts the excitation probability of the random "condition-operation" rule r_abc; its value at time t+1 is
δ:  p_abk(t+1) = p_abk(t) - ξ(Δψ_abc)·p_abk(t)   for all k ≠ c
    p_abk(t+1) = maxmin(p_abk(t+1), 0, 1)        for all k ≠ c
    p_abc(t+1) = 1 - Σ_{k≠c} p_abk(t+1)
where maxmin(p_abk(t+1), 0, 1) sets p_abk(t+1) = 1 when p_abk(t+1) > 1 and p_abk(t+1) = 0 when p_abk(t+1) < 0, guaranteeing p_abk(t+1) ∈ [0, 1], and Σ_{k=1}^{n_O} p_abk(t) = 1;
(4) outputting Z(t+1) to the outside world through the output equation f_Z: S(t) × A(t) × O(t) → Z(t+1) of the system;
(5) repeating steps (2)-(4) until the number of learning iterations Tf is reached or p_abc(t+1) > p_ε, at which point the experiment stops.
CN200910086990A 2009-06-12 2009-06-12 Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning Pending CN101673354A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910086990A CN101673354A (en) 2009-06-12 2009-06-12 Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910086990A CN101673354A (en) 2009-06-12 2009-06-12 Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning

Publications (1)

Publication Number Publication Date
CN101673354A true CN101673354A (en) 2010-03-17

Family

ID=42020573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910086990A Pending CN101673354A (en) 2009-06-12 2009-06-12 Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning

Country Status (1)

Country Link
CN (1) CN101673354A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103792846A (en) * 2014-02-18 2014-05-14 北京工业大学 Robot obstacle avoidance guiding method based on Skinner operating condition reflection principle
CN103792846B (en) * 2014-02-18 2016-05-18 北京工业大学 Based on the robot obstacle-avoiding air navigation aid of Skinner operant conditioning reflex principle
CN105094124A (en) * 2014-05-21 2015-11-25 防灾科技学院 Method and model for performing independent path exploration based on operant conditioning
CN104614988B (en) * 2014-12-22 2017-04-19 北京工业大学 Cognitive and learning method of cognitive moving system with inner engine
CN104614988A (en) * 2014-12-22 2015-05-13 北京工业大学 Cognitive and learning method of cognitive moving system with inner engine
CN104570738A (en) * 2014-12-30 2015-04-29 北京工业大学 Robot track tracing method based on Skinner operant conditioning automata
CN105205533A (en) * 2015-09-29 2015-12-30 华北理工大学 Development automatic machine with brain cognition mechanism and learning method of development automatic machine
CN105205533B (en) * 2015-09-29 2018-01-05 华北理工大学 Development automatic machine and its learning method with brain Mechanism of Cognition
CN109154798A (en) * 2016-05-09 2019-01-04 1Qb信息技术公司 For improving the method and system of the strategy of Stochastic Control Problem
CN109212975A (en) * 2018-11-13 2019-01-15 北方工业大学 A kind of perception action cognitive learning method with developmental mechanism
CN111580392A (en) * 2020-07-14 2020-08-25 江南大学 Finite frequency range robust iterative learning control method of series inverted pendulum
CN111736471A (en) * 2020-07-14 2020-10-02 江南大学 Iterative feedback setting control and robust optimization method of rotary inverted pendulum
CN111580392B (en) * 2020-07-14 2021-06-15 江南大学 Finite frequency range robust iterative learning control method of series inverted pendulum

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100317