CN101599137A - Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior - Google Patents
- Publication number
- CN101599137A (application CNA2009100892633A / CN200910089263A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
An autonomous operant conditioning automaton and its application in realizing intelligent behavior belong to the field of bionics. The autonomous operant conditioning automaton AOC is a discrete computational model of an autonomous automaton, mainly comprising an operation set, a state set, a "condition-operation" rule set, observable state transitions, and an operant conditioning learning law; in addition, a behavior (operating) entropy based on the AOC state orientation values is defined, and a recursive running procedure of the AOC is specified. The key feature of the AOC is that it simulates the operant conditioning mechanism of organisms and therefore has bionic self-organizing capabilities, including self-learning and self-adaptation. It can be used to describe, simulate and design various self-organizing systems, and in particular to describe, simulate and plan various intelligent behaviors of robot systems.
Description
Technical Field
The invention relates to an automaton, and in particular to a bionic automaton based on the principle of operant conditioning.
Background
Automaton models of learning systems were developed in the 1960s and are referred to as learning automata. In recent years the structure of learning automata has often had to be changed to meet different application requirements; such automata typically have both inputs and outputs. The invention is a self-organizing system based on Skinner's theory of operant conditioning and has self-learning and self-adaptation capabilities. Since the beginning of the 20th century, the study of animal learning has identified two forms of learning: one is classical conditioning, which shapes an organism's respondent behavior; the other is operant conditioning, which shapes an organism's operant behavior. These two types of reflexes are held to be two distinct coupling processes: the classical conditioned reflex is an S-R coupling process, while the operant conditioned reflex is an R-S coupling process.
In recent decades, academic interest in autonomous systems has grown year by year, and the body of literature on autonomous systems has likewise grown. The invention is an autonomous automaton: unlike a non-autonomous automaton, its output does not need to be driven by external instructions but is produced by the automaton according to its own needs. Related patents include: a picture generating method using a customer-operated automaton and the customer-operated automaton itself (application No. 98115560.X), and a regular-expression matching acceleration method based on a deterministic finite automaton with memory (application No. 200710071071.0); these all realize a given function through interaction between the automaton and the external environment. To date there has been no autonomous operant conditioning automaton.
The invention provides an abstract self-organizing model based on Skinner's operant conditioning theory, which is used to describe, simulate and design various self-organizing systems so that they exhibit self-learning and self-adaptive characteristics.
Disclosure of Invention
The invention provides an autonomous operant conditioning automaton which can be used to describe, simulate and design self-organizing systems (including self-learning and self-adaptation).
The operant conditioning automaton of the invention is a nine-tuple comprising: a discrete time, an operation symbol set, a state set, a random "condition-operation" rule set, a state transition function, an orientation function, an operant conditioning learning law, an operating entropy, and an initial state; a recursive running procedure of the AOC is also specified. The AOC is characterized in that it simulates the operant conditioning mechanism of organisms, thereby having bionic self-organizing capabilities, including self-learning and self-adaptation, and can be used to describe, simulate and design various self-organizing systems with interaction capabilities.
A general finite state automaton is a five-tuple FA = {A, Z, S, f, g}, where A is a finite input symbol set, S is a finite (internal) state symbol set (s(0) ∈ S being the initial state), Z is a finite output (acceptance-state) symbol set, f: S × A → S is the state transition function, and g: S → Z is the output function. The finite state automaton FA is a non-autonomous system.
The AOC and the finite state automaton FA are not equivalent: the operation symbols of the AOC are not equivalent to the input symbols of the FA, since the former represent the AOC's internal operations while the latter represent external instructions. The operation symbol set Ω of the AOC is therefore not the FA's input symbol set but the set of the AOC's internal operations, whereas the FA's input symbol set is in effect the set of instructions that can be supplied from outside. The AOC has no output symbol set and hence no output function. Does an AOC, as an autonomous system, need an output symbol set and an output function? An autonomous system may also need to act on the environment or on the objective world. Viewed from the form of the state-space equations, the output is a combination of states, or of states and operations; therefore the internal state set of the AOC can itself be regarded as the output symbol set, and the states of the AOC are observable. "The states of the AOC are observable" means that the AOC itself has receptors and can detect changes in its own state; it does not mean that the outside world can observe these quantities. The autonomous automaton also produces output, but this output does not need to be driven by external instructions; it is produced according to the automaton's own needs.
Compared with a non-autonomous automaton, the autonomous automaton has the advantage that its output does not need to be driven by external instructions; it can act on the environment according to its own needs. That is, even if the external environment changes, the autonomous automaton can still work normally, whereas a non-autonomous automaton must change its structural model or parameters to adapt to changes in the external environment. A non-autonomous system can always be converted into an autonomous system, so an autonomous operant conditioning automaton AOC can always be found that corresponds to a given non-autonomous operant conditioning automaton. The autonomous operant conditioning automaton AOC is therefore more widely applicable.
In information theory, entropy can be used as a measure of the uncertainty of an event. The more information a system has absorbed, the more regular its structure, the more complete its function, and the smaller its entropy. Using the concept of entropy, the measurement, transmission, transformation and storage of information can be studied theoretically. The invention introduces the concept of operating entropy and proves the convergence of the AOC operating entropy ψ(t); the self-organizing process of the system is a process of absorbing information, absorbing negative entropy and eliminating uncertainty, which clarifies the self-organizing property of the AOC, namely that the AOC has self-learning and self-adaptation capabilities.
The invention provides an autonomous operant conditioning automaton; Skinner's animal experiments are simulated to show that the automaton realizes the mechanism of operant conditioning learning, and balance control of a two-wheeled self-balancing robot is also realized, which shows that the AOC can be used to design various intelligent behaviors of robot systems.
The automaton of the invention is a nine-tuple autonomous operation conditional reflection automaton:
AOC=<t,Ω,S,Γ,δ,ε,η,ψ,s0>
wherein
(1) Discrete time of the AOC: t ∈ {0, 1, 2, …, n_t}; t = 0 is the starting time of the AOC;
(2) Operation symbol set of the AOC: Ω = {α_k | k = 1, 2, …, n_Ω}; α_k is the kth operation symbol of the AOC;
(3) State set of the AOC: S = {s_i | i = 0, 1, 2, …, n_S}; s_i is the ith state of the AOC;
(4) Operation rule set of the AOC: Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2, …, n_S}; k ∈ {1, 2, …, n_Ω}}; the random "condition-operation" rule r_ik(p): s_i → α_k(p) means that the AOC, under the condition that it is in state s_i ∈ S, performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), i.e. the probability that the AOC performs operation α_k given that it is in state s_i; P denotes the set of the p_ik;
(5) State transition function of the AOC: δ: S(t) × Ω(t) → S(t+1); the state s(t+1) ∈ S of the AOC at time t+1 is determined by the state s(t) ∈ S at time t and the operation α(t) ∈ Ω at time t, and is independent of the states and operations before time t; the state transition process determined by δ may be known or unknown, but the result of the state transition is observable;
(6) Orientation function of the AOC: ε: S → E = {ε_i | i = 0, 1, 2, …, n_S}; ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S;
(7) Operant conditioning learning law of the AOC: η adjusts the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ. Suppose the state at time t is s(t), an operation α(t) ∈ Ω is performed, and the state s(t+1) is observed at time t+1. According to Skinner's operant conditioning theory, if ε(s(t+1)) − ε(s(t)) < 0 then p(α(t) | s(t)) tends to decrease, whereas if ε(s(t+1)) − ε(s(t)) > 0 then p(α(t) | s(t)) tends to increase. At time t the AOC is in state s(t) = s_i and selects operation α(t) = α_k; according to the state transition function, the state at the next moment is s(t+1) = s_j. Simulating the operant conditioning mechanism of living beings, the probability of the current operation at the next moment t+1 is changed: it is increased by Δ, where Δ is related to the orientation increment (the larger the orientation increment, the better the result of the operation and the larger Δ). The probabilities of the remaining operations at time t+1 are each decreased by a share of Δ, the share of each being its proportion of the probability mass of the operations other than the one selected at time t, so that the decreases sum exactly to Δ. This guarantees that the probabilities of the operations sum to 1 at every moment. Formally: when s(t) = s_i, α(t) = α_k and s(t+1) = s_j, then p_ik(t+1) = p_ik(t) + Δ, and the probability of every other operation is updated as p_iu(t+1) = p_iu(t) − Δξ, where u denotes any value from 1 to n_Ω other than k. Here p_ik(t) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S, p_ik(t+1) is the same probability at time t+1, and 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value; Δ is obtained from ε⃗_ij through a monotonically increasing function that equals 0 if and only if its argument is 0; a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t),
where v ranges over all values from 1 to n_Ω other than k; Σ_{v≠k} p_iv(t) is the sum at time t of the probabilities of performing, under the condition that the AOC state is s_i ∈ S, the operations other than α_k; p_iu(t) is the probability at time t that the AOC performs operation α_u ∈ Ω given state s_i ∈ S, and p_iu(t+1) is the same probability at time t+1.
(8) Operating entropy of the AOC: ψ: P × E → R⁺, where R⁺ is the set of positive real numbers. The operating entropy ψ_i(t) of the AOC under the condition that its state at time t is s_i is determined by the operation probability set and the orientation function set under the condition s(t) = s_i:
ψ_i(t) = −Σ_{k=1}^{n_Ω} p_ik(t) log₂ p_ik(t).
Knowing the operating entropy of each state and weighting and summing them gives the operating entropy of the AOC at time t:
ψ(t) = Σ_{i=0}^{n_S} p(s_i) ψ_i(t) = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i).
(9) Initial state of AOC: s0=s(0)∈S。
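To make the learning law (7) and the operating entropy (8) concrete, a minimal Python sketch of one update step follows. It is an illustration only: the function names are invented here, and the concrete choice Δ = a × ε⃗_ij × (1 − p_ik) is taken from the embodiments described later rather than from the general definition.

```python
import math

def oc_update(p_i, k, eps_inc, a):
    """One operant-conditioning update of the operation probabilities p_i for the
    current state s_i, after performing operation alpha_k and observing the
    orientation increment eps_inc = eps(s_j) - eps(s_i).
    Assumes Delta = a * eps_inc * (1 - p_ik), the form used in the embodiments."""
    delta = a * eps_inc * (1.0 - p_i[k])
    delta = max(min(delta, 1.0 - p_i[k]), -p_i[k])   # keep 0 <= p_ik + Delta <= 1
    rest = sum(p for u, p in enumerate(p_i) if u != k)
    new_p = list(p_i)
    new_p[k] = p_i[k] + delta                        # reward raises p_ik, punishment lowers it
    for u in range(len(p_i)):
        if u != k and rest > 0:
            xi = p_i[u] / rest                       # xi = p_iu(t) / sum_{v != k} p_iv(t)
            new_p[u] = p_i[u] - delta * xi           # the decreases sum exactly to Delta
    return new_p

def operating_entropy(state_probs, op_probs):
    """psi(t) = -sum_i p(s_i) sum_k p(alpha_k|s_i) log2 p(alpha_k|s_i)."""
    return -sum(p_s * sum(p * math.log2(p) for p in row if p > 0)
                for p_s, row in zip(state_probs, op_probs))
```

For example, with p_i = [0.5, 0.5], a reward of ε⃗ = 1 for the first operation and a = 0.01, `oc_update` returns [0.505, 0.495], in line with the gradual drift toward the rewarded operation described above.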
The invention is characterized in that it simulates the operant conditioning mechanism of organisms and therefore has a bionic self-organizing capability, including self-learning and self-adaptation, and can be used to describe, simulate and design various self-organizing systems.
The autonomous operant conditioning automaton AOC of the invention runs recursively according to the following steps:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a, and set the initial operation probabilities p_ik(0) = 1/n_Ω (i = 0, 1, 2, …, n_S; k = 1, 2, …, n_Ω); set the stopping time T_f;
(2) Operation selection: according to the rules r_ik(p): s_i → α_k(p) of the "condition-operation" rule set Γ, i.e. the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ(s(t), α(t)) = δ(s_i, α_k);
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1), the result of the state transition is fully observable, i.e. there exists j ∈ {0, 1, 2, …, n_S} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
The flow chart of the method of the invention is shown in FIG. 2.
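A minimal Python sketch of steps (1)-(8) above is given below, under the assumption that the state transition function δ and the orientation values ε are supplied as dictionaries; it reuses the illustrative `oc_update` and `operating_entropy` helpers sketched earlier. The text does not say how the state probabilities p(s_i) entering ψ(t) are obtained, so empirical visit frequencies are used here purely as an assumption.

```python
import random

def run_aoc(states, ops, delta, eps, a=0.01, T_f=1000, s0=None):
    """Recursive AOC run: select an operation by its current probability, observe the
    transition, apply the operant-conditioning update, and record psi(t)."""
    p = {s: [1.0 / len(ops)] * len(ops) for s in states}       # (1) p_ik(0) = 1/n_Omega
    s = s0 if s0 is not None else random.choice(states)
    visits = {st: (1 if st == s else 0) for st in states}
    entropies = []
    for t in range(T_f):
        k = random.choices(range(len(ops)), weights=p[s])[0]   # (2) select an operation
        s_next = delta[(s, ops[k])]                            # (3)-(4) act and observe
        eps_inc = eps[s_next] - eps[s]                         # orientation increment
        p[s] = oc_update(p[s], k, eps_inc, a)                  # (5) operant conditioning
        visits[s_next] += 1
        total = sum(visits.values())
        state_probs = [visits[st] / total for st in states]    # assumed empirical p(s_i)
        entropies.append(operating_entropy(state_probs,        # (6) psi(t)
                                           [p[st] for st in states]))
        s = s_next                                             # (7) recursion
    return p, entropies                                        # (8) stop at T_f
```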
Drawings
FIG. 1 is a schematic structural diagram of an autonomous operation conditional reflection automaton according to the present invention;
t is the discrete time (1), Ω is the operation set of the α_k (k = 1, 2, …, n_Ω) (2), S is the state set of the s_i (i = 0, 1, 2, …, n_S) (3), Γ is the "condition-operation" rule set of the r_ik (i ∈ {0, 1, …, n_S}; k ∈ {1, 2, …, n_Ω}) (4), δ is the state transition function (5), ε is the orientation function (6), η is the operant conditioning learning law (7), ψ is the operating entropy (8), and s_0 is the initial state (9).
FIG. 2 is a flow chart of an AOC program of the autonomous operation conditional reflecting automaton;
FIG. 3 shows the operation behavior probability curves of the rat experiment;
FIG. 4 shows the operating entropy curve of the rat experiment;
FIG. 5 is an operation behavior probability curve of a machine pigeon;
FIG. 6 is an operation entropy curve of a machine pigeon experiment;
fig. 7 is a graph showing probability of each operation behavior of the two-wheeled self-balancing robot in an upright state, that is, when the deflection angle θ is 0 °;
FIG. 8 shows probability curves of operation behaviors of the two-wheeled self-balancing robot when the deflection angle is more than 0 degrees and less than 12 degrees;
fig. 9 is a graph showing probability of each operation behavior of the two-wheeled self-balancing robot when the deflection angle θ is 12 °;
FIG. 10 shows the probability curves of the operation behaviors of the two-wheeled self-balancing robot when the deflection angle satisfies −12° < θ < 0°;
fig. 11 is a graph showing probability of each operation behavior of the two-wheeled self-balancing robot when the deflection angle θ is-12 °;
FIG. 12 is an operation entropy curve of a two-round self-balancing robot experiment;
Examples
Embodiment 1: a minimal system, a learning rat, which simulates Skinner's rat experiment. Skinner's rat experiment, briefly: a white rat is placed in a Skinner box fitted with a lever; the box is constructed so as to exclude external stimuli as far as possible. The rat can move freely in the box; when it presses the lever, a portion of food drops into a tray below, and the rat can eat it. A device outside the box records the animal's actions. The rat learns to press the lever repeatedly and obtains the food reward through its own actions. Skinner's rat experiment is here realized by the autonomous operant conditioning automaton. The rat has two operation behaviors, pressing the lever, α_1, and not pressing the lever, α_2, i.e. the operation set is Ω = {α_1, α_2}, with probabilities denoted p_1 and p_2 respectively. Its state set is S = {s_0, s_1}, where s_0 denotes the hungry state and s_1 the non-hungry state. The operation rule set is Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1}; k ∈ {1, 2}}, where the random "condition-operation" rule r_ik(p): s_i → α_k(p) means that the AOC, in state s_i ∈ S, performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), i.e. the probability of performing operation α_k given state s_i. Its state transition function δ: S(t) × Ω(t) → S(t+1) is, specifically:
δ(s_0, α_1) = s_1, δ(s_0, α_2) = s_0, δ(s_1, α_1) = s_1, δ(s_1, α_2) = s_0. Its orientation function: ε: S → E = {ε_i | i = 0, 1}, ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S, and we define
Δ = a × ε⃗_ij × (1 − p_1).
where a is the learning rate and ε⃗_ij is the increment of the orientation value. The probabilities of the two behaviors at the initial moment are both 0.5. The rat obtains a reward whenever it presses the lever, so the probability of pressing the lever at the next moment increases, i.e. the probability that the rat selects lever pressing grows; the probability is updated according to the operant conditioning learning law η. With repeated learning, the probability p_1 that the rat selects lever pressing becomes larger and larger. The learning rate of the experiment is a = 0.01; after 668 learning steps the rat has learned to press the lever to obtain food, and as FIG. 3 shows, the probability p_1 that the rat presses the lever eventually tends to 1. During the experiment, according to the defined operating-entropy formula
ψ(t) = −Σ_{i=0}^{n_S} p_i Σ_{k=1}^{n_Ω} p_ik log₂ p_ik = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i),
the operating entropy at each moment is calculated. The operating entropy ψ(t) of the AOC becomes smaller and smaller over time and tends to its minimum as t → ∞ (see FIG. 4), which shows that the operating entropy ψ(t) of the AOC converges. The AOC is a self-organizing system based on Skinner's operant conditioning theory and has self-learning and self-adaptation capabilities. The self-organizing process of the system is a process of absorbing information, absorbing negative entropy and eliminating uncertainty. Since the convergence of the AOC operating entropy ψ(t) has been demonstrated, the self-organizing property of the AOC is thereby clarified.
The specific implementation steps of the experiment are as follows:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a = 0.01, and set the initial operation probabilities p_ik(0) = 0.5 (i = 0, 1; k = 1, 2); set the stopping time T_f = 1000;
(2) Operation selection: according to the "condition-operation" rule set Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1}; k ∈ {1, 2}}, where the random rule r_ik(p): s_i → α_k(p) means that the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ: S(t) × Ω(t) → S(t+1), specifically δ(s_0, α_1) = s_1, δ(s_0, α_2) = s_0, δ(s_1, α_1) = s_1, δ(s_1, α_2) = s_0;
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1); the transition process may be known or unknown, but its result is fully observable, i.e. there exists j ∈ {0, 1} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
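As an illustration only, the rat embodiment above can be plugged into the hypothetical `run_aoc` sketch from earlier as follows; the numerical orientation values ε(s_0) and ε(s_1) are not given in the text, so the values below (hungry oriented lower than fed) are assumptions.

```python
# Rat embodiment: 2 states, 2 operations (press the lever alpha_1, do not press alpha_2).
states = ["s0_hungry", "s1_fed"]
ops = ["press", "no_press"]
delta = {("s0_hungry", "press"): "s1_fed",  ("s0_hungry", "no_press"): "s0_hungry",
         ("s1_fed",    "press"): "s1_fed",  ("s1_fed",    "no_press"): "s0_hungry"}
eps = {"s0_hungry": 0.0, "s1_fed": 1.0}      # assumed orientation values, not given in the text

p, psi = run_aoc(states, ops, delta, eps, a=0.01, T_f=1000, s0="s0_hungry")
print(p["s0_hungry"][0])                     # probability of "press" should approach 1, cf. FIG. 3
print(psi[-1] < psi[0])                      # the operating entropy decreases, cf. FIG. 4
```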
Embodiment 2: a machine pigeon with learning ability, which simulates Skinner's pigeon experiment. In that experiment the pigeon is fed when it pecks the red button (positive reinforcement), receives no stimulus when it pecks the yellow button, and receives an electric shock when it pecks the blue button (negative reinforcement); at the beginning the pigeon pecks the red, yellow and blue buttons at random. After a while the pigeon pecks the red button significantly more often than the other two buttons. For the machine pigeon we define a 3-operation, 3-state autonomous operant conditioning automaton with operation set Ω = {α_0, α_1, α_2}, whose elements are pecking the red button α_0, pecking the yellow button α_1 and pecking the blue button α_2, with probabilities denoted p_0, p_1 and p_2 respectively. The state set is S = {s_0, s_1, s_2}, i.e. the non-hungry state s_0, the half-hungry state s_1 and the hungry state s_2. The state transition rules are:
δ(s0×α0)=s0 δ(s0×α1)=s1 δ(s0×α2)=s1
δ(s1×α0)=s0 δ(s1×α1)=s2 δ(s1×α2)=s2
δ(s2×α0)=s1 δ(s2×α1)=s2 δ(s2×α2)=s2
shown in tabular form in Table 1 below. Its orientation function: ε: S → E = {ε_i | i = 0, ±0.5, ±1}, ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S, and we define
Δ = a × ε⃗_ij × (1 − p_0).
The orientations are: s_0 → s_0: zero orientation (ε⃗_00 = 0); s_0 → s_1: zero orientation (ε⃗_01 = 0); s_1 → s_0: positive orientation (ε⃗_10 = 0.5); s_1 → s_2: negative orientation (ε⃗_12 = −0.5); s_2 → s_1: positive orientation (ε⃗_21 = 1.0); s_2 → s_2: negative orientation (ε⃗_22 = −1.0).
According to the operant conditioning learning law η: if the current operation is rewarded (ε⃗_ij > 0), its implementation probability tends to increase and the implementation probabilities of the other operations decrease accordingly; if the current operation is neither rewarded nor punished (ε⃗_ij = 0), the probabilities of all operations remain unchanged; if the current operation is punished (ε⃗_ij < 0), its implementation probability tends to decrease and the implementation probabilities of the other operations increase accordingly. The initial probability of each operation is 1/3; after about 5000 learning steps the machine pigeon essentially pecks only the red button and no longer the yellow or blue buttons. As FIG. 5 shows, the probability p_0 that the machine pigeon pecks the red button tends to 1, while the probability p_1 of pecking the yellow button and the probability p_2 of pecking the blue button tend to 0.
TABLE 1 State transition of machine pigeons
During the experiment, the operating entropy at each moment is calculated according to the defined operating-entropy formula (see FIG. 6).
The specific implementation steps of the experiment are as follows:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a = 0.01, and set the initial operation probabilities p_ik(0) = 1/3 (i = 0, 1, 2; k = 0, 1, 2); set the stopping time T_f = 5000;
(2) Operation selection: according to the "condition-operation" rule set Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2}; k ∈ {0, 1, 2}}, where the random rule r_ik(p): s_i → α_k(p) means that the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ: S(t) × Ω(t) → S(t+1), specifically:
δ(s0×α0)=s0 δ(s0×α1)=s1 δ(s0×α2)=s1
δ(s1×α0)=s0 δ(s1×α1)=s2 δ(s1×α2)=s2
δ(s2×α0)=s1 δ(s2×α1)=s2 δ(s2×α2)=s2
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1); whether the transition process is known or unknown, its result is fully observable, i.e. there exists j ∈ {0, 1, 2} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
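For reference, the transition and orientation tables of the machine-pigeon embodiment can be written as plain Python dictionaries, as in the sketch below. Since the text specifies the orientation increments per transition rather than per state, a run loop would look up ε⃗ by the pair (s(t), s(t+1)) instead of differencing state orientation values; the dictionary names and operation labels are illustrative assumptions.

```python
# Machine-pigeon embodiment: 3 operations (peck red / yellow / blue), 3 hunger states.
states = ["s0", "s1", "s2"]                  # non-hungry, half-hungry, hungry
ops = ["red", "yellow", "blue"]              # alpha_0, alpha_1, alpha_2
delta = {("s0", "red"): "s0", ("s0", "yellow"): "s1", ("s0", "blue"): "s1",
         ("s1", "red"): "s0", ("s1", "yellow"): "s2", ("s1", "blue"): "s2",
         ("s2", "red"): "s1", ("s2", "yellow"): "s2", ("s2", "blue"): "s2"}
# Orientation increments are given per transition in the text, not per state:
eps_inc = {("s0", "s0"): 0.0, ("s0", "s1"): 0.0,
           ("s1", "s0"): 0.5, ("s1", "s2"): -0.5,
           ("s2", "s1"): 1.0, ("s2", "s2"): -1.0}
# A run loop would use eps_inc[(s, s_next)] in place of eps[s_next] - eps[s],
# with Delta = a * eps_inc * (1 - p_0), learning rate a = 0.01 and T_f = 5000 as stated above.
```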
Embodiment 3: balance control of a two-wheeled self-balancing robot realized by the autonomous operant conditioning automaton. The two-wheeled upright robot can move freely left and right on flat ground; when the deflection angle exceeds ±12° the robot loses its balance. The state set of the AOC designed for this purpose is based on the robot's deflection angle and comprises 6 states: θ = 0°, 0° < θ < 12°, θ = 12°, −12° < θ < 0°, θ = −12° and |θ| > 12°, denoted s_0, s_1, s_2, s_3, s_4 and s_5 respectively; thus the state set is S = {s_0, s_1, s_2, s_3, s_4, s_5}. Its operation set Ω = {α_0, α_1, α_2} comprises not moving, α_0, moving to the right, α_1, and moving to the left, α_2. The state transition rules are as follows:
δ(s0×α0)=s0 δ(s0×α1)=s3 δ(s0×α2)=s1
δ(s1×α0)=s2 δ(s1×α1)=s0 δ(s1×α2)=s2
δ(s2×α0)=s5 δ(s2×α1)=s1 δ(s2×α2)=s5
δ(s3×α0)=s4 δ(s3×α1)=s4 δ(s3×α2)=s0
δ(s4×α0)=s5 δ(s4×α1)=s5 δ(s4×α2)=s3
see table 2. Its orientation function: epsilon: s → E ═ εi|i=0,±0.5,±1},εi=ε(si) E is as state siE.g. orientation value of S, at the same time
Δ = a × ε⃗_ij × (1 − p_ik).
The orientations are: s_0 → s_0: positive orientation (ε⃗_00 = 0); s_0 → s_3: zero orientation (ε⃗_03 = 0); s_0 → s_1: zero orientation (ε⃗_01 = 0); s_1 → s_0: positive orientation (ε⃗_10 = 1.0); s_1 → s_2: negative orientation (ε⃗_12 = −0.5); s_2 → s_1: positive orientation (ε⃗_21 = 1.0); s_2 → s_5: negative orientation (ε⃗_25 = −1.0); s_3 → s_4: negative orientation (ε⃗_34 = −0.5); s_3 → s_0: positive orientation (ε⃗_30 = 1.0); s_4 → s_5: negative orientation (ε⃗_45 = −1.0); s_4 → s_3: positive orientation (ε⃗_43 = 1.0).
where p_ik denotes the probability that the robot performs operation α_k in state s_i. The probabilities are continuously updated according to the operant conditioning learning law η. The initial probability of each operation is 1/3; after about 1500 learning steps the robot selects the good operation in each state with probability close to 1 and keeps its balance, and in the first five states it generally selects the operation that drives θ toward 0°, as can be seen from FIGs. 7-11. During the experiment, the operating entropy at each moment is calculated according to the defined operating-entropy formula (see FIG. 12).
TABLE 2 State transition and orientation mechanism for two-wheeled self-balancing robot
The specific implementation steps of the experiment are as follows:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a = 0.01, and set the initial operation probabilities p_ik(0) = 1/3 (i = 0, 1, 2, 3, 4; k = 0, 1, 2); set the stopping time T_f = 1500;
(2) Operation selection: according to the "condition-operation" rule set Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2, 3, 4}; k ∈ {0, 1, 2}}, where the random rule r_ik(p): s_i → α_k(p) means that the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ: S(t) × Ω(t) → S(t+1), specifically:
δ(s0×α0)=s0 δ(s0×α1)=s3 δ(s0×α2)=s1
δ(s1×α0)=s2 δ(s1×α1)=s0 δ(s1×α2)=s2
δ(s2×α0)=s5 δ(s2×α1)=s1 δ(s2×α2)=s5
δ(s3×α0)=s4 δ(s3×α1)=s4 δ(s3×α2)=s0
δ(s4×α0)=s5 δ(s4×α1)=s5 δ(s4×α2)=s3
a transition occurs;
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1); whether the transition process is known or unknown, its result is fully observable, i.e. there exists j ∈ {0, 1, 2, 3, 4} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
Here the optimal operation differs from state to state, so the probability of each operation in each state is maintained separately, giving 15 probabilities in total (5 states × 3 operations).
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
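The same illustrative dictionary layout applies to the two-wheeled self-balancing robot embodiment; the sketch below transcribes the transition rules and orientation increments listed above, with the operation labels `stay`, `right` and `left` chosen here only for readability.

```python
# Two-wheeled self-balancing robot: deflection-angle states s0..s5, 3 operations.
states = ["s0", "s1", "s2", "s3", "s4", "s5"]   # 0, (0,12), 12, (-12,0), -12 degrees, |theta|>12
ops = ["stay", "right", "left"]                 # alpha_0, alpha_1, alpha_2
delta = {("s0", "stay"): "s0", ("s0", "right"): "s3", ("s0", "left"): "s1",
         ("s1", "stay"): "s2", ("s1", "right"): "s0", ("s1", "left"): "s2",
         ("s2", "stay"): "s5", ("s2", "right"): "s1", ("s2", "left"): "s5",
         ("s3", "stay"): "s4", ("s3", "right"): "s4", ("s3", "left"): "s0",
         ("s4", "stay"): "s5", ("s4", "right"): "s5", ("s4", "left"): "s3"}
eps_inc = {("s0", "s0"): 0.0, ("s0", "s3"): 0.0, ("s0", "s1"): 0.0,
           ("s1", "s0"): 1.0, ("s1", "s2"): -0.5,
           ("s2", "s1"): 1.0, ("s2", "s5"): -1.0,
           ("s3", "s4"): -0.5, ("s3", "s0"): 1.0,
           ("s4", "s5"): -1.0, ("s4", "s3"): 1.0}
# 5 states with outgoing rules x 3 operations = 15 probabilities p_ik, each updated with
# Delta = a * eps_inc * (1 - p_ik), learning rate a = 0.01, stopping time T_f = 1500.
```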
Claims (2)
1. An autonomous operant conditioning automaton, AOC for short, which is a nine-tuple AOC = &lt;t, Ω, S, Γ, δ, ε, η, ψ, s_0&gt;,
wherein
(1) Discrete time of the AOC: t ∈ {0, 1, 2, …, n_t}; t = 0 is the starting time of the AOC;
(2) Operation symbol set of the AOC: Ω = {α_k | k = 1, 2, …, n_Ω}; α_k is the kth operation symbol of the AOC;
(3) State set of the AOC: S = {s_i | i = 0, 1, 2, …, n_S}; s_i is the ith state of the AOC;
(4) Operation rule set of the AOC: Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2, …, n_S}; k ∈ {1, 2, …, n_Ω}}; the random "condition-operation" rule r_ik(p): s_i → α_k(p) means that the AOC, under the condition that it is in state s_i ∈ S, performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), i.e. the probability that the AOC performs operation α_k given that it is in state s_i; P denotes the set of the p_ik;
(5) State transition function of the AOC: δ: S(t) × Ω(t) → S(t+1); the state s(t+1) ∈ S of the AOC at time t+1 is determined by the state s(t) ∈ S at time t and the operation α(t) ∈ Ω at time t, and is independent of the states and operations before time t; the state transition process determined by δ may be known or unknown, but the result of the state transition is observable;
(6) Orientation function of the AOC: ε: S → E = {ε_i | i = 0, 1, 2, …, n_S}; ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S;
(7) Operant conditioning learning law of the AOC: η simulates the biological operant conditioning mechanism and adjusts the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ: assuming the state at time t is s(t) = s_i, an operation α(t) = α_k ∈ Ω is performed, and the state observed at time t+1 is s(t+1) = s_j, then the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k); where p_ik(t) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S, p_ik(t+1) is the same probability at time t+1, and 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value; Δ is obtained from ε⃗_ij through a monotonically increasing function that equals 0 if and only if its argument is 0; a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t),
where u denotes any value from 1 to n_Ω other than k; Σ_{v≠k} p_iv(t) is the sum at time t of the probabilities of performing, under the condition that the AOC state is s_i ∈ S, the operations other than α_k, v ranging over all values from 1 to n_Ω other than k; p_iu(t) is the probability at time t that the AOC performs operation α_u ∈ Ω given state s_i ∈ S, and p_iu(t+1) is the same probability at time t+1;
(8) Operating entropy of the AOC: ψ: P × E → R⁺, where R⁺ is the set of positive real numbers. The operating entropy ψ_i(t) of the AOC under the condition that its state at time t is s_i is determined by the operation probability set and the orientation function set under the condition s(t) = s_i: ψ_i(t) = −Σ_{k=1}^{n_Ω} p_ik(t) log₂ p_ik(t); knowing the operating entropy of each state and weighting and summing them gives the operating entropy of the AOC at time t:
ψ(t) = −Σ_{i=0}^{n_S} p_i Σ_{k=1}^{n_Ω} p_ik log₂ p_ik = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i);
where p(s_i) is the probability at time t that the AOC state s_i ∈ S occurs, and p(α_k | s_i) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S;
(9) initial state of AOC: s0=s(0)∈S。
2. The autonomous operant conditioning automaton AOC according to claim 1, characterized in that it runs recursively according to the following steps:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a, and set the initial operation probabilities p_ik(0) = 1/n_Ω (i = 0, 1, 2, …, n_S; k = 1, 2, …, n_Ω); set the stopping time T_f;
(2) Operation selection: according to the rules r_ik(p): s_i → α_k(p) of the "condition-operation" rule set Γ, i.e. the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ(s(t), α(t)) = δ(s_i, α_k);
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1), the result of the state transition is fully observable, i.e. there exists j ∈ {0, 1, 2, …, n_S} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: according to the defined operating-entropy formula
ψ(t) = −Σ_{i=0}^{n_S} p_i Σ_{k=1}^{n_Ω} p_ik log₂ p_ik = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i),
the operating entropy at time t is calculated, where p(s_i) is the probability at time t that the AOC state s_i ∈ S occurs, and p(α_k | s_i) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2009100892633A CN101599137A (en) | 2009-07-15 | 2009-07-15 | Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101599137A true CN101599137A (en) | 2009-12-09 |
Family
ID=41420574
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2009100892633A Pending CN101599137A (en) | 2009-07-15 | 2009-07-15 | Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101599137A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103792846A (en) * | 2014-02-18 | 2014-05-14 | 北京工业大学 | Robot obstacle avoidance guiding method based on Skinner operating condition reflection principle |
CN103792846B (en) * | 2014-02-18 | 2016-05-18 | 北京工业大学 | Based on the robot obstacle-avoiding air navigation aid of Skinner operant conditioning reflex principle |
CN105094124A (en) * | 2014-05-21 | 2015-11-25 | 防灾科技学院 | Method and model for performing independent path exploration based on operant conditioning |
CN104614988B (en) * | 2014-12-22 | 2017-04-19 | 北京工业大学 | Cognitive and learning method of cognitive moving system with inner engine |
CN104614988A (en) * | 2014-12-22 | 2015-05-13 | 北京工业大学 | Cognitive and learning method of cognitive moving system with inner engine |
CN104570738A (en) * | 2014-12-30 | 2015-04-29 | 北京工业大学 | Robot track tracing method based on Skinner operant conditioning automata |
CN105205533A (en) * | 2015-09-29 | 2015-12-30 | 华北理工大学 | Development automatic machine with brain cognition mechanism and learning method of development automatic machine |
CN105205533B (en) * | 2015-09-29 | 2018-01-05 | 华北理工大学 | Development automatic machine and its learning method with brain Mechanism of Cognition |
WO2017114130A1 (en) * | 2015-12-31 | 2017-07-06 | 深圳光启合众科技有限公司 | Method and device for obtaining state of robot |
CN106926236A (en) * | 2015-12-31 | 2017-07-07 | 深圳光启合众科技有限公司 | The method and apparatus for obtaining the state of robot |
CN106926236B (en) * | 2015-12-31 | 2020-06-30 | 深圳光启合众科技有限公司 | Method and device for acquiring state of robot |
CN108846477A (en) * | 2018-06-28 | 2018-11-20 | 上海浦东发展银行股份有限公司信用卡中心 | A kind of wisdom brain decision system and decision-making technique based on reflex arc |
CN108846477B (en) * | 2018-06-28 | 2022-06-21 | 上海浦东发展银行股份有限公司信用卡中心 | Intelligent brain decision system and decision method based on reflection arcs |
CN109212975A (en) * | 2018-11-13 | 2019-01-15 | 北方工业大学 | A kind of perception action cognitive learning method with developmental mechanism |
CN111464707A (en) * | 2020-03-30 | 2020-07-28 | 中国建设银行股份有限公司 | Outbound call processing method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20091209 |